Creating a Batch Flow Using GCS, Dataflow, and BigQuery

Lab Details:

  1. This lab walks you through creating a batch workflow (pipeline) using Cloud Storage, Cloud Dataflow, and BigQuery.

  2. Duration: 60 minutes

Note: Do not refresh the page after you click Start Lab; wait a few seconds for the credentials to appear.
If Google asks for verification during login, enter your mobile number and verify with the OTP. Don't worry, this Google account will be deleted after the lab.

What is a Batch Flow/Pipeline:

You can think of a batch as a collection of similar tasks or jobs intended to be completed at a certain time. A batch flow is started by human intervention or, in GCP, sometimes by Cloud Scheduler. Broadly, batch flows are used to ingest data, perform the required changes or scans, and then send the data to a sink (the destination). The cost of running a batch flow depends on the storage used and the compute used, i.e. the nodes / Compute Engine instances.
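The ingest-transform-sink shape described above can be sketched in a few lines of plain Python. This is purely illustrative (the function names are invented, not any GCP API); in the lab, Cloud Storage plays the source, Dataflow the transform, and BigQuery the sink.

```python
# Minimal sketch of a batch flow: ingest -> transform -> load to sink.
# All names here are illustrative, not a real GCP API.

def ingest(source):
    """Read every record from the source in one pass (batch semantics)."""
    return list(source)

def transform(records):
    """Apply the required change to each record."""
    return [r.strip().upper() for r in records]

def load(records, sink):
    """Write the processed records to the destination; return the count."""
    sink.extend(records)
    return len(records)

source = ["alpha\n", "beta\n", "gamma\n"]   # stands in for files in a bucket
sink = []                                   # stands in for a BigQuery table
loaded = load(transform(ingest(source)), sink)
print(loaded, sink)   # 3 ['ALPHA', 'BETA', 'GAMMA']
```

The key batch property: the whole input is available up front and is processed in one run, rather than element by element as it arrives.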

Batch Flow vs Continuous/Streaming Flow:

  1. A batch flow has lower operational expenditure than a continuous flow.

  2. A batch flow lets you chain a set of batches into a pipeline in which each batch depends on the previous one.

  3. A batch flow needs human intervention (or a scheduler) to start and stop, whereas a continuous flow does not.

  4. Streaming flows are more efficient than batch flows for continuously arriving data, as they keep the state of the data.
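Point 4 above, that streaming flows keep state, can be illustrated with a toy comparison: a batch job recomputes over the complete input each run, while a streaming consumer updates retained state as each element arrives. This is a conceptual sketch in plain Python, not a real streaming API.

```python
def batch_count(all_records):
    """Batch: the complete input exists up front; no state is retained."""
    return len(all_records)

class StreamingCounter:
    """Streaming: hold state and update it per arriving element."""
    def __init__(self):
        self.state = 0
    def on_element(self, record):
        self.state += 1          # incremental update, no full recompute
        return self.state

stream = StreamingCounter()
for rec in ["a", "b", "c"]:      # elements "arriving" one at a time
    stream.on_element(rec)

print(batch_count(["a", "b", "c"]), stream.state)   # 3 3
```

Both arrive at the same answer, but the streaming version never needs the full input in hand, which is why it suits unbounded sources like Pub/Sub.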

What is Dataflow?

Cloud Dataflow is a fully managed, serverless data processing service: you just submit a job, and Dataflow takes care of the rest. Behind the scenes, when you submit a job, Dataflow spins up a cluster of virtual machines, distributes the tasks in your job across the VMs, and dynamically scales the cluster based on how the job is performing.

Dataflow supports both batch and streaming jobs. It integrates with Pub/Sub for stream processing, and with services like BigQuery and Cloud Storage for batch processing.
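In a GCS-to-BigQuery batch job like the one in this lab, the core of the pipeline is a per-record transform that maps each text line from the bucket to a table row. A rough pure-Python sketch of that step follows; the field names and input format are invented for illustration and will differ from the lab's actual files.

```python
import json

# Hypothetical schema for an input CSV file; adjust to your data.
FIELDS = ["id", "name", "score"]

def line_to_row(line):
    """Map one CSV line from Cloud Storage to a BigQuery-style row dict.
    This mirrors the per-record transform a Dataflow job applies."""
    values = line.rstrip("\n").split(",")
    row = dict(zip(FIELDS, values))
    row["score"] = int(row["score"])   # simple type coercion
    return row

batch = ["1,alice,90\n", "2,bob,85\n"]     # stands in for a file in GCS
rows = [line_to_row(line) for line in batch]
print(json.dumps(rows))
```

Dataflow's value is running exactly this kind of per-record function in parallel across many VMs, with scaling and retries handled for you.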

What is BigQuery?

  • BigQuery is a fully managed big data tool for companies that need a cloud-based, interactive query service for massive datasets.

  • BigQuery is not a traditional database; it's a query service.

  • BigQuery supports SQL queries, which makes it quite user-friendly. It can be accessed from the Console, the CLI, or an SDK. You can query billions of rows, and results come back in seconds.

  • You can also use its REST API and get your work done by sending JSON requests.

  • Let's understand with the help of an example. Suppose you are a data analyst who needs to analyze tons of data. If you choose a tool like traditional MySQL, you need to have infrastructure ready that can store this huge volume of data.

  • Designing this infrastructure is itself a difficult task, because you have to figure out RAM size, CPU type, and other configurations.

  • With BigQuery, you can focus on analysis rather than on infrastructure; the hardware is completely abstracted away.

  • BigQuery is mainly for big data analytics. You shouldn't confuse it with an OLTP (Online Transaction Processing) database.
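To get a feel for the SQL mentioned above, here is a simple aggregate query of the kind you would type into the BigQuery console. So the demo is self-contained and runnable, it executes against an in-memory SQLite database as a stand-in for BigQuery; the table and column names are invented.

```python
import sqlite3

# Stand-in for a BigQuery table; the query itself is plain SQL of the
# sort you would run in the BigQuery console in the last lab task.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (user TEXT, score INTEGER)")
conn.executemany("INSERT INTO results VALUES (?, ?)",
                 [("a", 90), ("b", 70), ("c", 85)])

query = "SELECT COUNT(*), AVG(score) FROM results WHERE score >= 80"
count, avg = conn.execute(query).fetchone()
print(count, avg)   # 2 87.5
```

The difference in practice is scale: BigQuery runs the same style of query over billions of rows instead of three.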

Terms related to BigQuery:

  • Datasets: Datasets hold one or more tables of data.

  • Tables: Row-column structures that hold the actual data.

  • Jobs: Operations that you perform on the data, such as loading data, running queries, or exporting data.
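The three terms above fit together as a simple hierarchy, which the toy model below sketches: a dataset holds tables, and a "job" is an operation (here, a query) run against a table. This is purely illustrative; the real service is accessed via SQL or the client libraries, and all names here are made up.

```python
# Toy model of the BigQuery hierarchy: dataset -> tables -> rows.
dataset = {
    "lab_dataset": {                       # a dataset holds tables
        "results": [                       # a table holds rows
            {"user": "a", "score": 90},
            {"user": "b", "score": 70},
        ],
    },
}

def query_job(tables, table_name, predicate):
    """A 'job': an operation on the data, here scanning one table
    and returning the rows that match a predicate."""
    return [row for row in tables[table_name] if predicate(row)]

high = query_job(dataset["lab_dataset"], "results",
                 lambda r: r["score"] >= 80)
print(high)   # [{'user': 'a', 'score': 90}]
```

Loading data and exporting data are jobs in exactly the same sense: named operations the service runs against tables in a dataset.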

Lab Tasks:

  1. Create a Bucket and Upload the required Files.

  2. Create BigQuery Dataset 

  3. Create Batch Pipeline from Dataflow

  4. Analyze data in BigQuery

 




Open Console