This lab walks you through creating a batch workflow, or pipeline, using Cloud Storage, Cloud Dataflow, and BigQuery.
Duration: 60 minutes
What is a Batch Flow/Pipeline:
You can think of a batch as a collection of similar tasks or jobs that are intended to be completed by a certain time. A batch flow is started by human intervention or, sometimes, by Cloud Scheduler in GCP. Broadly, batch flows are used to ingest data, perform the required transformations or scans, and then send the data to a sink (the destination). The cost of running a batch flow depends on the storage used and the computational power used, i.e. the worker nodes / Compute Engine instances.
Batch Flow vs Continuous/Streaming Flow:
Batch flows are cheaper in operational expenditure than continuous flows.
A batch flow lets you execute a pipeline as a series of batches in which each batch depends on the previous one.
A batch flow needs human intervention to start and stop, whereas a continuous flow does not.
Streaming flows are more efficient than batch flows for low-latency processing because they maintain the state of the data as it arrives.
Cloud Dataflow is a fully managed, serverless data processing service: you just submit a job, and Dataflow takes care of the rest. Behind the scenes, when you submit a job to Cloud Dataflow, it spins up a cluster of virtual machines, distributes the tasks in your job across those VMs, and dynamically scales the cluster based on how the job is performing.
Dataflow supports both batch and streaming jobs. It integrates with Pub/Sub for stream processing and with services such as BigQuery and Cloud Storage for batch processing.
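To make this concrete, here is a minimal sketch of such a batch pipeline written with the Apache Beam Python SDK (the programming model Dataflow executes). The project ID, bucket, table, and schema below are placeholders for illustration, not values from this lab.

```python
# Minimal Apache Beam batch pipeline: read lines from Cloud Storage,
# transform them, and write the results to BigQuery.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",              # use "DirectRunner" to test locally
    project="my-project-id",              # placeholder project
    region="us-central1",
    temp_location="gs://my-bucket/temp",  # placeholder bucket
)

def parse_csv(line):
    """Turn one 'name,score' CSV line into a dict matching the table schema."""
    name, score = line.split(",")
    return {"name": name, "score": int(score)}

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.csv")
        | "Parse" >> beam.Map(parse_csv)
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project-id:my_dataset.my_table",
            schema="name:STRING,score:INTEGER",
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```

Running the same pipeline with the DirectRunner first is a cheap way to validate the logic before Dataflow spins up billable workers.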
BigQuery is a fully managed big data tool for companies that need a cloud-based interactive query service for massive datasets.
BigQuery is not a database; it is a query service.
BigQuery supports SQL queries, which makes it quite user-friendly. You can access it from the Console, the CLI, or an SDK, and you can query billions of rows: queries take seconds to write and seconds to return.
You can also use its REST API and get your work done by sending a JSON request.
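As an example, here is a small sketch using the google-cloud-bigquery Python client, which calls that REST API under the hood. The project ID is a placeholder; the query runs against a BigQuery public dataset.

```python
# Run a SQL query against BigQuery from Python.
# Assumes Application Default Credentials are already configured.
from google.cloud import bigquery

client = bigquery.Client(project="my-project-id")  # placeholder project

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""

# client.query() submits a query job; iterating result() waits for it to finish.
for row in client.query(query).result():
    print(f"{row.name}: {row.total}")
```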
Let’s understand with the help of an example. Suppose you are a data analyst and you need to analyze tons of data. If you choose a tool like traditional MySQL, you need to have infrastructure ready that can store this huge amount of data.
Designing this infrastructure is itself a difficult task, because you have to figure out the RAM size, CPU type, and other configuration.
With BigQuery, the hardware is completely abstracted away, so you can focus on analysis rather than on infrastructure.
BigQuery is mainly for big data analytics. You shouldn’t confuse it with an OLTP (Online Transaction Processing) database.
Datasets: Datasets hold one or more tables of data.
Tables: Tables are row-column structures that hold the actual data.
Jobs: Operations that you perform on the data, such as loading data, running queries, or exporting data.
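To make these three concepts concrete, here is a hedged sketch using the google-cloud-bigquery Python client; every ID and schema field below is a placeholder.

```python
# Create a dataset (holds tables), a table (holds rows and columns),
# and run a load job (an operation on the data).
from google.cloud import bigquery

client = bigquery.Client(project="my-project-id")  # placeholder project

# Dataset: a container for tables.
dataset = bigquery.Dataset("my-project-id.lab_dataset")
dataset.location = "US"
client.create_dataset(dataset, exists_ok=True)

# Table: a row-column structure with a schema.
schema = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("score", "INTEGER"),
]
table = bigquery.Table("my-project-id.lab_dataset.scores", schema=schema)
client.create_table(table, exists_ok=True)

# Job: loading data from Cloud Storage runs as a load job.
load_job = client.load_table_from_uri(
    "gs://my-bucket/input.csv",  # placeholder bucket
    "my-project-id.lab_dataset.scores",
    job_config=bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.CSV),
)
load_job.result()  # block until the job completes
```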
Lab Tasks:
Create a bucket and upload the required files (a code sketch follows this list).
Create a BigQuery dataset.
Create a batch pipeline with Dataflow.
Analyze the data in BigQuery.
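If you want a head start on the first task, here is a minimal sketch using the google-cloud-storage Python client. The bucket name and file paths are hypothetical; bucket names must be globally unique.

```python
# Lab Task 1 sketch: create a bucket and upload an input file.
from google.cloud import storage

client = storage.Client(project="my-project-id")  # placeholder project

# Bucket names share a global namespace, so pick something unique.
bucket = client.create_bucket("my-unique-lab-bucket", location="US")

blob = bucket.blob("input/input.csv")   # object path inside the bucket
blob.upload_from_filename("input.csv")  # local file assumed to exist
print(f"Uploaded to gs://{bucket.name}/{blob.name}")
```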