Dataflow vs Dataproc

Lab Details:

  1. This lab walks you through GCP Dataflow: you design a data pipeline as required and let the service automate its execution.

  2. This lab walks you through GCP Dataproc: you submit a mathematical computation as a job and retrieve the result.

  3. It also explains when to use Dataproc and when to use Dataflow.

  4. Region: us-central1 

  5. Duration: 1 hour

Note: Do not refresh the page after you click Start Lab. Wait a few seconds for the credentials to appear.
If Google asks for verification while you log in, enter your mobile number and verify with the OTP. This Google account will be deleted after the lab.

Both Cloud Dataflow and Cloud Dataproc are used for data processing and analytics.

What is Dataflow?

Cloud Dataflow is a fully managed, serverless data processing service: you submit a job and Dataflow takes care of the rest. Behind the scenes, when you submit a job, Dataflow spins up a cluster of virtual machines, distributes the tasks in your job across them, and dynamically scales the cluster based on how the job is performing.

Dataflow supports both batch and streaming jobs. It integrates with Pub/Sub for stream processing and with services such as BigQuery and Cloud Storage for batch processing.
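As a rough illustration of a batch job that reads from Cloud Storage, the sketch below builds the gcloud command that runs Google's public Word_Count template on Dataflow. The job name, bucket name, and file name are hypothetical placeholders, and the lab itself may use the Cloud Console instead of the command line. The helper only assembles the command; it does not run it.

```python
import shlex


def dataflow_wordcount_command(job_name, bucket, input_file="sample.txt",
                               region="us-central1"):
    """Build a gcloud command that runs the public Word_Count template.

    job_name, bucket and input_file are placeholders chosen for this
    example; substitute the values from your own lab environment.
    """
    return [
        "gcloud", "dataflow", "jobs", "run", job_name,
        # Google-provided batch template that counts words in a text file.
        "--gcs-location", "gs://dataflow-templates/latest/Word_Count",
        "--region", region,
        # Location Dataflow uses for temporary files.
        "--staging-location", f"gs://{bucket}/temp",
        # Template parameters: where to read from and where to write to.
        "--parameters",
        f"inputFile=gs://{bucket}/{input_file},"
        f"output=gs://{bucket}/output/wordcount",
    ]


if __name__ == "__main__":
    # Print the command so it can be reviewed before pasting into a shell.
    print(shlex.join(dataflow_wordcount_command("wordcount-demo",
                                                "my-lab-bucket")))
```

Returning the command as a list keeps each argument separate, so it could also be passed to subprocess.run without shell quoting issues.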

Advantages of using Dataflow:

  1. No need to think about resources: Dataflow is serverless and manages them itself.

  2. No need to tune performance by hand: Dataflow automatically optimizes the job by rebalancing the workload.

  3. Data is encrypted both in transit and at rest.

What is Dataproc?

Dataproc is used mainly for data analysis. If you want to use Apache Spark (to run queries on large datasets, build data pipelines, work with graphs, and handle other big data tasks), you need an in-memory cache, plenty of computing power, and capacity for terabytes or petabytes of data. On Google Cloud, Dataproc meets these requirements. With Dataproc you can typically create a cluster in about 90 seconds or less.

Dataproc is a managed Spark and Hadoop service used for batch processing and machine learning. Unlike Dataflow, it is cluster-based rather than serverless, so you create the clusters and scale the workers up and down yourself. It is ideal for lift-and-shift migration of an existing Hadoop environment.

The underlying stack for Dataproc includes Apache Spark, Apache Hive, Apache Pig, and Apache Hadoop.
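To make the create-a-cluster-then-submit-a-job workflow concrete, here is a minimal sketch that builds two gcloud commands: one to create a cluster and one to submit the SparkPi example that ships with Spark on Dataproc images. The cluster name and sizes are hypothetical, and using SparkPi as the job is an assumption for illustration; the lab may submit a different job.

```python
import shlex


def create_cluster_command(cluster, region="us-central1", workers=2):
    """Build the gcloud command that creates a Dataproc cluster.

    The cluster name and worker count are example values only.
    """
    return [
        "gcloud", "dataproc", "clusters", "create", cluster,
        "--region", region,
        "--num-workers", str(workers),
    ]


def submit_spark_pi_command(cluster, region="us-central1", tasks=1000):
    """Build the gcloud command that submits the SparkPi example job."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "spark",
        "--cluster", cluster,
        "--region", region,
        # Main class and jar of the Spark examples bundled on the cluster.
        "--class", "org.apache.spark.examples.SparkPi",
        "--jars", "file:///usr/lib/spark/examples/jars/spark-examples.jar",
        # Everything after "--" is passed to the job itself.
        "--", str(tasks),
    ]


if __name__ == "__main__":
    print(shlex.join(create_cluster_command("demo-cluster")))
    print(shlex.join(submit_spark_pi_command("demo-cluster")))
```

SparkPi estimates the value of pi, which fits the lab's idea of submitting a mathematical problem as a job; a larger task count spreads the work over more parallel tasks.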

Advantages of using Dataproc:

  • You don't need to worry about losing data when a cluster goes away, because Dataproc is integrated with Cloud Storage, BigQuery, and other core services where the data can be kept.

  • If you have a budget constraint, you can scale the cluster up or down even while jobs are running.

  • You can switch off clusters when you don't need them, which reduces billing charges.

  • It is easy to use: you can create clusters and submit a variety of jobs through the Google Cloud Console, the Cloud SDK, or the Dataproc REST API.

  • With less time and money spent on administration, you can focus on your jobs and your data.
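The scaling and switch-off points in the list above could look like this from the Cloud SDK. This is a sketch only: the cluster name and worker counts are hypothetical, and the helpers merely build the gcloud commands rather than run them.

```python
import shlex


def scale_cluster_command(cluster, workers, region="us-central1"):
    """Build the gcloud command that resizes a running Dataproc cluster."""
    return [
        "gcloud", "dataproc", "clusters", "update", cluster,
        "--region", region,
        # New number of primary workers for the cluster.
        "--num-workers", str(workers),
    ]


def stop_cluster_command(cluster, region="us-central1"):
    """Build the gcloud command that stops a cluster you are not using."""
    return [
        "gcloud", "dataproc", "clusters", "stop", cluster,
        "--region", region,
    ]


if __name__ == "__main__":
    # Example: grow to 5 workers for a heavy job, then stop the cluster.
    print(shlex.join(scale_cluster_command("demo-cluster", 5)))
    print(shlex.join(stop_cluster_command("demo-cluster")))
```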

Dataflow vs Dataproc:

  1. If you want to migrate an existing Hadoop/Spark cluster to the cloud, have a substantial investment in that stack, and already have experienced engineers, choose Dataproc. Reusing your existing jobs and skills helps lower your cost.
  2. If you are new to Hadoop/Spark and trust Google's expertise in large-scale data processing, choose Dataflow.


Lab Tasks:

  1. Create a bucket and upload the sample file.
  2. Create a job in Dataflow and check the output.
  3. Create a cluster in Dataproc and submit a job.
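The first task can also be done from the Cloud SDK. As a hedged sketch, the helpers below build the gcloud storage commands for creating a bucket in the lab region and uploading a file to it. The bucket name and file name are hypothetical; bucket names must be globally unique, so choose your own.

```python
import shlex


def create_bucket_command(bucket, region="us-central1"):
    """Build the gcloud command that creates a Cloud Storage bucket."""
    return [
        "gcloud", "storage", "buckets", "create", f"gs://{bucket}",
        "--location", region,
    ]


def upload_file_command(local_path, bucket):
    """Build the gcloud command that copies a local file into the bucket."""
    return ["gcloud", "storage", "cp", local_path, f"gs://{bucket}/"]


if __name__ == "__main__":
    # "my-lab-bucket" and "sample.txt" are placeholder names.
    print(shlex.join(create_bucket_command("my-lab-bucket")))
    print(shlex.join(upload_file_command("sample.txt", "my-lab-bucket")))
```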
