This lab walks you through GCP Dataflow: you design a flow or pipeline as required, and the service automates much of its execution.
It also walks you through GCP Dataproc: you solve a computational problem by submitting it as a job and retrieving the result.
When to use Dataproc and Dataflow.
Both Cloud Dataflow and Cloud Dataproc are used for data processing and analytics.
What is Dataflow?
Cloud Dataflow is a fully managed, serverless data processing service: you submit a job, and Dataflow takes care of the rest. Behind the scenes, when you submit a job, Dataflow spins up a cluster of virtual machines, distributes your job's tasks across those VMs, and dynamically scales the cluster based on how the job is performing.
Dataflow supports both batch and streaming jobs. It integrates with Pub/Sub for stream processing and with services such as BigQuery and Cloud Storage for batch processing.
No need to think about resources: the service is serverless and manages them itself.
No need to think about performance: Dataflow automatically optimizes each job by dynamically rebalancing the workload.
Data is encrypted both in transit and at rest.
What is Dataproc?
Dataproc is geared more toward data analysis. If you want to use Apache Spark (to run queries on large datasets, build data pipelines, work with graphs, and handle other big data workloads), you need an in-memory cache, strong computing power, and terabytes or petabytes of storage. Dataproc is Google Cloud's answer to these requirements: it can create a cluster in under 90 seconds.
Dataproc is a managed Spark and Hadoop service used for batch processing and machine learning. Because it is not fully managed, you create clusters and scale workers up and down yourself. It is ideal for lift-and-shift migration of an existing Hadoop environment.
The underlying stack for Dataproc comprises Apache Spark, Apache Hive, Apache Pig, and Apache Hadoop.
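To make the Spark execution model concrete, here is the map → shuffle → reduce word-count pattern that a Spark job on Dataproc distributes across worker nodes, sketched in plain standard-library Python; the comments name the equivalent Spark RDD calls.

```python
# Word count expressed as the map -> shuffle -> reduce stages that a
# Spark job on a Dataproc cluster parallelizes across workers.
from collections import defaultdict


def word_count(lines):
    # Map stage (Spark: rdd.flatMap(split).map(lambda w: (w, 1)))
    pairs = [(word, 1) for line in lines for word in line.lower().split()]

    # Shuffle stage: group pairs by key (Spark does this implicitly
    # when reduceByKey moves same-key pairs to the same worker)
    groups = defaultdict(list)
    for word, one in pairs:
        groups[word].append(one)

    # Reduce stage (Spark: .reduceByKey(lambda a, b: a + b))
    return {word: sum(ones) for word, ones in groups.items()}


print(word_count(["the quick brown fox", "the lazy dog"]))  # {'the': 2, ...}
```

On a real cluster, the equivalent PySpark script would be submitted with a command along the lines of `gcloud dataproc jobs submit pyspark wordcount.py --cluster=... --region=...` (cluster and region names are placeholders).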
You won’t need to worry about losing data, because Dataproc is integrated with BigQuery and other core services.
If you have a budget constraint, you can scale the cluster up or down even while jobs are running.
You can even shut clusters down when you don't need them, reducing your billing charges.
It is easy to use: you can create clusters and submit a variety of jobs through the Google Cloud Console, the Cloud SDK, or the Dataproc REST API.
With less time and money spent on administration, you can focus on your jobs and your data.
Dataflow vs Dataproc:
Lab Tasks: