This lab walks you through Google Cloud Dataproc. You will perform a sample computation by submitting it to a cluster as a job and retrieving the result.
Duration: 1 hour
Dataproc is primarily used for data analysis. If you want to use Apache Spark (to run queries on large datasets, build data pipelines, work with graphs, and handle other big-data workloads), you need an in-memory cache, substantial computing power, and terabytes or petabytes of storage. Dataproc on Google Cloud meets these requirements, and it can create a cluster in under 90 seconds.
The underlying stack for Dataproc consists of Apache Spark, Apache Hive, Apache Pig, and Apache Hadoop.
Think of a cluster as a group of one or more computers (nodes) connected to each other inside a single VPC (Virtual Private Cloud). The benefit of using a cluster is that it pools memory capacity and computational power, which increases the performance of the system.
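A cluster like this can be created from the Cloud Shell with a single command. The following is a minimal sketch; the cluster name `example-cluster`, the region, and the machine types are placeholder values, not part of the lab, so adjust them to your own project.

```shell
# Sketch: create a small Dataproc cluster from the Cloud Shell.
# "example-cluster", the region, and the machine types are placeholders.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --num-workers=2 \
    --master-machine-type=n1-standard-2 \
    --worker-machine-type=n1-standard-2
```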
As the name suggests, a job in Dataproc is any unit of work you assign to the cluster, for example calculating pi or counting the number of words in a document.
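For instance, the pi calculation mentioned above can be submitted as a Spark job using the SparkPi example class that ships with Spark. A sketch, assuming a cluster named `example-cluster` in `us-central1` (both placeholders):

```shell
# Sketch: submit the SparkPi example as a Dataproc job.
# The trailing "1000" is the number of sampling tasks; more tasks
# yield a more precise estimate of pi.
gcloud dataproc jobs submit spark \
    --cluster=example-cluster \
    --region=us-central1 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000
```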
You won't need to worry about losing data, because Dataproc is integrated with BigQuery and other core Google Cloud services.
If you have budget constraints, you can scale the cluster up or down even while jobs are running.
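Scaling can be done with a single update command. A sketch, again using the placeholder cluster name and region from above:

```shell
# Sketch: resize a running cluster to four workers.
# "example-cluster" and the region are placeholder values.
gcloud dataproc clusters update example-cluster \
    --region=us-central1 \
    --num-workers=4
```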
You can even stop a cluster when you don't need it, reducing billing charges.
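Stopping (and later deleting) a cluster is also a one-line operation. A sketch with placeholder names; a stopped cluster can be started again later, while deletion removes it entirely:

```shell
# Sketch: stop an idle cluster to pause billing for its VMs,
# or delete it when the lab is finished. Names are placeholders.
gcloud dataproc clusters stop example-cluster --region=us-central1
gcloud dataproc clusters delete example-cluster --region=us-central1
```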
In this lab, you will cover:
Creating a cluster and a job using the Cloud Shell.
Submitting a job to the cluster.
Updating the cluster using the Console.
Deleting the cluster.