Masking PII Data into BigQuery using Dataflow

Lab Details:

  1. This lab walks you through integrating the DLP API with Dataflow.

  2. You will create a streaming pipeline and mask the data with DLP.

  3. Region: us-central1

  4. Duration: 1 hour 15 minutes

Note: Do not refresh the page after you click Start Lab; wait a few seconds for the credentials to appear.
If Google asks for verification while you log in, enter your mobile number and verify with the OTP. This Google Account will be deleted after the lab.

What is Cloud DLP?

The Cloud Data Loss Prevention (DLP) API is a SaaS (Software as a Service) offering on GCP that helps you de-identify and protect sensitive data. It reduces the risk of exposure and helps you make informed decisions to prevent data loss. The DLP API uses information types, also known as infoTypes, to determine which content is sensitive.
Suppose an end user enters personal information such as an email address, credit card details, and a password in an application. Data of this kind is detected through the built-in infoTypes of the DLP API. To reduce the risk of exposing it, you can use the DLP API to inspect and transform the data. There are several ways to do this:

  • Inspect - Detects whether the given text contains sensitive information. Example: CC number - 4111-8888-8888-8118
  • Mask - Hides the sensitive content by replacing its characters with a special character. Example: CC number - ####-####-####-####
  • Replace - Replaces the sensitive content with a predetermined, fixed string value. Example: CC number - [CreditCardNumber]
  • Redact - Removes the sensitive content completely. Example: CC number -
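
The four operations above can be sketched locally in a few lines of Python. This is only an illustration of the idea and does not call the DLP API: the regular expression and the function names (find_cards, mask_cards, replace_cards, redact_cards) are hypothetical, and the pattern only matches the dash-separated 16-digit format used in the examples.

```python
import re

# Assumption: a card number is 16 digits in four dash-separated groups.
# The real DLP API uses far more robust detectors than this pattern.
CARD_PATTERN = re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b")


def find_cards(text):
    """Inspect: return every substring that looks like a card number."""
    return CARD_PATTERN.findall(text)


def mask_cards(text, masking_char="#"):
    """Mask: replace each digit of a match, keeping the dashes."""
    return CARD_PATTERN.sub(
        lambda match: re.sub(r"\d", masking_char, match.group()), text
    )


def replace_cards(text, placeholder="[CreditCardNumber]"):
    """Replace: swap each match for a fixed string."""
    return CARD_PATTERN.sub(placeholder, text)


def redact_cards(text):
    """Redact: remove each match entirely."""
    return CARD_PATTERN.sub("", text)


sample = "CC number - 4111-8888-8888-8118"
print(find_cards(sample))          # ['4111-8888-8888-8118']
print(mask_cards(sample))          # CC number - ####-####-####-####
print(replace_cards(sample))       # CC number - [CreditCardNumber]
print(repr(redact_cards(sample)))  # 'CC number - '
```

In the lab itself these transformations are configured through DLP templates rather than written by hand; the sketch is only meant to make the difference between the four outcomes concrete.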

What is Dataflow?

Cloud Dataflow is a fully managed, serverless data processing service: you submit a job and Dataflow takes care of the rest. Behind the scenes, when you submit a job, Dataflow spins up a cluster of virtual machines (VMs), distributes the tasks in your job across them, and dynamically scales the cluster based on how the job is performing.

Dataflow supports both batch and streaming jobs. It can be integrated with Pub/Sub for stream processing and with other services such as BigQuery and Cloud Storage for batch processing.


What is BigQuery?

A top driving factor for customers migrating to Google Cloud Platform is Google's high-end, fully managed data analytics services. One of these services is BigQuery, a serverless, highly scalable data warehouse. With huge volumes of data generated by sensors and applications, there is a need for a low-cost data warehouse that stores the data and lets you analyze it when needed. With BigQuery you can analyze petabytes of data within minutes.

BigQuery stores data in a columnar format, which achieves a high compression ratio and scan throughput. It is most efficient with datasets that do not change frequently; for example, it can be used at the end of an ETL pipeline to enrich and analyze the data. For daily OLTP workloads or sensor data, you should opt for other services such as Cloud SQL, Cloud Spanner, and Cloud Bigtable.

You are charged for the data you store and for the bytes read to produce your query output. You can refer to the Introduction to BigQuery Lab to explore more about Google Cloud BigQuery.
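
To make the columnar-storage and bytes-read points concrete, here is a minimal sketch of why a query that touches fewer columns reads fewer bytes. The table, its column names, and the size model (8 bytes per integer, UTF-8 length per string) are assumptions chosen for illustration; this is not BigQuery's actual size accounting or pricing.

```python
# Hypothetical table stored column by column: column name -> list of values.
table = {
    "user_id": [1, 2, 3, 4],
    "email": ["a@example.com", "b@example.com", "c@example.com", "d@example.com"],
    "country": ["US", "IN", "DE", "US"],
}


def value_size(value):
    """Simplified size model: 8 bytes per integer, UTF-8 length per string."""
    if isinstance(value, int):
        return 8
    return len(str(value).encode("utf-8"))


def bytes_scanned(table, columns):
    """Sum the sizes of only the columns that a query references."""
    return sum(value_size(v) for name in columns for v in table[name])


# A query that selects one column reads only that column's data.
print(bytes_scanned(table, ["country"]))  # 8
# A query that selects every column reads the whole table.
print(bytes_scanned(table, list(table)))  # 92
```

In a row-oriented store, both queries would have to read every row in full; in a columnar layout, selecting only the columns you need keeps the bytes read, and therefore the query cost, lower.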

Advantages of using BigQuery:

  1. Zero overhead cost for running operations on BigQuery.

  2. Very low TCO (Total Cost of Ownership).

  3. Integrated with BigQuery ML to create ML models.

  4. Supported by Data Studio to visualize the analyzed data.

  5. Complemented by BigQuery Omni, a multi-cloud data analysis service.

Disadvantages of using BigQuery:

  1. Because it is a fully managed service, you have little control over where and how your data is physically stored.

Lab Tasks:

  1. Create a BigQuery Dataset ID and Table ID.

  2. Create a Cloud Storage bucket and upload an object.

  3. Create an Inspection Template.

  4. Create a Job Trigger and save the data to BigQuery.

  5. Create a De-identification Template.

  6. Create a Pipeline using Cloud Dataflow.

  7. Analyze data in BigQuery.
