This lab walks you through integrating the DLP API with Dataflow.
You will create a streaming pipeline and mask sensitive data with DLP.
Region: us-central1
Duration: 1 hour 15 minutes
The GCP Data Loss Prevention (DLP) API, a SaaS (Software as a Service) offering, helps you de-identify and protect sensitive data. It reduces potential risk and helps you make informed decisions to prevent data loss. The DLP API uses information types, also known as infoTypes, to determine which content is sensitive.
Let's say an end user enters personal information like an email address, credit card details, and a password into an application. These kinds of data correspond to the built-in infoTypes we use with the DLP API. To avoid threats, we can use the DLP API to mask the data. There are multiple ways to implement this.
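One of them is to call the DLP API directly from client code. The sketch below is a minimal illustration, not one of this lab's steps: the project ID, the sample string, and the particular infoTypes chosen are placeholders/assumptions for demonstration only.

```python
# A minimal sketch: mask a few built-in infoTypes with the DLP API.
# "my-project" and the sample text are placeholders, not lab values.
from google.cloud import dlp_v2


def mask_sensitive_text(project_id: str, text: str) -> str:
    """De-identify `text` by masking matches of a few built-in infoTypes."""
    client = dlp_v2.DlpServiceClient()
    parent = f"projects/{project_id}"

    # Built-in infoTypes the API should look for.
    inspect_config = {
        "info_types": [
            {"name": "EMAIL_ADDRESS"},
            {"name": "CREDIT_CARD_NUMBER"},
            {"name": "PHONE_NUMBER"},
        ]
    }

    # Replace every character of each finding with "#".
    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                {
                    "primitive_transformation": {
                        "character_mask_config": {"masking_character": "#"}
                    }
                }
            ]
        }
    }

    response = client.deidentify_content(
        request={
            "parent": parent,
            "inspect_config": inspect_config,
            "deidentify_config": deidentify_config,
            "item": {"value": text},
        }
    )
    return response.item.value


print(mask_sensitive_text("my-project", "Reach me at jane.doe@example.com"))
```

Running this against the sample string would return the text with the email address replaced by `#` characters. Another way, and the one this lab focuses on, is to do the masking inside a Dataflow pipeline.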
Cloud Dataflow is a fully managed, serverless data processing service: you submit a job, and Dataflow takes care of the rest. Behind the scenes, when you submit a job, Cloud Dataflow spins up a cluster of virtual machines, distributes the tasks in your job across those VMs, and dynamically scales the cluster based on how the job is performing.
Dataflow supports both batch and streaming jobs. It can be integrated with Pub/Sub for stream processing and with services like BigQuery and Cloud Storage for batch processing.
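To make this concrete, below is a minimal Apache Beam sketch of a streaming pipeline that Dataflow could run: it reads messages from a Pub/Sub subscription and appends them to a BigQuery table. The project, subscription, bucket, and table names are placeholders, not values used later in this lab.

```python
# A minimal Apache Beam streaming pipeline sketch for the Dataflow runner.
# All resource names below ("my-project", "my-subscription", etc.) are
# placeholders for illustration only.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    streaming=True,
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        # Pull raw messages from a Pub/Sub subscription.
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/my-subscription")
        # Decode the bytes and wrap each message as a BigQuery row.
        | "DecodeAndWrap" >> beam.Map(lambda msg: {"raw": msg.decode("utf-8")})
        # Append rows to a BigQuery table.
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.raw_messages",
            schema="raw:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```

In this lab, the same pattern is used with a Google-provided Dataflow template that calls the DLP API to mask the data before it lands in BigQuery.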
One of the top driving factors for customers migrating to Google Cloud Platform is Google's high-end, fully managed data analytics services. One of these is BigQuery, a serverless, highly scalable data warehouse. With the huge volume of data generated by sensors and applications, there is a need for a low-cost data warehouse in which to store the data and analyze it when needed. With BigQuery you can analyze petabytes of data within minutes. BigQuery stores data in a columnar format, achieving a high compression ratio and scan throughput.

BigQuery is most efficient with datasets that do not change often; for example, it can sit at the end of an ETL pipeline to enrich and analyze data. For daily OLTP workloads or sensor data, you should opt for other services such as Cloud SQL, Cloud Spanner, and Cloud Bigtable. You are charged for the data you store and for the bytes read to produce your query output. You can refer to the Introduction to BigQuery lab to explore more about Google Cloud BigQuery. Some of BigQuery's strengths and limitations are listed below, followed by a short query sketch:
Zero overhead cost of running operations on BigQuery.
Very low TCO (Total Cost of Ownership).
Integrated with BigQuery ML to create ML models.
Supported by Data Studio to visualize the analyzed data.
Extended by BigQuery Omni, a multi-cloud data analysis service.
On the other hand, as it's a fully managed service, you have no control over where and how your data is stored.
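As a quick illustration of the pricing note above (you pay for the bytes read), here is a minimal sketch that runs a query with the BigQuery Python client and reports the bytes processed. The project, dataset, table, and column names are placeholders, not objects created in this lab.

```python
# A minimal google-cloud-bigquery sketch. The project, dataset, table,
# and column names are placeholders for illustration only.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

query = """
    SELECT info_type, COUNT(*) AS findings
    FROM `my-project.my_dataset.dlp_findings`
    GROUP BY info_type
    ORDER BY findings DESC
"""

job = client.query(query)      # starts the query job
for row in job.result():       # waits for completion and iterates rows
    print(row.info_type, row.findings)

# BigQuery bills you for the bytes scanned by the query.
print(f"Bytes processed: {job.total_bytes_processed}")
```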
Create a BigQuery dataset and table (a short scripted sketch of this step follows the list).
Create a Cloud Storage bucket and upload an object.
Create an inspection template.
Create a job trigger and save the findings to BigQuery.
Create a de-identification template.
Create a pipeline using Cloud Dataflow.
Analyze data in BigQuery
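Although the lab walks through these steps in the console, the first one can also be scripted. The sketch below is only an assumption about how you might do it with the BigQuery Python client; the dataset ID, table ID, and schema are placeholders, not the lab's exact values.

```python
# A hedged sketch of the first step: create a BigQuery dataset and table
# from Python. The IDs and schema are placeholders; the lab may use
# different names.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Create the dataset in the lab's region.
dataset = bigquery.Dataset("my-project.dlp_dataset")
dataset.location = "us-central1"
dataset = client.create_dataset(dataset, exists_ok=True)

# Create a table to hold the de-identified records.
schema = [
    bigquery.SchemaField("card_number", "STRING"),
    bigquery.SchemaField("email", "STRING"),
]
table = bigquery.Table("my-project.dlp_dataset.masked_data", schema=schema)
table = client.create_table(table, exists_ok=True)

print(f"Created dataset {dataset.dataset_id} and table {table.table_id}")
```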