Saturday, July 27, 2024

Read PII Data with Google Distributed Cloud Dataproc


Due to operational or regulatory constraints, Google Cloud clients who are interested in developing or updating their data lake architecture frequently have to keep a portion of their workloads and data on-premises.

Dataproc on Google Distributed Cloud, announced in preview at Google Cloud Next ’24, lets you fully modernise your data lake with cloud-based technologies while building a hybrid data processing footprint, so that you can store and process on-premises the data that you cannot move to the cloud.

Using Google-provided hardware in your data centre, Dataproc on Google Distributed Cloud enables you to run Apache Spark processing workloads on-premises while preserving compatibility between your local and cloud-based technology.

For instance, in order to comply with regulatory obligations, a sizable European telecoms business is updating its data lake on Google Cloud while maintaining Personally Identifiable Information (PII) data on-premises on Google Distributed Cloud.

This post shows how to use Dataproc on Google Distributed Cloud to read PII data that is stored on-premises, compute aggregate metrics, and upload the resulting dataset to the data lake in the cloud using Google Cloud Storage.

Compile and encrypt private information on-site

The customer in this demo scenario is a telecom provider that keeps event records of user calls:

customer id | customer name | call duration | call type | signal strength | device type | location
1 | <redacted> | 141 | Voice | 379 | LG Q6 | Tammieview, PA
2 | <redacted> | 26 | Video | 947 | Kyocera Hydro Elite | New Angela, FL
3 | <redacted> | 117 | Voice | 625 | Huawei Y5 | Toddville, MO
4 | <redacted> | 36 | Video | 382 | iPhone X | Richmondview, NV
5 | <redacted> | 110 | Video | 461 | HTC 10 evo | Cowanchester, KS
6 | <redacted> | 0 | Video | 326 | Galaxy S7 | Nicholsside, NV
7 | <redacted> | 200 | Data | 448 | Kyocera Hydro Elite | New Taramouth, AR
8 | <redacted> | 178 | Data | 475 | Galaxy S7 | South Heather, CT
9 | <redacted> | 200 | Voice | 538 | Oppo Reno6 Pro+ 5G | Gregoryburgh, ID
10 | <redacted> | 113 | Voice | 878 | ZTE Axon 30 Ultra 5G | Karaview, NV
11 | <redacted> | 200 | Data | 722 | Huawei P10 Lite | Petersonstad, IA
12 | <redacted> | 200 | Voice | 1 | HTC 10 evo | West Danielport, CO
13 | <redacted> | 169 | Voice | 230 | Samsung Galaxy S10+ | North Jose, SD
14 | <redacted> | 198 | Voice | 1 | Kyocera DuraForce | East Matthewmouth, AS
15 | <redacted> | 155 | Data | 757 | Oppo Find X | Tuckerchester, MD
16 | <redacted> | 0 | Data | 1 | ZTE Axon 30 Ultra 5G | New Tammy, NC
17 | <redacted> | 200 | Data | 656 | Galaxy Note 7 | East Jeanside, NJ
18 | <redacted> | 15 | Data | 567 | Huawei Y5 | Lake Patrickburgh, OH


This dataset contains PII. To comply with regulations, the PII must stay on-site in the customer’s own data centre, so the customer stores the data on-premises in S3-compatible object storage. However, the customer also wants to use its broader data lake in Google Cloud to analyse signal strength by location and identify the best places to invest in new infrastructure.

Dataproc on Google Distributed Cloud can run the Spark job that aggregates signal quality entirely on-premises. This allows integration with Google Cloud data analytics while still meeting compliance requirements.
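
The aggregation itself is simple. The following is a minimal sketch in plain Python rather than Spark, meant only to show the logic. The rows, locations, and values are hypothetical and are not the customer's data; the field names mirror the columns of the sample table. The function averages signal strength per location and returns only the location and the aggregate, so no customer-level field appears in the result.

```python
from collections import defaultdict

# Hypothetical call events; field names mirror the sample table's columns.
EVENTS = [
    {"customer_id": 1, "location": "Exampleville, XX", "signal_strength": 400},
    {"customer_id": 2, "location": "Exampleville, XX", "signal_strength": 600},
    {"customer_id": 3, "location": "Sampleton, YY", "signal_strength": 1},
    {"customer_id": 4, "location": "Sampleton, YY", "signal_strength": 3},
    {"customer_id": 5, "location": "Testburg, ZZ", "signal_strength": 250},
]


def average_signal_by_location(events):
    """Return (location, mean signal strength) pairs, weakest location first."""
    readings = defaultdict(list)
    for event in events:
        readings[event["location"]].append(event["signal_strength"])
    averages = {
        location: round(sum(values) / len(values), 2)
        for location, values in readings.items()
    }
    # Sort ascending so the areas with the poorest signal come first.
    return sorted(averages.items(), key=lambda pair: pair[1])


for location, value in average_signal_by_location(EVENTS):
    print(location, value)
```

In a real deployment the same group-and-average step would be expressed as a Spark job so that it scales across the cluster; only the aggregated rows would then be written to Cloud Storage.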

The output in Cloud Storage shows several areas with low signal quality:

Location | Value
Georgefurt, MS | 1.0
Scottside, MA | 1.0
Monroemouth, FL | 1.0
Lake Robert, OH | 1.0
East Lauren, VA | 1.0
Shelleyburgh, CT | 1.0
Buckville, ID | 1.0
Garzaton, WI | 3.32
North Danielle, NY | 3.99
Port Natalie, ID | 5.43


Reading PII data with Dataproc on Google Distributed Cloud takes several steps, each of which must keep data processing compliant with privacy requirements.

Start by setting up your Google Cloud environment:

  • Create a Google Cloud project if you don’t already have one.
  • Enable billing for the project.
  • In the project, enable the Dataproc API, the Cloud Storage API, and any other APIs you need.

Prepare PII

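  • Store PII securely. In the scenario above, that means on-premises S3-compatible object storage, with only the aggregated output going to Google Cloud Storage. Encrypt the data and restrict access to both the bucket and the data.
  • Classify data: Label data by sensitivity and compliance requirements.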

Create and configure Dataproc Cluster

  • Create a Dataproc cluster using the Google Cloud Console or the gcloud command-line tool. Set the node count and machine type, and install the software and libraries your job needs.
  • Security Configuration: Set IAM roles and permissions to restrict data access and processing to authorised users.
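
As a rough illustration of the cluster-creation step, the sketch below assembles a `gcloud dataproc clusters create` command as an argument list that could be passed to `subprocess.run`. The cluster name, region, project ID, and machine type are hypothetical placeholders. The flags shown are the ones used for Dataproc on Compute Engine; provisioning Dataproc on Google Distributed Cloud may use different commands, so treat this as an assumption to check against the product documentation.

```python
import shlex


def build_create_cluster_command(cluster, region, project,
                                 num_workers=2,
                                 worker_machine_type="n1-standard-4"):
    """Assemble the gcloud arguments that create a Dataproc cluster."""
    return [
        "gcloud", "dataproc", "clusters", "create", cluster,
        f"--project={project}",
        f"--region={region}",
        f"--num-workers={num_workers}",
        f"--worker-machine-type={worker_machine_type}",
    ]


# Hypothetical names; replace them with your own values.
command = build_create_cluster_command(
    "pii-aggregation", "europe-west1", "my-project"
)
print(shlex.join(command))
```

Building the command as a list rather than a single string avoids shell-quoting problems when a value contains spaces or special characters.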

Develop Your Data Processing Job

  • Choose a Processing Framework: Consider Apache Spark or Hadoop.
  • Write the data processing job: Create a script or application to process the PII. This may involve reading the source data (in the scenario above, from on-premises S3-compatible storage), transforming it, and writing the output to Cloud Storage or another storage solution.
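
When the source data sits in S3-compatible object storage, a Spark job typically reaches it through the Hadoop S3A connector. The sketch below builds the Spark properties that point S3A at a custom endpoint. The endpoint URL is a hypothetical placeholder, and whether Dataproc on Google Distributed Cloud needs these settings (or preconfigures them) is an assumption; credentials are deliberately left out and should come from a secure source.

```python
def s3a_properties(endpoint_url, use_path_style=True):
    """Spark properties that point the S3A connector at a custom endpoint."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint_url,
        "spark.hadoop.fs.s3a.path.style.access":
            "true" if use_path_style else "false",
    }


def as_properties_argument(properties):
    """Format properties as a comma-separated key=value list."""
    return ",".join(
        f"{key}={value}" for key, value in sorted(properties.items())
    )


# Hypothetical on-premises endpoint; credentials are not set here.
props = s3a_properties("https://objectstore.example.internal:9000")
print(as_properties_argument(props))
```

The comma-separated form matches what the `--properties` flag of `gcloud dataproc jobs submit` accepts, provided no value contains a comma.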

Job Submission to Dataproc Cluster

  • Submit your job to the cluster via the Google Cloud Console, gcloud command-line tool, or Dataproc API.
  • Check the job status and logs to confirm that it completed.
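
One possible way to script the submit-and-check steps is sketched below. It builds a `gcloud dataproc jobs submit pyspark` command and a `gcloud dataproc jobs describe` command that prints only the job state, plus a small helper that decides whether a state is terminal. The script URI, cluster name, region, and job arguments are hypothetical placeholders.

```python
import shlex

# Job states after which the job will not change any further.
TERMINAL_STATES = {"DONE", "ERROR", "CANCELLED"}


def build_submit_command(script_uri, cluster, region, job_args=()):
    """Assemble the gcloud arguments that submit a PySpark job."""
    command = [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", script_uri,
        f"--cluster={cluster}",
        f"--region={region}",
    ]
    if job_args:
        # Everything after "--" is passed through to the job itself.
        command += ["--", *job_args]
    return command


def build_describe_command(job_id, region):
    """Assemble the gcloud arguments that print only the job's state."""
    return [
        "gcloud", "dataproc", "jobs", "describe", job_id,
        f"--region={region}",
        "--format=value(status.state)",
    ]


def is_finished(state):
    """Return True when the reported state is a terminal one."""
    return state.strip().upper() in TERMINAL_STATES


# Hypothetical names; replace them with your own values.
print(shlex.join(build_submit_command(
    "gs://my-bucket/jobs/aggregate_signal.py", "pii-aggregation",
    "europe-west1", job_args=["--output", "gs://my-bucket/output/"],
)))
```

A finished job is not necessarily a successful one: `DONE` indicates success, while `ERROR` and `CANCELLED` call for a look at the job logs.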

Compliance and Data Security

  • Encrypt data at rest and in transit.
  • Use IAM policies to restrict data and resource access.
  • Compliance: Follow applicable data protection laws such as GDPR and CCPA.

Delete the Dataproc Cluster

  • To save money, delete the Dataproc cluster once data processing is complete.
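
A common pattern for making sure cleanup always happens is to wrap the job in `try`/`finally`, so the cluster is deleted even when the job fails. The sketch below assumes a cluster that was created only for this job; the function name and its `run` parameter are hypothetical, with `run` defaulting to `subprocess.run` so that a fake can be injected when testing.

```python
import subprocess


def run_job_then_delete_cluster(cluster, region, submit_command,
                                run=subprocess.run):
    """Run the submit command, then delete the cluster whatever happens."""
    delete_command = [
        "gcloud", "dataproc", "clusters", "delete", cluster,
        f"--region={region}",
        "--quiet",  # skip the interactive confirmation prompt
    ]
    try:
        # check=True raises if the submit command exits with an error.
        run(submit_command, check=True)
    finally:
        # Runs on success and on failure, so the cluster never lingers.
        run(delete_command, check=True)
```

If the job fails, the original exception still propagates after the cluster has been deleted, so the caller can log or retry as needed.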

Best Practices

  • Always mask or anonymize PII data when processing.
  • Track PII data access and changes with thorough logging and monitoring.
  • Regularly audit data access and processing for compliance.
  • Data minimization: Process only the PII data you need.
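
Two of these practices, masking and data minimization, can be sketched with the Python standard library. The example below replaces an identifier with a keyed hash (HMAC-SHA256) and keeps only the fields a job needs. The record, field names, and key are hypothetical. Note that a keyed hash is pseudonymization rather than full anonymization: anyone holding the key can recompute the mapping, so the key must be protected like the data itself.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Replace a value with a keyed SHA-256 digest (hex string)."""
    return hmac.new(
        secret_key, value.encode("utf-8"), hashlib.sha256
    ).hexdigest()


def minimize(record, allowed_fields):
    """Keep only the fields that the processing job actually needs."""
    return {
        field: record[field] for field in allowed_fields if field in record
    }


# Hypothetical record and key, for illustration only.
record = {
    "customer_id": "12345",
    "customer_name": "Jane Example",
    "signal_strength": 382,
    "location": "Exampleville, XX",
}
key = b"replace-with-a-key-from-a-secret-manager"

safe = minimize(record, ["customer_id", "signal_strength", "location"])
safe["customer_id"] = pseudonymize(safe["customer_id"], key)
print(safe)
```

Because the digest is deterministic for a given key, records belonging to the same customer can still be grouped or joined after pseudonymization.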

Conclusion

Processing PII with Dataproc on Google Distributed Cloud requires careful design and execution to maintain data protection and compliance. Follow the steps and best practices above to use Dataproc for data processing while protecting sensitive data.

Dataproc

Dataproc is a managed, scalable service that supports Apache Hadoop, Spark, Flink, Presto, and more than thirty other open source tools and frameworks. Use Dataproc for secure data science, ETL, and data lake modernization at scale, integrated with Google Cloud, at a significantly lower cost.

ADVANTAGES

Modernise your open source data processing

Serverless deployment, logging, and monitoring let you focus on your data and analytics rather than on your infrastructure. Reduce the total cost of ownership (TCO) of Apache Spark management by up to 54%. Build and train models five times faster.

Intelligent and seamless OSS for data science

Native integrations with BigQuery, Dataplex, Vertex AI, and OSS notebooks such as JupyterLab let data scientists and analysts carry out data science tasks with ease.

Google Cloud integration with enterprise security

Security features include OS Login, customer-managed encryption keys (CMEK), VPC Service Controls, and default at-rest encryption. Add a security configuration to enable Hadoop Secure Mode via Kerberos.

Important characteristics

Completely automated and managed open-source big data applications

Serverless deployment, logging, and monitoring let you focus on your data and analytics rather than on your infrastructure. Reduce the total cost of ownership (TCO) of Apache Spark management by up to 54%. Integrate with Vertex AI Workbench so that data scientists and engineers can build and train models five times faster than with standard notebooks. The Dataproc Jobs API makes it simple to incorporate big data processing into custom applications, while Dataproc Metastore removes the need to run your own Hive metastore or catalogue service.

Use Kubernetes to containerise Apache Spark jobs

Build your Apache Spark jobs with Dataproc on Kubernetes so that you can use Dataproc with Google Kubernetes Engine (GKE) to provide job portability and isolation.

Google Cloud integration with enterprise security

When you create a Dataproc cluster, you can enable Hadoop Secure Mode via Kerberos by adding a Security Configuration. The Google Cloud-specific security features most commonly used with Dataproc include default at-rest encryption, OS Login, VPC Service Controls, and customer-managed encryption keys (CMEK).

The best of Google Cloud combined with the finest of open source

More than 30 open source frameworks, including Apache Hadoop, Spark, Flink, and Presto, are supported by the managed, scalable Dataproc service. Simultaneously, Dataproc offers native integration with the whole Google Cloud database, analytics, and artificial intelligence ecosystem. Building data applications and linking Dataproc to BigQuery, Vertex AI, Spanner, Pub/Sub, or Data Fusion is a breeze for data scientists and developers.

Thota nithya
Thota Nithya has been writing cloud computing articles for govindhtech since April 2023. She is a science graduate and a cloud computing enthusiast.