PII data
Because of operational or regulatory constraints, Google Cloud customers who are building or modernising a data lake architecture often have to keep some of their workloads and data on-premises.
Dataproc on Google Distributed Cloud, unveiled in preview at Google Cloud Next ’24, lets you modernise your data lake with cloud-based technologies while building a hybrid data processing footprint, so you can store and process on-premises the data that you cannot move to the cloud.
Using Google-provided hardware in your data centre, Dataproc on Google Distributed Cloud lets you run Apache Spark processing workloads on-premises while keeping your local and cloud-based technology compatible.
For instance, to comply with regulatory obligations, a large European telecommunications company is modernising its data lake on Google Cloud while keeping Personally Identifiable Information (PII) on-premises on Google Distributed Cloud.
This post shows how to use Dataproc on Google Distributed Cloud to read PII that is stored on-premises, compute aggregate metrics, and upload the final dataset to the data lake in the cloud using Google Cloud Storage.
Aggregate and anonymise sensitive data on-premises
The customer in this example scenario is a telecom provider that keeps event records of user calls:
customer id | customer name | call duration | call type | signal strength | device type | location |
1 | <redacted> | 141 | Voice | 379 | LG Q6 | Tammieview, PA |
2 | <redacted> | 26 | Video | 947 | Kyocera Hydro Elite | New Angela, FL |
3 | <redacted> | 117 | Voice | 625 | Huawei Y5 | Toddville, MO |
4 | <redacted> | 36 | Video | 382 | iPhone X | Richmondview, NV |
5 | <redacted> | 110 | Video | 461 | HTC 10 evo | Cowanchester, KS |
6 | <redacted> | 0 | Video | 326 | Galaxy S7 | Nicholsside, NV |
7 | <redacted> | 200 | Data | 448 | Kyocera Hydro Elite | New Taramouth, AR |
8 | <redacted> | 178 | Data | 475 | Galaxy S7 | South Heather, CT |
9 | <redacted> | 200 | Voice | 538 | Oppo Reno6 Pro+ 5G | Gregoryburgh, ID |
10 | <redacted> | 113 | Voice | 878 | ZTE Axon 30 Ultra 5G | Karaview, NV |
11 | <redacted> | 200 | Data | 722 | Huawei P10 Lite | Petersonstad, IA |
12 | <redacted> | 200 | Voice | 1 | HTC 10 evo | West Danielport, CO |
13 | <redacted> | 169 | Voice | 230 | Samsung Galaxy S10+ | North Jose, SD |
14 | <redacted> | 198 | Voice | 1 | Kyocera DuraForce | East Matthewmouth, AS |
15 | <redacted> | 155 | Data | 757 | Oppo Find X | Tuckerchester, MD |
16 | <redacted> | 0 | Data | 1 | ZTE Axon 30 Ultra 5G | New Tammy, NC |
17 | <redacted> | 200 | Data | 656 | Galaxy Note 7 | East Jeanside, NJ |
18 | <redacted> | 15 | Data | 567 | Huawei Y5 | Lake Patrickburgh, OH |
This dataset contains PII. To comply with regulations, the PII must stay on-site in the customer's own data centre, so the customer stores the data on-premises in S3-compatible object storage. However, the customer now wants to use its broader data lake in Google Cloud to analyse signal strength by location and identify the best places to invest in new infrastructure.
Dataproc on Google Distributed Cloud supports running Spark jobs fully on-premises, so a job can aggregate signal quality locally. This allows integration with Google Cloud data analytics while meeting compliance requirements.
The output in Cloud Storage reveals several areas with low signal quality:
Location | Aggregated signal strength |
Georgefurt, MS | 1.0 |
Scottside, MA | 1.0 |
Monroemouth, FL | 1.0 |
Lake Robert, OH | 1.0 |
East Lauren, VA | 1.0 |
Shelleyburgh, CT | 1.0 |
Buckville, ID | 1.0 |
Garzaton, WI | 3.32 |
North Danielle, NY | 3.99 |
Port Natalie, ID | 5.43 |
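The kind of aggregation that produces such an output can be illustrated in plain Python. This is a minimal sketch, not the job the customer runs: it assumes the aggregate is a per-location average, uses a handful of rows copied from the call-record table above, and the function names are hypothetical. A Spark job would perform the equivalent group-by at scale.

```python
from collections import defaultdict
from statistics import mean

# A few rows from the call-record table above:
# (customer_id, call_type, signal_strength, location).
# Customer names are left out on purpose; the aggregate never needs them.
RECORDS = [
    (1, "Voice", 379, "Tammieview, PA"),
    (6, "Video", 326, "Nicholsside, NV"),
    (12, "Voice", 1, "West Danielport, CO"),
    (14, "Voice", 1, "East Matthewmouth, AS"),
    (16, "Data", 1, "New Tammy, NC"),
]


def average_signal_by_location(records):
    """Group records by location and average the signal strength (assumed metric)."""
    grouped = defaultdict(list)
    for _customer_id, _call_type, signal, location in records:
        grouped[location].append(signal)
    return {location: round(mean(values), 2) for location, values in grouped.items()}


def weakest_locations(averages, limit=3):
    """Return the `limit` locations with the lowest average signal strength."""
    return sorted(averages.items(), key=lambda item: (item[1], item[0]))[:limit]


if __name__ == "__main__":
    for location, value in weakest_locations(average_signal_by_location(RECORDS)):
        print(f"{location} | {value}")
```

Only the location and the aggregated value appear in the result, which is why a dataset of this shape can leave the data centre while the per-customer rows stay on-premises.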
Reading PII with Dataproc on Google Distributed Cloud involves several steps to ensure correct data processing and privacy compliance.
Start by setting up your Google Cloud environment.
- Create a Google Cloud project if you don’t already have one.
- Enable billing for the project.
- In your Google Cloud project, enable the Dataproc API, Cloud Storage API, and any other relevant APIs.
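Enabling the APIs can be scripted. The sketch below only builds the `gcloud services enable` command as an argument list so it can be reviewed before running; the function name and the project ID in the comment are placeholders, and the list of services is an assumption that covers just the two APIs named above.

```python
# Service names for the Dataproc and Cloud Storage APIs.
REQUIRED_APIS = ("dataproc.googleapis.com", "storage.googleapis.com")


def enable_apis_command(project_id, apis=REQUIRED_APIS):
    """Build the argv for `gcloud services enable`; this does not run anything."""
    return ["gcloud", "services", "enable", *apis, f"--project={project_id}"]


# To execute it (requires the gcloud CLI and suitable permissions):
#   import subprocess
#   subprocess.run(enable_apis_command("my-telco-project"), check=True)
```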
Prepare PII
- Store PII securely in object storage (in the scenario above, on-premises S3-compatible storage). Encrypt the data and restrict access to buckets and objects.
- Classify data: Label data by sensitivity and compliance requirements.
Create and configure Dataproc Cluster
- Create a Dataproc cluster using the Google Cloud Console or the gcloud command-line tool. Set the node count and type, and install the software and libraries your jobs need.
- Security Configuration: Set IAM roles and permissions to restrict data access and processing to authorised users.
Develop Your Data Processing Job
- Choose a Processing Framework: Consider Apache Spark or Hadoop.
- Write the Data Processing Job: Create a script or app to process PII. This may involve reading GCS data, transforming it, and writing the output to GCS or another storage solution.
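For the telecom scenario, such a job might look like the PySpark sketch below. Several details are assumptions rather than facts from this post: the records are taken to be CSV files with a header row, the column names `location` and `signal_strength` are hypothetical, the S3-compatible endpoint and both paths are placeholders, and credentials plus the S3A and Cloud Storage connectors are assumed to be configured on the cluster already.

```python
def s3a_conf(endpoint):
    """Spark settings that point the Hadoop S3A connector at an S3-compatible store."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        # Many S3-compatible stores expect path-style rather than virtual-host URLs.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


def run_job(input_path, output_path, endpoint):
    """Read call records on-premises, aggregate by location, write the summary out."""
    # Imported here so the module can be loaded on machines without PySpark.
    from pyspark.sql import SparkSession, functions as F

    builder = SparkSession.builder.appName("signal-strength-by-location")
    for key, value in s3a_conf(endpoint).items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    calls = (
        spark.read.option("header", True)
        .option("inferSchema", True)
        .csv(input_path)
    )
    # Only location and the aggregated metric survive; no customer columns are written.
    summary = (
        calls.groupBy("location")
        .agg(F.round(F.avg("signal_strength"), 2).alias("avg_signal_strength"))
        .orderBy("avg_signal_strength")
    )
    summary.write.mode("overwrite").csv(output_path, header=True)
    spark.stop()


# Example call with placeholder locations:
#   run_job("s3a://call-records/", "gs://telco-data-lake/signal-summary/",
#           "https://objects.example.internal")
```

Keeping the S3A settings in a small helper makes the on-premises endpoint easy to swap between environments without touching the aggregation logic.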
Job Submission to Dataproc Cluster
- Submit your job to the cluster via the Google Cloud Console, gcloud command-line tool, or Dataproc API.
- Check job status and logs to confirm completion.
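With the gcloud tool, submission and a status check could be scripted roughly as follows. The helpers are hypothetical and only build argument lists for the standard `gcloud dataproc jobs` commands; this post does not say whether Dataproc on Google Distributed Cloud uses exactly the same submission path, so treat that as an assumption.

```python
def submit_pyspark_command(script, cluster, region, job_args=()):
    """Build the argv for submitting a PySpark script to an existing cluster."""
    cmd = [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", script,
        f"--cluster={cluster}",
        f"--region={region}",
    ]
    if job_args:
        # Everything after "--" is passed to the script itself.
        cmd += ["--", *job_args]
    return cmd


def job_state_command(job_id, region):
    """Build the argv that prints only the job's current state."""
    return [
        "gcloud", "dataproc", "jobs", "describe", job_id,
        f"--region={region}",
        "--format=value(status.state)",
    ]
```

Passing either list to `subprocess.run(..., check=True)` executes it; building the list first keeps the command easy to log and review.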
Compliance and Data Security
- Encrypt data at rest and in transit.
- Use IAM policies to restrict data and resource access.
- Compliance: Follow data protection laws including GDPR and CCPA.
Deletion of Dataproc Cluster
- To save money, delete the Dataproc cluster once data processing has finished.
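One way to make sure the cluster is always deleted, even when a job fails, is to wrap its lifetime in a context manager. This is a sketch under assumptions: the names are hypothetical, the worker count and machine type are illustrative defaults, and the commands are the standard `gcloud dataproc clusters` ones.

```python
import subprocess
from contextlib import contextmanager


def create_cluster_command(name, region, num_workers=2, worker_machine_type="n2-standard-4"):
    """Build the argv for creating a cluster with a given node count and type."""
    return [
        "gcloud", "dataproc", "clusters", "create", name,
        f"--region={region}",
        f"--num-workers={num_workers}",
        f"--worker-machine-type={worker_machine_type}",
    ]


def delete_cluster_command(name, region):
    """Build the argv for deleting a cluster without an interactive prompt."""
    return ["gcloud", "dataproc", "clusters", "delete", name, f"--region={region}", "--quiet"]


@contextmanager
def ephemeral_cluster(name, region, run=subprocess.run):
    """Create a cluster, hand control to the caller, then delete it no matter what."""
    run(create_cluster_command(name, region), check=True)
    try:
        yield name
    finally:
        # Runs on success and on failure, so the cluster never lingers.
        run(delete_cluster_command(name, region), check=True)
```

The `run` parameter defaults to `subprocess.run` and can be replaced with a stub, which makes the lifecycle logic testable without touching a real project.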
Best Practices
- Always mask or anonymize PII data when processing.
- Track PII data access and changes with comprehensive logging and monitoring.
- Regularly audit data access and processing for compliance.
- Data minimization: Process just the PII data you need.
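Masking and data minimization can be sketched with the Python standard library. The field names and helper functions below are hypothetical. Note that a keyed hash is pseudonymization rather than full anonymization: anyone who holds the key can recompute the same token, so the key needs the same protection as the data.

```python
import hashlib
import hmac

# Fields the signal-strength analysis actually needs (data minimization).
NEEDED_FIELDS = ("signal_strength", "location")


def pseudonymize(value, secret_key):
    """Replace a value with a keyed SHA-256 digest so the raw value is not exposed."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_record(record, secret_key, pii_fields=("customer_name",)):
    """Return a copy of the record with each PII field replaced by its digest."""
    masked = dict(record)
    for field in pii_fields:
        masked[field] = pseudonymize(str(masked[field]), secret_key)
    return masked


def minimize(record, needed=NEEDED_FIELDS):
    """Keep only the fields the downstream analysis requires."""
    return {field: record[field] for field in needed}
```

For the signal-strength analysis in this post, `minimize` alone is enough, because the aggregate never needs a customer identifier. `mask_record` suits cases where records must still be joined per customer without revealing who the customer is.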
Conclusion
Processing PII with Dataproc on Google Distributed Cloud requires careful design and execution to maintain data protection and compliance. Follow the steps and best practices above to use Dataproc for data processing while protecting sensitive data.
Dataproc
Dataproc is a managed, scalable service for running Apache Hadoop, Spark, Flink, Presto, and more than 30 other open source tools and frameworks. Use Dataproc for data lake modernisation, ETL, and secure data science at scale, integrated with Google Cloud, at a significantly lower cost.
ADVANTAGES
Bring your open source data processing up to date.
Serverless deployment, logging, and monitoring let you focus on your data and analytics rather than on your infrastructure. Reduce the TCO of Apache Spark management by up to 54%. Build and train models five times faster.
OSS for data science that is seamless and intelligent
Native integrations with BigQuery, Dataplex, Vertex AI, and OSS notebooks such as JupyterLab let data scientists and analysts carry out data science tasks with ease.
Google Cloud integration with enterprise security
Features for security include OS Login, customer-managed encryption keys (CMEK), VPC Service Controls, and default at-rest encryption. Add a security setting to enable Hadoop Secure Mode using Kerberos.
Important characteristics
Completely automated and managed open-source big data applications
Serverless deployment, logging, and monitoring let you focus on your data and analytics rather than on your infrastructure. Reduce the TCO of Apache Spark management by up to 54%. Integrate with Vertex AI Workbench so that data scientists and engineers can build and train models 5X faster than with standard notebooks. The Dataproc Jobs API makes it simple to integrate big data processing into custom applications, while Dataproc Metastore removes the need to run your own Hive metastore or catalogue service.
Use Kubernetes to containerise Apache Spark jobs
Build your Apache Spark jobs with Dataproc on Kubernetes so that you can use Dataproc with Google Kubernetes Engine (GKE) to provide job portability and isolation.
Google Cloud integration with enterprise security
When you create a Dataproc cluster, you can enable Hadoop Secure Mode via Kerberos by adding a Security Configuration. In addition, some of the Google Cloud-specific security features most commonly used with Dataproc are default at-rest encryption, OS Login, VPC Service Controls, and customer-managed encryption keys (CMEK).
The best of Google Cloud combined with the finest of open source
More than 30 open source frameworks, including Apache Hadoop, Spark, Flink, and Presto, are supported by the managed, scalable Dataproc service. Simultaneously, Dataproc offers native integration with the whole Google Cloud database, analytics, and artificial intelligence ecosystem. Building data applications and linking Dataproc to BigQuery, Vertex AI, Spanner, Pub/Sub, or Data Fusion is a breeze for data scientists and developers.