Ensuring Data Safety: BigQuery Cross-Region Replication
One of the most important factors in building a robust cloud data lake architecture is geographic redundancy. Customers may want to replicate data across regions for a variety of reasons, including low-latency reads (keeping data close to end users), regulatory compliance, colocating data with other services, and maintaining data redundancy for business-critical applications.
Within a dataset's region, BigQuery already keeps copies of your data in two distinct Google Cloud zones. In all regions, zone-to-zone replication is performed using synchronous dual writes. This guarantees that in the event of a zone failure, whether soft (power outage, network partition) or hard (flood, earthquake, hurricane), there is no data loss and you are back online almost instantly.
With the preview of cross-region dataset replication, Google is taking this further by making it simple to replicate any dataset, including ongoing changes, across cloud regions. Cross-region replication is useful beyond continuous replication: it can also be used to migrate BigQuery datasets from a source region to a destination region.
How does it work?
For cross-region replication, BigQuery uses the following primary and secondary configurations:
Primary region: When you create a dataset, BigQuery designates the chosen region as the location of the primary replica.
Secondary region: When you add a dataset replica in a region you designate, BigQuery treats it as a secondary replica. You can choose any region as the secondary region, and a dataset can have multiple secondary replicas.
The primary replica is writable, while secondary replicas are read-only. Writes to the primary replica are replicated asynchronously to the secondary replicas. Within each region, the data is stored redundantly across two zones, and replication traffic never leaves the Google Cloud network.
Replicas share the same name across regions, which means your queries don't need to change when they reference a replica in a different region.
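For example, a query like the following (my_table is a hypothetical table used for illustration) runs unchanged whether the job location is the primary region or a secondary region:

-- The same query text works against any replica; only the job
-- location, set via query settings or the API, differs.
SELECT COUNT(*) FROM my_dataset.my_table;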
Replication in action
You can set up replication for your BigQuery datasets as follows.
Create a replica of a dataset
To replicate a dataset, use the ALTER SCHEMA ADD REPLICA DDL statement.
You can add a single replica to any dataset within a region or multi-region. After you add a replica, the initial copy operation takes some time to complete. While the data is being replicated, you can continue to run queries against the primary replica with no reduction in query processing capacity.
-- Create the primary replica in the primary region.
CREATE SCHEMA my_dataset OPTIONS(location='us-west1');

-- Create a replica in the secondary region.
ALTER SCHEMA my_dataset
ADD REPLICA `us-east1`
OPTIONS(location='us-east1');
In the INFORMATION_SCHEMA.SCHEMATA_REPLICAS view, you can query the creation_complete column to verify that the secondary replica has been created successfully.
-- Check the status of the replica in the secondary region.
SELECT creation_time, schema_name, replica_name, creation_complete
FROM `region-us-west1`.INFORMATION_SCHEMA.SCHEMATA_REPLICAS
WHERE schema_name = 'my_dataset';
Query the secondary replica
Once the initial creation is complete, you can run read-only queries against a secondary replica. To do so, set the job location to the secondary region in the query settings or through the BigQuery API. If you don't specify a location, BigQuery automatically routes your queries to the location of the primary replica.
If you use BigQuery capacity reservations, you must have a reservation in the secondary replica's location; otherwise, BigQuery processes your queries using the on-demand model.
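As a sketch, a reservation in the secondary region can be created with the reservation DDL along these lines (the admin-project name and slot count are illustrative placeholders):

-- Create a slot reservation in the secondary region
-- (admin-project and the slot count are placeholders).
CREATE RESERVATION `admin-project.region-us-east1.my-reservation`
OPTIONS (slot_capacity = 100);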
Promote the secondary replica to primary
To promote a replica to be the primary replica, use the ALTER SCHEMA SET OPTIONS DDL statement and set the primary_replica option. You must explicitly set the job location to the secondary region in the query settings.
ALTER SCHEMA my_dataset SET OPTIONS(primary_replica = 'us-east1');
After a short delay, the secondary replica becomes the primary, and you can run both read and write operations in the new location. Conversely, the former primary becomes a secondary replica and supports only read operations.
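To confirm the promotion, a simple write run with the job location set to the new primary region should now succeed (my_table is a hypothetical table):

-- Writes now succeed in us-east1, the new primary region.
INSERT INTO my_dataset.my_table (id) VALUES (1);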
Remove a dataset replica
To remove a replica and stop replicating the dataset, use the ALTER SCHEMA DROP REPLICA DDL statement. If you are using replication to migrate data between regions, drop the replica after the secondary has been promoted to primary. This step is optional, but it is useful if you don't need a dataset replica beyond the migration itself.
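Continuing the example above, after promoting us-east1 you could drop the remaining replica in the original us-west1 region:

-- Stop replicating to the original region after migration.
ALTER SCHEMA my_dataset DROP REPLICA `us-west1`;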
Getting started
Google is excited to announce that BigQuery now supports cross-region replication in preview, enabling you to improve geo-redundancy and support region-migration use cases. Looking ahead, Google plans to add a console-based user interface for configuring and managing replicas, as well as a cross-region disaster recovery (DR) capability that extends cross-region replication to protect your workloads in the unlikely event of a total regional outage. The BigQuery cross-region dataset replication quickstart has more information about BigQuery and cross-region replication.