Bigtable is a popular and widely used key-value database on Google Cloud. The service offers an SLA of 99.999% availability, elastic scalability to very large workloads, cost effectiveness, and strong performance. These characteristics have led to widespread adoption, and customers now trust Bigtable to run a wide range of mission-critical workloads.
Bigtable has been used in continuous production at Google for more than 15 years, handling over 10 exabytes of data and processing over 6 billion queries per second at its peak.
YouTube, one of the biggest streaming platforms in the world with millions of producers and billions of consumers, is a Bigtable user within Google. Bigtable is used by YouTube for a variety of use cases, including enabling advertising functionality, tracking metrics like view counts, and helping users discover new material based on what they’ve already viewed.
In this article, we examine the usage of Bigtable by YouTube as a crucial part of a bigger data architecture that powers its reporting dashboards and analytics.
Data Warehouse for YouTube
YouTube captures metadata about its core entities, such as videos, channels, and playlists. The metadata includes, among other things, the categories, titles, descriptions, and monetizability of the videos. The relationships between entities are recorded as well. For instance, a video may be uploaded by a single channel or owner, but several owners may claim ownership of the video’s asset rights.
The YouTube Data Warehouse stores and serves this dimensional metadata to power data pipelines and dashboards across YouTube. The warehouse is built on Bigtable. Here are a few examples of how the warehouse is used for analytics.
The Creator Analytics pipeline fills dashboards with viewing information for the content that millions of producers have produced. The pipeline obtains viewing information from YouTube service records with rigorous privacy measures. To provide functionality on the dashboard, the logs are processed, sessionized, and then enhanced with video properties and other types of metadata from the warehouse. The dashboard helps content producers determine how many views and impressions a certain video received by providing segment breakdowns, historical video patterns, and other information.
The Payments pipeline sends creators daily updates with their expected earnings, based on video viewing statistics. The pipeline obtains ownership claims, asset rights, and video metadata from the warehouse and determines how the revenue should be distributed among the owners.
The following are the warehouse’s operational requirements:
- Provide historical entity data versions.
- Support near-real-time ingestion from data sources.
- Permit querying of this information at the scale and throughput required for YouTube’s reporting needs.
- Follow YouTube’s guidelines for protecting user data privacy.
Let’s take a high-level look at the warehouse’s architecture before discussing how Bigtable helps YouTube meet these requirements.
Architecture
There are three major steps in the processing pipelines that feed the data warehouse.
In the first stage, data is read from upstream canonical sources, such as operational databases (for example, Spanner and other Bigtable databases), and written in raw form to the warehouse’s Bigtable database.
In the second stage, this raw data is read back from the warehouse, cleaned, and transformed before being written to the warehouse again. The transformations ensure that the data is consistent across sources, has a useful representation, and is simple to understand and use. This promotes standardised reporting across YouTube.
Consumers can then access the standardised data through regular batch dumps or real-time point lookups.
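The three stages above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: plain dictionaries stand in for Bigtable tables, the function names (`ingest`, `standardise`, `point_lookup`, `batch_dump`) and the source labels are hypothetical, and the cleaning rules are placeholders rather than YouTube’s actual transformations.

```python
def ingest(raw_table, source_name, records):
    """Stage 1: land records from one upstream source in raw form."""
    for key, record in records.items():
        # One row per entity; each source gets its own slot in the row.
        raw_table.setdefault(key, {})[source_name] = record


def standardise(raw_table):
    """Stage 2: clean and merge the raw per-source data into one view."""
    clean = {}
    for key, by_source in raw_table.items():
        merged = {}
        for source_name in sorted(by_source):
            for field, value in by_source[source_name].items():
                name = field.strip().lower()      # consistent field names
                if isinstance(value, str):
                    value = value.strip()         # trim stray whitespace
                merged[name] = value
        clean[key] = merged
    return clean


def point_lookup(clean_table, key):
    """Stage 3a: real-time lookup of a single entity."""
    return clean_table.get(key)


def batch_dump(clean_table):
    """Stage 3b: periodic dump of every entity, ordered by row key."""
    return sorted(clean_table.items())


raw = {}
ingest(raw, "spanner", {"pl1": {"Title ": " Mix "}})
ingest(raw, "bigtable", {"pl1": {"owner": "ch9"}})
clean = standardise(raw)
print(point_lookup(clean, "pl1"))
```

Keeping the raw and standardised data separate means the transformation step can be rerun as the cleaning rules evolve, without re-reading the upstream sources.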
Entity types (like playlists) and tables in the Bigtable database of the warehouse have a 1:1 mapping. Data from several upstream sources, encoded as protobufs, is included in an entity (for example, a single playlist and its properties), which is saved in a single row.
Dimensions are tracked in the warehouse in two separate ways. Standard dimension tables show the entity’s current state. For instance, a new row is added to the video entity table whenever a creator uploads a new video. If the creator updates the video’s title a week later, the existing row is updated to reflect the new title. If the creator deletes the video, the row is removed from the table.
Dimension tables with change tracking show how an object has changed over time. For instance, a new row is written if a creator uploads a new video. If a week passes and the video’s title is changed, the old row in the table is left there with a mark indicating it is no longer valid, and a new entry is produced with the updated title.
According to YouTube’s user data privacy protection regulations, if the author decides to delete the video, the row will be marked as deleted and removed from the warehouse. Through “as of” queries, change-tracked dimension tables offer point-in-time access to historical metadata. This is crucial for both backtesting (such as offline model assessment on past data) and data backfills (such as restating data owing to data quality concerns).
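As a rough illustration of the change-tracked behaviour described above, the Python sketch below keeps every version of an entity with a validity interval and answers “as of” lookups. The names (`ChangeTrackedTable`, `valid_from`, `valid_to`) are hypothetical, timestamps are plain integers assumed to arrive in increasing order, and the in-memory dictionary only stands in for a Bigtable table. It is not the warehouse’s actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Version:
    value: dict
    valid_from: int                  # when this version became current
    valid_to: Optional[int] = None   # None means the version is still current


class ChangeTrackedTable:
    def __init__(self):
        self._rows = {}              # row key -> list of Version, oldest first

    def upsert(self, key, value, ts):
        versions = self._rows.setdefault(key, [])
        if versions and versions[-1].valid_to is None:
            versions[-1].valid_to = ts   # mark the old version as no longer valid
        versions.append(Version(value, ts))

    def delete(self, key):
        # Assumption: a privacy-driven delete removes the whole history.
        self._rows.pop(key, None)

    def as_of(self, key, ts):
        """Return the version that was valid at time ts, or None."""
        for v in self._rows.get(key, []):
            if v.valid_from <= ts and (v.valid_to is None or ts < v.valid_to):
                return v.value
        return None


table = ChangeTrackedTable()
table.upsert("video1", {"title": "Old title"}, ts=100)
table.upsert("video1", {"title": "New title"}, ts=200)
print(table.as_of("video1", 150))   # the version that was current at ts=150
```

A backfill job would call `as_of` with the timestamp of the data being restated, so that it joins against the metadata as it was at that time rather than as it is today.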
Why Bigtable?
Bigtable is an excellent option for the YouTube Data Warehouse for a few reasons.
Flexible data models and schemas
Bigtable’s flexible data model makes it a good fit for use cases where we want to keep the cost of onboarding new data sources low. We want to be able to land raw data immediately and then develop a more suitable, standardised data model as we gain a better understanding of the semantics of the data. This makes the architecture and the teams more adaptable to a constantly changing environment.
Scale, price, and effectiveness
The warehouse serves as the foundation for the majority of YouTube’s reporting analytics and houses current and historical metadata about all of the company’s key entities. A scalable database with a low total cost of ownership is required to store and frequently process this enormous volume of data. Bigtable offers the best price/performance on the market. Its high read and write throughput per dollar is well suited to the batch analytics that run on the warehouse’s data.
Diverse downstream clients with various access patterns, latencies, and throughput demands use the warehouse. Bigtable provides the ability to link requests to priorities, enabling high-priority serving traffic and lower-priority analytics traffic to coexist without interfering with one another. Bigtable enables the warehouse to serve a diverse clientele with hybrid transactional and analytical processing (HTAP) demands.
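To illustrate the general idea of priority-tagged requests, here is a toy Python scheduler in which high-priority serving requests are always handled before lower-priority analytics requests. It is only a sketch of the concept: the `PriorityScheduler` class, the `HIGH` and `LOW` constants, and the request names are all hypothetical, and this does not describe how Bigtable implements request priorities internally.

```python
import heapq
import itertools

HIGH, LOW = 0, 1   # smaller number means higher priority


class PriorityScheduler:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # tie-breaker: keeps FIFO order per priority

    def submit(self, priority, request):
        heapq.heappush(self._heap, (priority, next(self._counter), request))

    def drain(self):
        """Return all queued requests in the order they would be served."""
        order = []
        while self._heap:
            _, _, request = heapq.heappop(self._heap)
            order.append(request)
        return order


scheduler = PriorityScheduler()
scheduler.submit(LOW, "batch-scan-1")
scheduler.submit(HIGH, "lookup-1")
scheduler.submit(LOW, "batch-scan-2")
scheduler.submit(HIGH, "lookup-2")
print(scheduler.drain())
```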
Change Data Capture
The warehouse uses Bigtable’s change streams feature. When source data changes, the related entity rows in Bigtable are invalidated. Streaming pipelines that consume the Bigtable change stream detect the invalidated entity rows and fetch the most recent data from the source(s). As a result, entities always have up-to-date metadata available for reporting.
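The invalidate-and-refresh flow can be sketched as follows. This is a simplified model rather than the Bigtable change streams API: the `Warehouse` class, its `change_log` list (a stand-in for a change stream), and `refresh_invalidated` are hypothetical names, and the upstream source is just a dictionary.

```python
class Warehouse:
    def __init__(self):
        self.rows = {}         # row key -> {"data": ..., "valid": bool}
        self.change_log = []   # stand-in for a change stream: ordered row keys

    def write(self, key, data):
        self.rows[key] = {"data": data, "valid": True}

    def invalidate(self, key):
        """Mark a row stale and record the change for downstream consumers."""
        if key in self.rows:
            self.rows[key]["valid"] = False
            self.change_log.append(key)


def refresh_invalidated(warehouse, source, cursor=0):
    """Consume change-log entries from `cursor` and re-fetch stale rows.

    Returns the new cursor so the next call resumes where this one stopped.
    """
    for key in warehouse.change_log[cursor:]:
        row = warehouse.rows.get(key)
        if row is not None and not row["valid"]:
            warehouse.write(key, source[key])   # pull the latest data from the source
    return len(warehouse.change_log)


source = {"video1": {"title": "First title"}}
warehouse = Warehouse()
warehouse.write("video1", source["video1"])

source["video1"] = {"title": "Updated title"}   # the upstream source changes
warehouse.invalidate("video1")                  # the warehouse row is marked stale
cursor = refresh_invalidated(warehouse, source)
print(warehouse.rows["video1"])
```

Tracking a cursor lets the consumer process each change once and pick up from the same position after a restart.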
Conclusion
Bigtable provides inexpensive storage with top performance for operational analytics tasks like those handled by the YouTube Data Warehouse. Because of its adaptable data format, it is easier to include new data sources into the warehouse. Data can be immediately landed in its unstructured state, and as we learn more about the semantics of the data, it can progressively become more organised. Such iterative methods of data modelling increase the adaptability and agility of an organisation.