Sunday, June 16, 2024

Dataflow’s Secret to No Work Items Left Unturned


Dataflow, Google Cloud’s fully managed streaming analytics service, has a long track record of complete and accurate processing. In practical terms, that means executing work exactly once and not leaving any shards behind. Today, Google Cloud is releasing three new features: straggler detection, hot key detection, and slow worker auto-mitigation.

What are stragglers, and how does Dataflow help with them?

Stragglers are work items that take much longer to finish than typical work items of the same kind. Numerous factors can cause this, including hardware differences, network issues, or uneven data distribution. In data processing pipelines, stragglers can lead to higher latency, lower throughput, and higher costs.


Dataflow helps users handle stragglers in three ways:

In-context tooling for comprehensive diagnosis

To assist users in identifying and diagnosing stragglers, Dataflow offers an array of observability tools.

Root-cause analysis

To find the source of an issue, Dataflow analyses stragglers. This information can be used to avoid stragglers in the future.

Auto-mitigation

Dataflow uses proactive load balancing and unhealthy node repair to automatically prevent and mitigate stragglers whenever possible.

Together, these approaches help users reduce the effect that stragglers have on their pipelines.


Handling stragglers: In-context observability

In-context observability is Dataflow’s first strategy for handling stragglers. Dataflow helps users determine whether stragglers are present and, if so, which worker they are occurring on. Users can then drill deeper into the pipeline to quickly identify and address the straggler.

Before going further, it’s useful to review a few key terms:

  • Step: A read, write, or transformation defined in user code
  • Fusion: Dataflow’s method of combining several steps or transformations to streamline a pipeline
  • Stage: The unit of fused steps in a Dataflow pipeline
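To make these terms concrete, here is a toy sketch in plain Python. It is not Dataflow’s implementation: the step functions and the fuse helper are hypothetical names used only for illustration. It shows the general idea of fusion, which is to chain several per-element steps into a single stage so that each element flows straight through without an intermediate collection being materialised.

```python
# Toy illustration only: 'parse', 'double', and 'fuse' are hypothetical names,
# not part of any Dataflow or Apache Beam API.

def parse(line):
    """Step 1: turn a text line into an integer."""
    return int(line)


def double(n):
    """Step 2: a simple per-element transformation."""
    return n * 2


def fuse(*steps):
    """Combine several per-element steps into one stage function."""
    def stage(element):
        for step in steps:
            element = step(element)
        return element
    return stage


# One fused stage made of two steps.
stage = fuse(parse, double)
results = [stage(line) for line in ["1", "2", "3"]]
```

Because a fused stage is scheduled as one unit, a slow step inside it slows down every work item in that stage, which is why stragglers are reported per stage.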

New: Straggler detection

Dataflow’s new straggler detection feature lets customers see when their pipelines contain stragglers.

Batch detection

In batch jobs, a work item is deemed a straggler if all three of the following conditions are met:

  • It takes an order of magnitude longer to complete than other work items in the same stage.
  • Parallelism within the stage is reduced.
  • It blocks new work from starting.

Streaming detection

In streaming jobs, a work item is deemed a straggler if all three of the following conditions are met:

  • The work item is in a stage where the watermark lag is greater than ten minutes.
  • The work item has been processing for more than five minutes.
  • The work item has been processing for more than 1.5 times as long as the typical work item in that stage.
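The three streaming conditions can be read as a single predicate. The sketch below is only an illustration of that reading: the WorkItemStats class, its field names, and the function name are assumptions made for this example and do not correspond to any Dataflow API. Only the three thresholds come from the criteria above.

```python
from dataclasses import dataclass
from datetime import timedelta

# Thresholds taken from the streaming criteria listed above.
WATERMARK_LAG_THRESHOLD = timedelta(minutes=10)
MIN_PROCESSING_TIME = timedelta(minutes=5)
SLOWDOWN_FACTOR = 1.5


@dataclass
class WorkItemStats:
    """Hypothetical container for the measurements the criteria refer to."""
    stage_watermark_lag: timedelta        # watermark lag of the item's stage
    processing_time: timedelta            # how long this item has been processing
    stage_avg_processing_time: timedelta  # typical processing time in the stage


def is_streaming_straggler(stats: WorkItemStats) -> bool:
    """Return True only if all three conditions hold at once."""
    return (
        stats.stage_watermark_lag > WATERMARK_LAG_THRESHOLD
        and stats.processing_time > MIN_PROCESSING_TIME
        and stats.processing_time > SLOWDOWN_FACTOR * stats.stage_avg_processing_time
    )


slow_item = WorkItemStats(
    stage_watermark_lag=timedelta(minutes=12),
    processing_time=timedelta(minutes=8),
    stage_avg_processing_time=timedelta(minutes=4),
)
ordinary_item = WorkItemStats(
    stage_watermark_lag=timedelta(minutes=12),
    processing_time=timedelta(minutes=8),
    stage_avg_processing_time=timedelta(minutes=6),
)
```

Here slow_item satisfies all three conditions, while ordinary_item fails the last one: eight minutes is not more than 1.5 times the six-minute stage average.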

Detected stragglers are displayed in the user interface (UI) under the Execution Details tab for both batch and streaming jobs. You can filter your logs to show only the worker on which the straggler was found, limited to the time range of the incident.

Handling stragglers: Root-cause analysis

When a straggler is found, Dataflow looks for likely causes as soon as it can. Because stragglers can have many different causes, identifying the root cause reduces the time spent debugging and lets you concentrate on fixing the problem.

New: Hot key detection

A hot key is a key that represents a disproportionately large number of elements compared to other keys in the same PCollection; in other words, the input data is skewed. Because hot keys create long sequences of work within a parallel job, they can hinder Dataflow’s ability to process elements in parallel.

Dataflow automatically detects when a hot key occurs. Additionally, when a job is run with the hot key logging pipeline option enabled, Dataflow logs the detected key to aid in debugging.

Performance levers like scaling your workers vertically or horizontally usually cannot fully resolve hot keys. Rather, resolving hot keys frequently requires analysing the stages in your pipeline and focusing on the places where keys could become disproportionately distributed.

One example of a remediation is to use a shuffle. If a dataset has an uneven distribution around a key, a user can add an additional sharding key. This increases parallelism by further dividing the data into smaller portions.
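The sharding-key idea can be sketched in plain Python without any pipeline framework. This is a minimal illustration under stated assumptions, not Dataflow or Apache Beam code: NUM_SHARDS, salt_key, and two_stage_count are hypothetical names, and the shard is chosen round-robin from the element’s position so that the example is deterministic. The hot key is first spread across several (key, shard) groups, each group is aggregated on its own, and the partial results are then merged per original key.

```python
from collections import Counter

NUM_SHARDS = 4  # hypothetical fan-out; a real value depends on the skew


def salt_key(key, element_index, num_shards=NUM_SHARDS):
    """Attach an extra sharding component to the key (round-robin here)."""
    return (key, element_index % num_shards)


def two_stage_count(records, num_shards=NUM_SHARDS):
    """Count records per key in two stages: per (key, shard), then per key."""
    # Stage 1: aggregate per salted key, so one hot key becomes several groups.
    partial = Counter()
    for i, (key, _value) in enumerate(records):
        partial[salt_key(key, i, num_shards)] += 1

    # Stage 2: drop the shard component and merge the partial counts.
    totals = Counter()
    for (key, _shard), count in partial.items():
        totals[key] += count
    return partial, totals


# A skewed dataset: 1000 elements under "hot", 10 under "cold".
records = [("hot", n) for n in range(1000)] + [("cold", n) for n in range(10)]
partial, totals = two_stage_count(records)
largest_group = max(partial.values())
```

Without the extra sharding key, a single group would hold all 1000 "hot" elements. With four shards, the largest group in the first stage holds 250, and the final totals are unchanged.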


Handling stragglers: Auto-mitigation

Auto-mitigation is your strongest line of defence against stragglers in Dataflow. It proactively limits the impact of stragglers on your pipeline and doesn’t require your involvement. As a fully managed service, Dataflow strives to offer “zero-knobs” solutions whenever feasible, freeing you from the burden of overseeing the underlying infrastructure so you can concentrate on your business logic.

Dynamic work rebalancing

Dataflow’s dynamic work rebalancing feature helps prevent stragglers by identifying work that is falling behind and reassigning it to other workers. It keeps track of each worker’s progress in the pipeline and hands work to the workers that finish more quickly. This helps ensure that all workers make progress at roughly the same pace.
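As a rough intuition for reassigning work, consider a work item that covers a range of input positions. The sketch below is a toy model, not Dataflow’s actual algorithm, which this article does not describe in detail; split_remaining and its arguments are hypothetical. It splits the unprocessed part of a slow worker’s range in half so that the second half can be handed to another worker.

```python
def split_remaining(start, end, position):
    """Split the unprocessed part [position, end) of a work item in half.

    Returns (kept_range, handed_off_range). If fewer than two positions
    remain, nothing is handed off and the second value is None.
    """
    remaining = end - position
    if remaining < 2:
        return (start, end), None
    split_at = position + remaining // 2
    return (start, split_at), (split_at, end)


# A slow worker has processed positions 0..19 of the range [0, 100).
kept, handed_off = split_remaining(start=0, end=100, position=20)
```

In this example the slow worker keeps [0, 60) and the range [60, 100) becomes a new work item for a faster or idle worker, so the tail of the original item no longer depends on a single worker.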

New: Slow worker auto-mitigation

Dataflow automatically mitigates the effects of slow workers. Worker slowness can be caused by a variety of circumstances, including CPU starvation, thrashing, system architecture issues, and stuck worker processes. A slow worker processes work items far more slowly than it should, which ultimately leads to a straggler.

When the auto-mitigation feature detects a slow worker, it simulates a host maintenance event. This triggers the worker’s host maintenance policy, which results in a live migration, a restart, or a stop. In the case of a live migration, the work being processed on the slow worker is transferred smoothly to a new worker, preserving all in-flight data and progress.

Thota Nithya
Thota Nithya has been writing cloud computing articles for govindhtech since April 2023. She is a science graduate and a cloud computing enthusiast.

