Overview
Apache Airflow is a popular solution for orchestrating data operations. Cloud Composer, Google Cloud's fully managed workflow orchestration service built on Apache Airflow, lets you author, schedule, and monitor pipelines. When using Cloud Composer, a strong logging and alerting setup is critical for keeping an eye on your directed acyclic graphs (DAGs) and reducing downtime in your data pipelines.
In this article, we will walk through the Cloud Composer alerting hierarchy on Google Cloud, along with the alerting options available to engineers using Cloud Composer and Apache Airflow.
Getting started
Alerting hierarchy on Cloud Composer
Cloud Composer environments
Cloud Composer environments are self-contained Airflow deployments built on Google Kubernetes Engine (GKE). They interface with other Google Cloud services through Airflow's built-in connectors.
Cloud Composer provisions every Airflow component and the Google Cloud services that power your workflows. The essential elements of an environment are the GKE cluster, the Airflow web server, the Airflow database, and a Cloud Storage bucket.
At this level, alerts focus on the health and performance of the cluster and the Airflow components.
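For example, a Cloud Monitoring alerting policy on overall environment health might filter on the environment health metric. The sketch below is illustrative only; the metric type is an assumption drawn from Composer's monitoring metrics, so confirm it in Metrics Explorer for your environment:

```
resource.type = "cloud_composer_environment"
metric.type = "composer.googleapis.com/environment/healthy"
```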
Airflow DAG runs
A DAG run is an object that represents an instantiation of a DAG at a specific point in time. Each time the DAG executes, a DAG run is created and all of the tasks inside it are carried out; the states of those tasks determine the status of the DAG run. Because each DAG run executes independently of the others, many DAG runs can occur concurrently.
At this level, the main alert types are SLA misses and DAG run state changes such as success and failure. Airflow's callback feature can trigger code to deliver these alerts, as in the sketch below.
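As a minimal sketch of that callback mechanism (the DAG ID and function names here are illustrative, not from any official example), DAG-level callbacks are registered when the DAG is defined:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

def notify_dag_failure(context):
    # Airflow passes a context dict; "dag_run" holds the run that failed.
    dag_run = context["dag_run"]
    print(f"DAG run {dag_run.run_id} of {dag_run.dag_id} failed")

def notify_dag_success(context):
    print(f"DAG run {context['dag_run'].run_id} succeeded")

with DAG(
    dag_id="dag_callback_demo",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    on_failure_callback=notify_dag_failure,  # fires when the DAG run fails
    on_success_callback=notify_dag_success,  # fires when the DAG run succeeds
) as dag:
    EmptyOperator(task_id="placeholder")
```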
Airflow task instances
A task is the fundamental unit of execution in Airflow. Tasks are arranged into DAGs, with upstream and downstream dependencies established between them to specify the order in which they run. Airflow tasks include operators and sensors.
Just like Airflow DAG runs, Airflow tasks can use callbacks to trigger code that sends alerts, for example:
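A minimal sketch of a task-level failure callback (names are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

def notify_task_failure(context):
    # The context dict includes the task instance that failed.
    ti = context["task_instance"]
    print(f"Task {ti.task_id} in DAG {ti.dag_id} failed")

with DAG(
    dag_id="task_callback_demo",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    BashOperator(
        task_id="flaky_step",
        bash_command="exit 1",  # deliberately fails to exercise the callback
        on_failure_callback=notify_task_failure,
    )
```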
In summary
To summarise the Airflow alerting hierarchy on Google Cloud: Cloud Composer service → Cloud Composer environment → worker components → DAG run → Airflow task instance.
Every production-level Cloud Composer implementation should be able to monitor and alert at every level of this hierarchy. Google's Cloud Composer engineering team has published a wealth of material on service- and environment-level monitoring and alerting.
Alerts for Airflow on Google Cloud
Let's now look at three options for alerting at the Airflow DAG run and task levels.
Option 1: Log-based alerting policies
Google Cloud provides built-in logging and alerting capabilities for your Airflow system. Cloud Logging centralises logs from multiple sources, including Airflow, while Cloud Monitoring lets you build alerting policies based on specific log entries or metric thresholds.
An alerting policy can be set up to notify you each time a particular message appears in your included logs. For instance, you can be alerted whenever a specific data-access message is recorded in an audit log, so you know the moment it happens. These are called log-based alerting policies.
These services pair well with Airflow's callback functionality described earlier. To wire this up:
- Define a callback function at the task or DAG level.
- Inside the callback, write a distinctive log message to Cloud Logging using Python's built-in logging package.
- Create a log-based alerting policy that is triggered by that log message and notifies a notification channel (see the example filter after this list).
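As a hedged sketch, the log-based alerting policy for the DAG-failure message described below might match a Logs Explorer filter along these lines (the exact resource labels depend on your environment):

```
resource.type="cloud_composer_environment"
textPayload:"Airflow DAG Failure:"
```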
Advantages and disadvantages
Advantages:
- Simple, lightweight setup: no need for extra Airflow providers, email servers, or third-party components
- Integration with Logs Explorer and log-based metrics for deeper insights and historical analysis
- Multiple options for notification channels
Disadvantages:
- Email notifications carry limited detail.
- Setup effort and a learning curve for log sinks and alerting policies
- Costs associated with Cloud Logging and Cloud Monitoring
The example below sketches this pattern. The DAG uses a PythonOperator to miss a specified SLA and then raise an AirflowException. When the DAG run enters a failed state or misses its SLA, the log_on_dag_failure and log_on_sla_miss callbacks fire, each logging an exact message string: “Airflow DAG Failure:” and “Airflow SLA Miss:” respectively. These are the messages that the log-based alerting policy matches in order to send a notification to the designated channel.
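Here is a minimal sketch of such a DAG, reconstructed from the description above; the DAG ID, schedule, and sleep duration are illustrative assumptions:

```python
import logging
import time
from datetime import datetime, timedelta

from airflow import DAG
from airflow.exceptions import AirflowException
from airflow.operators.python import PythonOperator

logger = logging.getLogger(__name__)

def log_on_dag_failure(context):
    # DAG-level failure callback: logs the exact string that the
    # log-based alerting policy matches on.
    logger.error(f"Airflow DAG Failure: {context['dag'].dag_id}")

def log_on_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # SLA miss callback; note its signature differs from other callbacks.
    logger.error(f"Airflow SLA Miss: {dag.dag_id}")

def sleep_then_fail():
    # Sleep past the one-minute SLA, then fail, so both callbacks fire.
    time.sleep(120)
    raise AirflowException("Deliberate failure to demonstrate alerting")

with DAG(
    dag_id="log_based_alerting_demo",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    on_failure_callback=log_on_dag_failure,
    sla_miss_callback=log_on_sla_miss,
) as dag:
    PythonOperator(
        task_id="sleep_then_fail",
        python_callable=sleep_then_fail,
        sla=timedelta(minutes=1),
    )
```

Note that Airflow only evaluates SLAs for scheduled runs, so the SLA-miss path is exercised by the scheduler rather than by manually triggered runs.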
Option 2: Email alerts with SendGrid
SendGrid, an SMTP service provider, is the preferred email notification solution for Cloud Composer.
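As a hedged sketch of the classic setup: once the environment routes email through SendGrid, Airflow's built-in email_on_failure mechanism (or the EmailOperator) sends the alerts. The configuration override, environment variables, and recipient below are assumptions to verify against the Cloud Composer documentation for your version:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Assumes the environment is configured to send mail via SendGrid, e.g. the
# Airflow config override
#   [email] email_backend = airflow.providers.sendgrid.utils.emailer.send_email
# plus the SENDGRID_API_KEY and SENDGRID_MAIL_FROM environment variables.

default_args = {
    "email": ["data-team@example.com"],  # illustrative recipient list
    "email_on_failure": True,            # Airflow emails automatically on task failure
    "email_on_retry": False,
}

with DAG(
    dag_id="sendgrid_alert_demo",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    BashOperator(
        task_id="might_fail",
        bash_command="exit 1",  # fails, triggering the failure email
    )
```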
Advantages and disadvantages
Advantages:
- A widely adopted and reliable notification method
- Detailed emails with structured log excerpts for review
- Utilises the native Airflow EmailOperator
- Flexible per-task recipient lists
Disadvantages:
- Can be overwhelming when alert volume is high.
- Requires an external email service (SendGrid) and email template setup.
- Emails can get lost in inboxes if poorly filtered or prioritised.
- Costs associated with SendGrid
Option 3: Third-party tools such as Slack and PagerDuty
Since Airflow is open source, you can also hand alerting and notifications off to third-party providers such as Slack or PagerDuty, as in the sketch below.
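As one hedged example, assuming the apache-airflow-providers-slack package is installed and an incoming-webhook connection named slack_default exists (both are assumptions), a task-level failure callback might post to a channel like this; the hook's constructor and send method have changed across provider versions, so check yours:

```python
from airflow.providers.slack.hooks.slack_webhook import SlackWebhookHook

def slack_alert_on_failure(context):
    # Posts a failure notice to Slack via the incoming-webhook connection.
    ti = context["task_instance"]
    hook = SlackWebhookHook(slack_webhook_conn_id="slack_default")
    hook.send(text=f":red_circle: Task {ti.task_id} in DAG {ti.dag_id} failed.")
```

Attach slack_alert_on_failure as on_failure_callback on a task (or in default_args), exactly as in the earlier callback examples.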
Advantages and disadvantages
Advantages:
- Real-time notifications through a dependable communication channel
- Customisable formatting and the ability to deliver messages to specific channels or individuals
- Third-party options fit into your team's existing communication workflow. Alerts can be discussed directly, keeping context and resolution in one place, which enables faster troubleshooting and knowledge sharing than isolated emails or log entries.
Disadvantages:
- Requires setting up an API token, a webhook, and a third-party workspace.
- Needs additional Airflow connections to be managed.
- Could result in notification fatigue if not used carefully.
- Potential security risks if the API token or webhook is compromised
- Potentially limited long-term retention of alert history in third-party message logs
- Costs associated with third-party tools
Recommendations and next steps
After weighing the benefits and drawbacks, Google Cloud recommends log-based alerting policies (Option 1) for Airflow alerting in production environments. This approach provides flexible notification channels, straightforward threshold-based alerting, scalable log collection, metric exploration, and seamless integration with other Google Cloud services. Because logging integrates natively with Cloud Composer, it is simple to use and requires no additional provider packages.
By building logging and alerting into your Airflow DAGs, you can make full use of Google Cloud and proactively monitor your data pipelines.