Friday, March 28, 2025

The Triangle System to Optimise AIOps Incident Management

Improving incident management

To determine how an incident should be handled and by whom, designated responsible individuals (DRIs) are tasked with investigating incoming incidents. This process becomes more complicated as the product portfolio grows, because an incident reported against one service may not originate there: the root cause could lie in any of several dependent services.

With hundreds of services in Azure, it is practically impossible for one person to be an expert in every area. This limits the effectiveness of manual diagnosis and leads to redundant assignments and a longer Time to Mitigate (TTM). In this blog, we’ll explore how the Triangle System, generative AI, and large language models enable us to use automation and feedback loops for more effective incident management.

As large language models (LLMs) get better at reasoning, AI agents are maturing and can now articulate each step of their reasoning. LLMs have typically been used for generative tasks, such as summarisation, without drawing on their reasoning abilities for practical decision-making. Seeing a need for this capability, Microsoft developed AI agents that make the initial assignment decisions for incidents, saving time and cutting down on redundancy.

Because these agents use LLMs as their brains, they can think, reason, and use tools to carry out tasks on their own. Improved reasoning models have lifted earlier limits on agents’ capacity to “think” holistically, so agents can now plan more effectively. By speeding up incident resolution, this approach is expected to increase efficiency and improve the overall user experience.

The Triangle System

The Triangle System is an incident triage system built on AI agents. Each AI agent represents the engineers of a particular team and is encoded with that team’s domain expertise so that it can triage incidents. Its two main capabilities are Local Triage and Global Triage.

Local Triage System

The Local Triage System is a single-agent framework in which each team is represented by one agent. Drawing on past incidents and existing troubleshooting guides (TSGs), each agent makes a binary decision on its team’s behalf: accept or reject an incoming incident. TSGs are sets of guidelines that engineers write for troubleshooting common issue patterns. These TSGs are used to train the agent to accept or reject incidents and to explain the rationale behind each decision. Based on the TSGs, the agent can also recommend the team to which an incident should be transferred.
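The accept-or-reject decision described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the real agents are LLM-based, whereas this sketch substitutes simple keyword rules for TSGs, and the names `TsgRule`, `TriageDecision`, and `local_triage` are hypothetical and do not come from Azure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TsgRule:
    """Hypothetical TSG entry: a symptom keyword and the team that owns it."""
    keyword: str
    owning_team: str


@dataclass
class TriageDecision:
    accept: bool                          # the binary decision
    rationale: str                        # explanation for the decision
    suggested_team: Optional[str] = None  # transfer recommendation, if any


def local_triage(team: str, incident_text: str, rules: list[TsgRule]) -> TriageDecision:
    """Accept or reject an incident for `team` using keyword rules (an LLM stand-in)."""
    text = incident_text.lower()
    for rule in rules:
        if rule.keyword.lower() in text:
            if rule.owning_team == team:
                return TriageDecision(
                    True, f"Matches TSG pattern '{rule.keyword}' owned by {team}.")
            # The pattern belongs to another team: reject and suggest a transfer.
            return TriageDecision(
                False,
                f"Matches TSG pattern '{rule.keyword}' owned by {rule.owning_team}.",
                rule.owning_team)
    # No rule applies: reject without a transfer suggestion.
    return TriageDecision(False, "No TSG pattern matched; left for an engineer to review.")
```

For example, with rules mapping "disk latency" to a storage team and "dns timeout" to a networking team, the storage agent would accept "High disk latency on node 7" and would reject "DNS timeout resolving endpoint" with a suggestion to transfer it to networking.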

As the figure below shows, the Local Triage System is triggered when an incident lands in a service team’s incident queue. Trained on past incidents and TSGs, the single agent uses Generative Pretrained Transformer (GPT) embeddings to capture the semantic meaning of words and sentences.

Local Triage system workflow (image credit: Microsoft Azure)

Semantic distillation is the extraction of semantic information from the incident that is directly relevant to triaging it. The single agent then accepts or rejects the incident. If it accepts, the agent explains why, and the incident is assigned to an engineer for review. If it rejects, the agent returns the incident to the originating team, transfers it to the team indicated by the TSG, or leaves it in the queue for an engineer to resolve.
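To make the role of embeddings concrete, here is a minimal sketch of similarity-based lookup against past incidents. It assumes a toy bag-of-words vector in place of GPT embeddings, and the function names (`embed`, `cosine`, `most_similar`) are illustrative only; the actual system's retrieval logic is not described in the source.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def most_similar(incident: str, history: list[tuple[str, bool]]) -> tuple[str, bool, float]:
    """Return the closest past incident, whether it was accepted, and the score."""
    query = embed(incident)
    text, accepted = max(history, key=lambda h: cosine(query, embed(h[0])))
    return text, accepted, cosine(query, embed(text))
```

Given a history of labelled incidents, a new incident that shares most of its wording with a previously accepted one would score highest against it, which is the kind of signal an agent could weigh alongside the TSGs. A real embedding model would also match incidents that are phrased differently but mean the same thing, which word counts cannot do.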

The Local Triage System has been running in production in Azure since mid-2024. As of January 2025, six teams are in production and more than fifteen are being onboarded. Early results are encouraging: agents reach 90% accuracy, and one team saw a 38% reduction in TTM, which greatly reduced the impact on customers.

Global Triage System

The Global Triage System aims to route each incident to the right team. It uses a multi-agent orchestrator to coordinate across all of the individual agents and determine which team should receive the incident. As the figure below illustrates, the orchestrator negotiates with each agent to identify candidate teams for the incoming incident, which further reduces TTM.

This approach resembles an emergency department, where a nurse quickly assesses each patient’s symptoms before referring them to the appropriate specialist. As the Global Triage System is developed further, the agents’ expertise and decision-making are expected to improve, raising developer productivity by reducing manual labour and improving the user experience by resolving customer issues promptly.

Global Triage system workflow (image credit: Azure)
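The orchestration idea in this section can be sketched as a single polling round. This simplifies the "negotiation" described above to one pass in which every team agent reports a confidence score and the orchestrator ranks the candidates. The `Agent` interface, `keyword_agent`, `global_triage`, and the 0.5 threshold are all assumptions made for illustration, not details of the Azure system.

```python
from typing import Callable

# Hypothetical agent interface: incident text in, confidence in [0, 1] out.
Agent = Callable[[str], float]


def keyword_agent(keywords: set[str]) -> Agent:
    """Build a toy agent whose confidence is the fraction of its keywords present."""
    def score(incident: str) -> float:
        words = set(incident.lower().split())
        return len(keywords & words) / len(keywords) if keywords else 0.0
    return score


def global_triage(incident: str, agents: dict[str, Agent],
                  threshold: float = 0.5) -> list[tuple[str, float]]:
    """Poll every team agent and return candidate teams, most confident first."""
    scored = [(team, agent(incident)) for team, agent in agents.items()]
    candidates = [(team, s) for team, s in scored if s >= threshold]
    return sorted(candidates, key=lambda c: c[1], reverse=True)
```

With a storage agent keyed on "disk" and "latency" and a networking agent keyed on "dns" and "timeout", the incident "disk latency spike after dns change" yields storage as the top candidate and networking as a weaker one, much like the triage nurse narrowing down which specialist to call first.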

Looking ahead

To strengthen the system, Microsoft Azure plans to expand coverage by adding agents from more teams, which will broaden the system’s knowledge base. Planned steps include:

  • Extend the incident triage system to all teams: Making the system available to every team should broaden its overall knowledge so that it can handle a wider range of issues. A unified approach to incident management should result in more efficient and consistent handling of incidents.
  • Optimise the LLMs to quickly detect and recommend fixes by correlating error logs with the specific code segments responsible: This would considerably speed up troubleshooting. It would let the system offer precise recommendations, reducing the time engineers spend debugging and speeding up the resolution of customer issues.
  • Expand automatic mitigation of known issues: An automated mechanism for mitigating known issues would lower TTM and improve customer satisfaction. It would also reduce the number of incidents that need manual intervention, freeing engineers to focus on delighting customers.
Drakshi
Since June 2023, Drakshi has been writing articles on Artificial Intelligence for govindhtech. She is a postgraduate in business administration and an Artificial Intelligence enthusiast.