Monday, May 27, 2024

Memory Leak Detection in Cloud Computing with RESIN

What is a memory leak in Cloud Services?

Memory leaks are a recurring problem in the rapidly changing cloud computing environment, impacting stability, performance, and ultimately, user experience. Memory leak monitoring is therefore crucial to the quality of cloud services. A memory leak occurs when allocated memory is unintentionally not freed in a timely manner. It can degrade the performance of the affected component and even crash the operating system (OS). Even worse, it frequently slows down or terminates other processes running on the same machine.
Memory leak detection has been the subject of numerous studies and solutions due to the significant impact of these issues. Traditional detection techniques fall into two types, static and dynamic. Static methods analyse software source code to infer possible leaks, while dynamic methods discover leaks by instrumenting a programme and tracking object references at runtime.
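
To make the dynamic approach concrete, here is a minimal sketch using Python's standard tracemalloc module. It is only an illustration of runtime allocation tracking in general, not how RESIN works; the `_cache` list and the `handle_request` function are hypothetical names invented for the example.

```python
import tracemalloc

_cache = []  # hypothetical module-level list that is never cleared (the "leak")


def handle_request(payload_size=1024):
    # Every call stores a new buffer in _cache, so the memory is never released.
    _cache.append(bytearray(payload_size))


def find_growth(iterations=1000, top=3):
    """Return the source lines whose allocated memory grew the most."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    for _ in range(iterations):
        handle_request()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # Group allocations by source line; the largest differences come first.
    return after.compare_to(before, "lineno")[:top]
```

Printing the entries returned by `find_growth()` should show the line inside `handle_request` at the top of the list. Tracking every allocation in this way is the kind of instrumentation that makes dynamic techniques costly at scale.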

Nevertheless, these traditional methods are insufficient for leak detection in cloud environments. The static techniques have limited scalability and accuracy, particularly for leaks resulting from cross-component contract violations, which require significant domain knowledge to capture statically. The dynamic techniques work better in cloud environments overall. They do, however, require extensive instrumentation and are intrusive. They also add considerable runtime overhead, which is expensive for cloud services.

What is RESIN in Cloud Infrastructure?

Microsoft Azure introduced RESIN, an end-to-end memory leak detection solution designed to address memory leaks in large cloud infrastructure comprehensively. Used in Microsoft Azure production environments, RESIN has proven to be an efficient leak detection tool with low overhead and good accuracy.

Workflow of the RESIN system

A large cloud infrastructure may consist of hundreds of software components owned by many different teams. Before RESIN, memory leak detection in Microsoft Azure was left to each individual team. For low overhead, high accuracy, and scalability, RESIN uses a centralised approach, shown in the figure below, that performs leak detection in multiple stages. This approach requires neither access to the components' source code nor extensive instrumentation or re-compilation.

Figure: RESIN workflow (image credit: Microsoft Azure)

Using monitoring agents, RESIN performs low-overhead monitoring to gather host-level memory telemetry data. A remote service aggregates and analyses data from different hosts using a bucketization-pivot scheme. When leakage is detected in a bucket, RESIN triggers an analysis of the process instances in that bucket. For highly suspicious leaks, RESIN performs live heap snapshotting and compares the results with regular heap snapshots stored in a reference database. Once multiple heap snapshots have been collected, RESIN runs a diagnosis algorithm to identify the root cause of the leak and produces a diagnosis report that is attached to the alert ticket to help developers with further analysis. Finally, RESIN automatically mitigates the leaking process.

Methods for detection

  • Memory leak detection presents particular difficulties in cloud infrastructure:
    • Memory leaks in production systems are typically fail-slow faults that could last days, weeks, or even months, and it can be challenging to capture gradual change over long periods of time in a timely manner.
    • At the scale of Azure global cloud, collecting fine-grained data over long periods of time is not practical.
    • Noisy memory usage caused by changing workloads and interference in the environment leads to noisy results when detection relies on static thresholds.

In order to overcome these difficulties, RESIN employs a two-level method for identifying memory leak symptoms: a global bucket-based pivot analysis finds suspect components, and a local per-process leak detection finds leaking processes.

In the bucket-based pivot analysis at the component level, RESIN divides the raw memory usage data into a number of buckets and transforms it into a summary of the number of hosts in each bucket. A severity score is then computed for each bucket based on the deviations and the host count in the bucket. Anomaly detection is performed on the time-series data of each bucket of each component. The bucketization strategy not only represents the workload trend robustly and tolerates noise, but also lowers the computational load of anomaly detection.
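
The bucketization step can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not RESIN's actual implementation: the fixed bucket width, the function names, and in particular the severity formula (relative increase in host count multiplied by the current host count) are all assumptions made for the example.

```python
from collections import Counter


def bucketize(host_usage_mb, bucket_size_mb=1024):
    """Pivot raw per-host memory usage into {bucket_index: host_count}."""
    return Counter(int(usage // bucket_size_mb) for usage in host_usage_mb.values())


def severity_scores(current, baseline):
    """Score each bucket from its deviation against a baseline and its host count.

    Assumed formula: relative increase in host count times current host count.
    Buckets that shrank or stayed the same score 0.
    """
    scores = {}
    for bucket in set(current) | set(baseline):
        cur = current.get(bucket, 0)
        base = baseline.get(bucket, 0)
        deviation = (cur - base) / max(base, 1)
        scores[bucket] = max(deviation, 0.0) * cur
    return scores


# Hypothetical telemetry: memory usage in MB per host at two points in time.
baseline = bucketize({"h1": 500, "h2": 600, "h3": 1500, "h4": 700})
current = bucketize({"h1": 500, "h2": 2100, "h3": 2200, "h4": 2300})
scores = severity_scores(current, baseline)
```

In this made-up data three hosts have moved into the 2 to 3 GB bucket, so that bucket receives the highest score. Working on host counts per bucket rather than on raw per-host series is what keeps the later anomaly detection cheap and tolerant of noise on individual hosts.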

However, since a component typically runs many processes, detection at the component level alone is insufficient for developers to investigate a leak efficiently. When a leaking bucket is found at the component level, RESIN runs a second-level detection scheme at process granularity to narrow the scope of the investigation. It outputs the suspected leaking process, the start and end times of the leak, and its severity score.
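
As a rough illustration of process-level detection, the sketch below fits a least-squares trend line to one process's memory samples and reports the process when memory grows faster than a threshold. The source does not describe RESIN's per-process algorithm, so the trend-fitting technique, the threshold, and the use of the slope as the severity score are assumptions chosen only to show the shape of the output (process, start time, end time, severity).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LeakFinding:
    process: str
    start: float     # time of the first sample (hours)
    end: float       # time of the last sample (hours)
    severity: float  # assumed severity: fitted growth in MB per hour


def fitted_slope(samples: List[Tuple[float, float]]) -> float:
    """Least-squares slope of memory (MB) over time (hours)."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return 0.0
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t


def detect_process_leak(process: str,
                        samples: List[Tuple[float, float]],
                        min_slope_mb_per_hour: float = 1.0) -> Optional[LeakFinding]:
    """Return a finding if memory grows faster than the threshold, else None."""
    if len(samples) < 2:
        return None
    growth = fitted_slope(samples)
    if growth < min_slope_mb_per_hour:
        return None
    return LeakFinding(process, samples[0][0], samples[-1][0], growth)
```

A process whose memory rises by 10 MB every hour would be reported with a severity of about 10, while a process with flat memory usage returns `None`.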

Diagnosis of detected leaks

When a memory leak is detected, RESIN captures a snapshot of the live heap, which contains all memory allocations referenced by the running programme. It then analyses the snapshots to pinpoint the root cause of the leak. This makes memory leak alerts actionable.
RESIN uses the snapshot capability of the Windows heap manager to perform live profiling. However, heap collection can be expensive and can interfere with the host's performance. To reduce this overhead, several considerations determine how snapshots are taken.

  • The heap manager stores only limited information in each snapshot, such as the stack trace and size of each active allocation.
  • RESIN prioritises candidate hosts for snapshotting according to leak severity, noise level, and customer impact. By default, the top three hosts on the suspect list are selected to ensure a successful collection.
  • RESIN uses a long-term, trigger-based strategy to ensure that the snapshots capture the complete leak. To decide when to stop the trace collection, it analyses memory growth patterns (such as steady, spike, or stair) and uses a pattern-based method to determine the trace completion triggers.
  • To facilitate diagnosis, RESIN builds reference snapshots using a periodic fingerprinting process; these are compared with the snapshot of the suspected leaking process.
  • RESIN analyses the collected snapshots to output the stack traces of the root cause.
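
The comparison between a suspect snapshot and a reference snapshot can be pictured with the sketch below. It assumes a snapshot is simply a list of (stack trace, allocation size) pairs, which matches the limited data described above, and it ranks stack traces by how much more memory they hold than in the reference. The data layout and function names are hypothetical, and RESIN's real diagnosis algorithm is not specified in this article.

```python
from collections import defaultdict


def aggregate_by_stack(allocations):
    """Sum the sizes of active allocations per stack trace."""
    totals = defaultdict(int)
    for stack, size in allocations:
        totals[stack] += size
    return dict(totals)


def rank_suspect_stacks(suspect_allocs, reference_allocs, top=3):
    """Rank stack traces by memory growth relative to the reference snapshot."""
    suspect = aggregate_by_stack(suspect_allocs)
    reference = aggregate_by_stack(reference_allocs)
    growth = {stack: size - reference.get(stack, 0) for stack, size in suspect.items()}
    # Keep only stacks that hold more memory than in the reference snapshot.
    grown = [item for item in growth.items() if item[1] > 0]
    return sorted(grown, key=lambda item: item[1], reverse=True)[:top]


# Hypothetical snapshots: each entry is (stack trace, allocation size in bytes).
reference_snapshot = [(("main", "parse"), 100), (("main", "cache_put"), 200)]
suspect_snapshot = [
    (("main", "parse"), 100),
    (("main", "cache_put"), 200),
    (("main", "cache_put"), 5000),
    (("main", "log"), 50),
]
ranked = rank_suspect_stacks(suspect_snapshot, reference_snapshot)
```

Here the `cache_put` stack is ranked first because it holds far more memory than in the reference, which is the kind of result a developer would want to see in a diagnosis report.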

Mitigation of detected leaks

RESIN attempts to mitigate memory leaks automatically in order to prevent further customer impact. Depending on the nature of the leak, a few different types of mitigation steps are taken. RESIN uses a rule-based decision tree to select the mitigation action that minimises the impact.

If the memory leak is localised to a single process or Windows service, RESIN attempts the lightest mitigation: simply restarting the process or service. Rebooting the operating system can resolve software memory leaks, but it is usually the last resort because it is time-consuming and can cause virtual machine downtime. For non-empty hosts, RESIN uses solutions such as Project Tardigrade, which skips hardware initialization and performs only a kernel soft reboot after live virtual machine migration, in order to minimise user impact. A full OS reboot is performed only when the soft reboot fails.

RESIN stops applying mitigation actions to a target once the detection engine no longer considers that target to be leaking.
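
The escalation order described in this section can be written down as a small rule-based function. This is only a sketch of the rules as stated here: the enum, the function, and its parameters are hypothetical names, and because the article does not say how empty hosts are handled, the sketch treats all hosts alike.

```python
from enum import Enum


class Mitigation(Enum):
    NONE = "no action, target no longer considered leaking"
    RESTART_PROCESS = "restart the process or service"
    KERNEL_SOFT_REBOOT = "kernel soft reboot after live VM migration"
    FULL_OS_REBOOT = "full OS reboot"


def choose_mitigation(still_leaking, single_process_leak, soft_reboot_failed=False):
    """Pick the least intrusive action that the rules in the text allow."""
    if not still_leaking:
        # The detection engine no longer flags the target, so stop mitigating.
        return Mitigation.NONE
    if single_process_leak:
        # Leak confined to one process or service: lightest option first.
        return Mitigation.RESTART_PROCESS
    if not soft_reboot_failed:
        # Skip hardware initialization; VMs are live-migrated beforehand.
        return Mitigation.KERNEL_SOFT_REBOOT
    # Last resort, used only when the soft reboot has failed.
    return Mitigation.FULL_OS_REBOOT
```

For example, `choose_mitigation(True, False, soft_reboot_failed=True)` returns `Mitigation.FULL_OS_REBOOT`, while any call with `still_leaking=False` returns `Mitigation.NONE`.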

Impact and outcome of memory leak identification

Since its deployment in late 2018, RESIN has been used in Azure production to monitor millions of host nodes and hundreds of host processes every day. Even with the rapidly growing cloud infrastructure being monitored, RESIN memory leak detection has achieved 85% precision and 91% recall.
The end-to-end benefits of RESIN are clearly shown by two key metrics:

  1. Virtual machine unexpected reboots: the average number of reboots per 100,000 hosts per day due to low memory.
  2. Virtual machine allocation error: the ratio of erroneous virtual machine allocation requests due to low memory.

Virtual machine reboots decreased by almost 100 times and allocation error rates decreased by more than 30 times between September 2020 and December 2023. Furthermore, Azure host memory leaks have not resulted in any significant disruptions since 2020.


Thota nithya
Thota Nithya has been writing cloud computing articles for govindhtech since April 2023. She is a science graduate and a cloud computing enthusiast.

