Building Long-Term Reliability in Machine Learning Models via Service-Level Objectives: A Primer

By Sunandan Barman on
December 2, 2024

Machine learning models are increasingly being used in all aspects of software systems, be it banking, robotics, transportation, e-commerce, or social media. The machine learning market is projected to reach US$79.29bn in 2024 and is expected to grow to a market volume of US$503.40bn by 2030 [1].

The staggering amount of money invested poses a serious question to management and engineers: how should the engineering community think about long-term reliability in complex systems like machine learning models? To make steady progress towards reliability goals, service-level objectives (SLOs) have long been used in the monitoring and observability space of software engineering.

First, let's understand in simple terms what SLA, SLO, and SLI mean, for readers who are not familiar with these terms.

SLIs (Service Level Indicators)


SLIs measure what the system is already doing. Formally, they are metrics or measurements used to track and monitor the performance and behavior of a service. Depending on what we measure, we want either a lower or a higher number than the current value.

For example, let's say we want to measure the performance of a social media platform. An SLI for this service could be the average response time for user interactions. This SLI measures how quickly the platform responds to user actions, such as liking a post or commenting on a photo. A lower response time means a better SLI in this case. If, on the other hand, we are measuring the availability of a cloud provider such as Google Cloud as an SLI, then we want a higher SLI.
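As a minimal sketch, assuming a hypothetical request log with per-request success flags and latencies (the field names and sample values are made up for illustration), these two SLIs could be computed as follows:

    from statistics import mean

    # Hypothetical request log: each entry records whether the request
    # succeeded and how long it took (in milliseconds).
    requests = [
        {"ok": True, "latency_ms": 120},
        {"ok": True, "latency_ms": 340},
        {"ok": False, "latency_ms": 2000},
        {"ok": True, "latency_ms": 95},
    ]

    # Latency SLI: average response time for user interactions (lower is better).
    avg_latency_ms = mean(r["latency_ms"] for r in requests)

    # Availability SLI: fraction of successful requests (higher is better).
    availability = sum(r["ok"] for r in requests) / len(requests)

    print(f"Average latency SLI: {avg_latency_ms:.0f} ms")
    print(f"Availability SLI: {availability:.2%}")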

SLOs (Service Level Objectives)


This is the value we hope to achieve internally for the metrics being measured. Based on SLIs, SLOs set a target or threshold the service should meet or exceed. SLOs help establish a clear understanding of what users can expect from a service and provide a basis for assessing its quality. Taking the previous example again, for Google Cloud the SLO could be 99.99% availability.

SLA (Service Level Agreement)

These are customer-facing agreements about system metrics (such as reliability and availability). Violations of SLAs have consequences, depending on the system. An SLA promises goals, e.g. for reliability, that are close to what can actually be delivered but less strict than the internal SLO goals.

For Google Cloud, the SLA could be 99.95%. This means the external promise to users is a little lower than the internal SLO target. The gap gives the engineering team room to identify future improvements so that the SLA can move toward the SLO.
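To make the relationship between the three concepts concrete, here is a small sketch that compares a measured availability SLI against the internal SLO and the external SLA from the example above; the measured value and the error-budget framing are assumptions for illustration, not figures from any real provider:

    # Targets from the example above: the internal SLO is stricter than
    # the externally promised SLA.
    SLO = 0.9999   # internal target: 99.99% availability
    SLA = 0.9995   # external promise: 99.95% availability

    measured_sli = 0.9997  # hypothetical availability measured over some window

    # Error budget: how much unavailability the SLO allows vs. how much was used.
    allowed_error = 1 - SLO
    consumed_error = 1 - measured_sli
    budget_remaining = allowed_error - consumed_error

    print(f"Meets SLA: {measured_sli >= SLA}")  # external promise kept?
    print(f"Meets SLO: {measured_sli >= SLO}")  # internal target met?
    print(f"Error budget remaining: {budget_remaining:.6f}")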

Visualization of SLI, SLO, and SLA as a user interacts with a service

SLOs in ML Models


SLOs have been a proven solution for traditional services, which use standard, well-documented metrics like availability, latency, and correctness as SLIs to monitor service health. Some well-known examples are AWS [2] and Google Cloud [3]. However, fitting SLOs to ML models requires a fresh perspective, because ML models behave differently from traditional services: non-deterministic output, lack of transparency, sensitivity to data drift, and so on.

Machine learning model lifecycle

Machine Learning Model Lifecycle


The concept of SLOs can be fitted to ML models when we observe that the entire ML model lifecycle can be grouped into three buckets:

  • Model training
  • Model serving
  • Model performance

At SRECon21, Google Ads ML SRE Director Todd Underwood presented some issues that can be monitored for ML systems [4]. Some examples presented are: training is too slow or stuck, data distribution changes dramatically, the model fails to load in serving, training requires many more resources, and model quality changes significantly. We can combine the model lifecycle idea with the ideas presented in this talk and come up with a few metric suggestions.

Model Training


Training time can be an important SLI that helps us identify scenarios where training time needs to be optimized, such as online training or resource-constrained environments.
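For instance, here is a minimal sketch of tracking training time as an SLI and flagging a breach; the two-hour budget and the placeholder training function are assumptions made for illustration:

    import time

    def train_model():
        # Placeholder for the real training loop; simulate a short run here.
        time.sleep(0.1)

    # Assumed SLO: a training run should finish within 2 hours (7200 s).
    TRAINING_TIME_SLO_SECONDS = 2 * 60 * 60

    start = time.monotonic()
    train_model()
    training_time_s = time.monotonic() - start

    print(f"Training time SLI: {training_time_s:.1f} s")
    if training_time_s > TRAINING_TIME_SLO_SECONDS:
        print("SLO breach: training is too slow or may be stuck")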

The training label ratio (%) metric measures the proportion of positive labels (e.g., 1) to negative labels (e.g., 0) in the training data. It is useful to monitor because it influences how the model fits the data and how quickly the model reaches a stable and optimal state during training.
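A sketch of computing this metric for a binary-labeled dataset and alerting when it drifts outside an acceptable band; the labels and the band are hypothetical values chosen for illustration:

    # Hypothetical binary training labels (1 = positive, 0 = negative).
    labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

    positives = sum(1 for y in labels if y == 1)
    label_ratio_pct = 100.0 * positives / len(labels)

    # Assumed acceptable band for the positive-label share.
    LOWER_PCT, UPPER_PCT = 20.0, 50.0

    print(f"Training label ratio: {label_ratio_pct:.1f}% positive")
    if not (LOWER_PCT <= label_ratio_pct <= UPPER_PCT):
        print("Label ratio outside expected band: check for a data distribution change")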

Model Serving


Model serving is the aspect of ML systems that aligns most closely with general software systems. For this aspect, we can use metrics such as the following (a small monitoring sketch follows the list):

  • Model uptime or serving rate, or inversely, the % of time the model was not served in production
  • Model freshness, or inversely, model staleness; a stale model is usually a worse-performing model [5]
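Both metrics can be derived from simple timestamps emitted by the serving system. The sketch below assumes a one-day measurement window, a hypothetical amount of downtime, and a weekly freshness target; all of these values are illustrative:

    from datetime import datetime, timedelta, timezone

    # Hypothetical serving records over a one-day window.
    WINDOW = timedelta(days=1)
    downtime = timedelta(minutes=12)  # time the model was not served
    last_trained = datetime(2024, 11, 20, tzinfo=timezone.utc)
    now = datetime(2024, 11, 29, tzinfo=timezone.utc)

    # Uptime SLI: share of the window during which the model was served.
    uptime_pct = 100.0 * (WINDOW - downtime) / WINDOW

    # Freshness SLI: age of the model currently in production.
    staleness_days = (now - last_trained).days
    FRESHNESS_SLO_DAYS = 7  # assumed target: retrain at least weekly

    print(f"Model uptime: {uptime_pct:.2f}%")
    print(f"Model staleness: {staleness_days} days")
    if staleness_days > FRESHNESS_SLO_DAYS:
        print("SLO breach: stale model, performance may degrade")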

Model Performance


While training and serving can be easier to reason about, the goal of using ML is not to build a model but to use it in an application in a way that provides value, and none of the previous examples captures the performance of the system from the business/user-value perspective.

For this, we need to understand the end goals of ML models from the business perspective. If the model is trying to increase the reach of a social media post, then post engagement will be a metric to consider. We can use the calibration metric as an SLO for this.

Calibration


Model calibration is the process of adjusting model parameters to improve a model's accuracy and ensure that its predictions match observed data. By monitoring calibration, we can infer whether a model is meeting expectations.
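One common way to quantify calibration for a binary outcome such as post engagement is to compare the average predicted probability with the observed positive rate: a ratio near 1 suggests the model's predictions match reality. The sketch below uses made-up predictions and an assumed tolerance purely for illustration:

    from statistics import mean

    # Hypothetical predictions (probability of engagement) and observed outcomes.
    predicted = [0.10, 0.80, 0.35, 0.60, 0.20, 0.90]
    observed = [0, 1, 0, 1, 0, 1]

    # Calibration ratio: mean predicted probability vs. observed positive rate.
    calibration = mean(predicted) / mean(observed)

    TOLERANCE = 0.10  # assumed: alert if calibration deviates >10% from 1.0
    print(f"Calibration ratio: {calibration:.2f}")
    if abs(calibration - 1.0) > TOLERANCE:
        print("Calibration drift: predictions no longer match observed engagement")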

Conclusion


Reliability is an inescapable part of building software systems, and with the exponential growth in investment in ML models, building long-term strategies for reliability in ML models has become imperative. Engineers need to think closely about what use case the ML model is solving and choose the metrics to monitor accordingly, to drive long-term reliability goals. There is no one-stop solution, but there is sufficient literature available to guide this work.

References

[1] https://www.statista.com/outlook/tmo/artificial-intelligence/machine-learning/worldwide
[2] https://aws.amazon.com/legal/service-level-agreements
[3] https://cloud.google.com/terms/sla?hl=en
[4] https://www.usenix.org/conference/srecon21/presentation/underwood-sre-ml
[5] https://www.cs.cmu.edu/~epxing/papers/2019/Dai_etal_ICLR19.pdf

Disclaimer: The author is completely responsible for the content of this article. The opinions expressed are their own and do not represent IEEE's position nor that of the Computer Society nor its Leadership.

LATEST NEWS
From Isolation to Innovation: Establishing a Computer Training Center to Empower Hinterland Communities
From Isolation to Innovation: Establishing a Computer Training Center to Empower Hinterland Communities
IEEE Uganda Section: Tackling Climate Change and Food Security Through AI and IoT
IEEE Uganda Section: Tackling Climate Change and Food Security Through AI and IoT
Blockchain Service Capability Evaluation (IEEE Std 3230.03-2025)
Blockchain Service Capability Evaluation (IEEE Std 3230.03-2025)
Autonomous Observability: AI Agents That Debug AI
Autonomous Observability: AI Agents That Debug AI
Disaggregating LLM Infrastructure: Solving the Hidden Bottleneck in AI Inference
Disaggregating LLM Infrastructure: Solving the Hidden Bottleneck in AI Inference
Read Next

From Isolation to Innovation: Establishing a Computer Training Center to Empower Hinterland Communities

IEEE Uganda Section: Tackling Climate Change and Food Security Through AI and IoT

Blockchain Service Capability Evaluation (IEEE Std 3230.03-2025)

Autonomous Observability: AI Agents That Debug AI

Disaggregating LLM Infrastructure: Solving the Hidden Bottleneck in AI Inference

Copilot Ergonomics: UI Patterns that Reduce Cognitive Load

The Myth of AI Neutrality in Search Algorithms

Gen AI and LLMs: Rebuilding Trust in a Synthetic Information Age

Get the latest news and technology trends for computing professionals with ComputingEdge
Sign up for our newsletter