Building Long-Term Reliability in Machine Learning Models via Service-Level Objectives: A Primer

By Sunandan Barman on
December 2, 2024

Machine learning models are increasingly being used in all aspects of software systems, be it banking, robotics, transportation, e-commerce, or social media. The machine learning market is projected to reach US$79.29bn in 2024 and is expected to grow to a market volume of US$503.40bn by 2030 [1].

The staggering amount of money invested poses a serious question to management and engineers: how should the engineering community think about long-term reliability in complex systems like machine learning models? To make steady progress towards reliability goals, service-level objectives (SLOs) have long been used in the monitoring and observability space of software engineering.

First, let's understand in simple terms what SLA, SLO, and SLI mean, for readers who are not familiar with these terms.

SLIs (Service Level Indicators)


SLIs measure what the system is already doing. Formally, they are metrics or measurements used to track and monitor the performance and behavior of a service. Depending on what we measure, we want either a lower or a higher number than the current value.

For example, let's say we want to measure the performance of a social media platform. An SLI for this service could be the average response time for user interactions. This SLI measures how quickly the platform responds to user actions, such as liking a post or commenting on a photo. A lower response time means a better SLI in this case. If, on the other hand, we are measuring the availability of a cloud provider such as Google Cloud as an SLI, then we want a higher SLI.
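As a minimal sketch, assuming a hypothetical request log with per-request success flags and latencies (the field names and sample values are made up for illustration), these two SLIs could be computed as follows:

    from statistics import mean

    # Hypothetical request log: each entry records whether the request
    # succeeded and how long it took (in milliseconds).
    requests = [
        {"ok": True, "latency_ms": 120},
        {"ok": True, "latency_ms": 340},
        {"ok": False, "latency_ms": 2000},
        {"ok": True, "latency_ms": 95},
    ]

    # Latency SLI: average response time for user interactions (lower is better).
    avg_latency_ms = mean(r["latency_ms"] for r in requests)

    # Availability SLI: fraction of successful requests (higher is better).
    availability = sum(r["ok"] for r in requests) / len(requests)

    print(f"Average latency SLI: {avg_latency_ms:.0f} ms")
    print(f"Availability SLI: {availability:.2%}")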

SLOs (Service Level Objectives)


This is the value we hope to achieve internally for the metrics being measured. Based on SLIs, SLOs set a target or threshold the service should meet or exceed. SLOs help establish a clear understanding of what users can expect from a service and provide a basis for assessing its quality. Taking the previous example again, for Google Cloud the SLO could be 99.99% availability.

SLA (Service Level Agreement)

These are customer-facing agreements about system metrics (such as reliability and availability). Violations of SLAs have consequences, depending on the system. An SLA promises goals, e.g. for reliability, that are close to what can actually be delivered but less strict than the internal SLO goals.

For Google Cloud, the SLA could be 99.95%. This means the external promise to users is a little lower than the internal SLO target. The gap gives the engineering team room to identify future improvements so that the SLA can move toward the SLO.
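To make the relationship between the three concepts concrete, here is a small sketch that compares a measured availability SLI against the internal SLO and the external SLA from the example above; the measured value and the error-budget framing are assumptions for illustration, not figures from any real provider:

    # Targets from the example above: the internal SLO is stricter than
    # the externally promised SLA.
    SLO = 0.9999   # internal target: 99.99% availability
    SLA = 0.9995   # external promise: 99.95% availability

    measured_sli = 0.9997  # hypothetical availability measured over some window

    # Error budget: how much unavailability the SLO allows vs. how much was used.
    allowed_error = 1 - SLO
    consumed_error = 1 - measured_sli
    budget_remaining = allowed_error - consumed_error

    print(f"Meets SLA: {measured_sli >= SLA}")  # external promise kept?
    print(f"Meets SLO: {measured_sli >= SLO}")  # internal target met?
    print(f"Error budget remaining: {budget_remaining:.6f}")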

Visualization of SLI, SLO, and SLA as a user interacts with a service

SLOs in ML Models


SLOs have been a proven solution for traditional services, which use standard, well-documented metrics like availability, latency, and correctness as SLIs to monitor service health. Some well-known examples are AWS [2] and Google Cloud [3]. However, fitting SLOs to ML models requires a fresh perspective, because ML models behave differently from traditional services: non-deterministic output, lack of transparency, sensitivity to data drift, and so on.

Machine learning model lifecycle

Machine Learning Model Lifecycle


The concept of SLOs can be fitted to ML models when we observe that the entire ML model lifecycle can be grouped into three buckets:

  • Model training
  • Model serving
  • Model performance

At SRECon21, Google Ads ML SRE Director Todd Underwood presented some issues that can be monitored for ML systems [4]. Some examples presented are: training is too slow or stuck, data distribution changes dramatically, the model fails to load in serving, training requires many more resources, and model quality changes significantly. We can combine the model lifecycle idea with the ideas presented in this talk and come up with a few metric suggestions.

Model Training


Training time can be an important SLI that helps us identify scenarios where training time needs to be optimized, such as online training or resource-constrained environments.
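For instance, here is a minimal sketch of tracking training time as an SLI and flagging a breach; the two-hour budget and the placeholder training function are assumptions made for illustration:

    import time

    def train_model():
        # Placeholder for the real training loop; simulate a short run here.
        time.sleep(0.1)

    # Assumed SLO: a training run should finish within 2 hours (7200 s).
    TRAINING_TIME_SLO_SECONDS = 2 * 60 * 60

    start = time.monotonic()
    train_model()
    training_time_s = time.monotonic() - start

    print(f"Training time SLI: {training_time_s:.1f} s")
    if training_time_s > TRAINING_TIME_SLO_SECONDS:
        print("SLO breach: training is too slow or may be stuck")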

The training label ratio (%) metric measures the proportion of positive labels (e.g., 1) to negative labels (e.g., 0) in the training data. It is useful to monitor because it influences how the model fits the data and how quickly the model reaches a stable and optimal state during training.
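A sketch of computing this metric for a binary-labeled dataset and alerting when it drifts outside an acceptable band; the labels and the band are hypothetical values chosen for illustration:

    # Hypothetical binary training labels (1 = positive, 0 = negative).
    labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

    positives = sum(1 for y in labels if y == 1)
    label_ratio_pct = 100.0 * positives / len(labels)

    # Assumed acceptable band for the positive-label share.
    LOWER_PCT, UPPER_PCT = 20.0, 50.0

    print(f"Training label ratio: {label_ratio_pct:.1f}% positive")
    if not (LOWER_PCT <= label_ratio_pct <= UPPER_PCT):
        print("Label ratio outside expected band: check for a data distribution change")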

Model Serving


Model serving is the aspect of ML systems that aligns most closely with general software systems. For this aspect, we can use metrics such as the following (a small monitoring sketch follows the list):

  • Model uptime or serving rate, or inversely, the % of time the model was not served in production
  • Model freshness, or inversely, model staleness; a stale model is usually a worse-performing model [5]
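Both metrics can be derived from simple timestamps emitted by the serving system. The sketch below assumes a one-day measurement window, a hypothetical amount of downtime, and a weekly freshness target; all of these values are illustrative:

    from datetime import datetime, timedelta, timezone

    # Hypothetical serving records over a one-day window.
    WINDOW = timedelta(days=1)
    downtime = timedelta(minutes=12)  # time the model was not served
    last_trained = datetime(2024, 11, 20, tzinfo=timezone.utc)
    now = datetime(2024, 11, 29, tzinfo=timezone.utc)

    # Uptime SLI: share of the window during which the model was served.
    uptime_pct = 100.0 * (WINDOW - downtime) / WINDOW

    # Freshness SLI: age of the model currently in production.
    staleness_days = (now - last_trained).days
    FRESHNESS_SLO_DAYS = 7  # assumed target: retrain at least weekly

    print(f"Model uptime: {uptime_pct:.2f}%")
    print(f"Model staleness: {staleness_days} days")
    if staleness_days > FRESHNESS_SLO_DAYS:
        print("SLO breach: stale model, performance may degrade")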

Model Performance


While training and serving can be easier to reason about, the goal of using ML is not to build a model but to use it in an application in a way that provides value, and none of the previous examples captures the performance of the system from the business/user-value perspective.

For this, we need to understand the end goals of ML models from the business perspective. If the model is trying to increase the reach of a social media post, then post engagement will be a metric to consider. We can use the calibration metric as an SLO for this.

Calibration


Model calibration is the process of adjusting model parameters to improve a model's accuracy and ensure that its predictions match observed data. By monitoring calibration, we can infer whether a model is meeting expectations.
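One common way to quantify calibration for a binary outcome such as post engagement is to compare the average predicted probability with the observed positive rate: a ratio near 1 suggests the model's predictions match reality. The sketch below uses made-up predictions and an assumed tolerance purely for illustration:

    from statistics import mean

    # Hypothetical predictions (probability of engagement) and observed outcomes.
    predicted = [0.10, 0.80, 0.35, 0.60, 0.20, 0.90]
    observed = [0, 1, 0, 1, 0, 1]

    # Calibration ratio: mean predicted probability vs. observed positive rate.
    calibration = mean(predicted) / mean(observed)

    TOLERANCE = 0.10  # assumed: alert if calibration deviates >10% from 1.0
    print(f"Calibration ratio: {calibration:.2f}")
    if abs(calibration - 1.0) > TOLERANCE:
        print("Calibration drift: predictions no longer match observed engagement")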

Conclusion


Reliability is an inescapable part of building software systems, and with the exponential growth in investment in ML models, building long-term strategies for reliability in ML models has become imperative. Engineers need to think closely about what use case the ML model is solving and choose the metrics to monitor accordingly, to drive long-term reliability goals. There is no one-stop solution, but there is sufficient literature available to guide this work.

References

[1] https://www.statista.com/outlook/tmo/artificial-intelligence/machine-learning/worldwide
[2] https://aws.amazon.com/legal/service-level-agreements
[3] https://cloud.google.com/terms/sla?hl=en
[4] https://www.usenix.org/conference/srecon21/presentation/underwood-sre-ml
[5] https://www.cs.cmu.edu/~epxing/papers/2019/Dai_etal_ICLR19.pdf

Disclaimer: The author is completely responsible for the content of this article. The opinions expressed are their own and do not represent IEEE's position nor that of the Computer Society nor its Leadership.

LATEST NEWS
From Isolation to Innovation: Establishing a Computer Training Center to Empower Hinterland Communities
From Isolation to Innovation: Establishing a Computer Training Center to Empower Hinterland Communities
IEEE Uganda Section: Tackling Climate Change and Food Security Through AI and IoT
IEEE Uganda Section: Tackling Climate Change and Food Security Through AI and IoT
Blockchain Service Capability Evaluation (IEEE Std 3230.03-2025)
Blockchain Service Capability Evaluation (IEEE Std 3230.03-2025)
Autonomous Observability: AI Agents That Debug AI
Autonomous Observability: AI Agents That Debug AI
Disaggregating LLM Infrastructure: Solving the Hidden Bottleneck in AI Inference
Disaggregating LLM Infrastructure: Solving the Hidden Bottleneck in AI Inference
Read Next

From Isolation to Innovation: Establishing a Computer Training Center to Empower Hinterland Communities

IEEE Uganda Section: Tackling Climate Change and Food Security Through AI and IoT

Blockchain Service Capability Evaluation (IEEE Std 3230.03-2025)

Autonomous Observability: AI Agents That Debug AI

Disaggregating LLM Infrastructure: Solving the Hidden Bottleneck in AI Inference

Copilot Ergonomics: UI Patterns that Reduce Cognitive Load

The Myth of AI Neutrality in Search Algorithms

Gen AI and LLMs: Rebuilding Trust in a Synthetic Information Age

Get the latest news and technology trends for computing professionals with ComputingEdge
Sign up for our newsletter