What is Site Reliability Engineering (SRE)?

What is SRE?

Site reliability engineering (SRE) uses software engineering to automate IT operations tasks, such as production system management, change management, incident response and even emergency response, that would otherwise be performed manually by systems administrators (sysadmins).

The principle behind SRE is that using software code to automate oversight of large software systems is a more scalable and sustainable strategy than manual intervention, especially as those systems extend or migrate to the cloud.

SRE can also reduce or remove much of the natural friction between development teams, who want to continually release new or updated software into production, and operations teams, who don't want to release any type of update or new software without being sure it won't cause outages or other operations problems. As a result, while not strictly required for DevOps, SRE aligns closely with DevOps principles and can play an important role in DevOps success.

The concept of SRE is credited to Ben Treynor Sloss, VP of engineering at Google, who famously wrote that "SRE is what happens when you ask a software engineer to design an operations team."

What do site reliability engineers do?

A site reliability engineer is a software developer with IT operations experience—someone who knows how to code and who understands how to 'keep the lights on' in a large-scale IT environment.

Site reliability engineers spend half their time performing manual IT operations and system administration tasks: analyzing logs, tuning performance, applying patches, testing production environments, responding to incidents and conducting postmortems. The rest of their time, they develop code that automates those tasks. Their goal is to spend less time on the former and more time on the latter.

At a higher level, the SRE team serves as a bridge between development teams and operations teams, enabling the development team to bring new software or new features to production as quickly as possible. They do this while also ensuring an agreed-upon acceptable level of IT operations performance and error risk in line with the service level agreements (SLAs) the organization has in place with its customers. Based on their experience and a wealth of operations data, the SRE team helps the development and operations teams establish:

Service level indicators (SLIs): Measurements of the service level provided by systems—metrics such as availability (uptime) or latency.
Service level objectives (SLOs): Agreed-upon target values or ranges for the service level, as measured by the SLIs (for example, a target percentage of availability over a given period).
Error budgets: The maximum amount of time a system can fail or underperform without violating the contractual terms of the SLA. More than a metric, the error budget is the tool a site reliability engineering team uses to automatically reconcile a company's pace of innovation with its service reliability.
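
As a rough illustration of how these three terms relate, the sketch below computes an availability SLI from request counts, checks it against an SLO target, and reports what fraction of the error budget is left. The function names, the request-based definition of availability and the 99.9% target in the usage lines are assumptions made for illustration; they do not come from any particular monitoring tool or SLA.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Availability SLI: fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # assumption: a window with no traffic counts as fully available
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent.

    1.0 means no budget used, 0.0 means fully spent,
    and a negative value means the SLO has been missed.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target   # failure rate the SLO tolerates
    actual_failure = 1.0 - sli           # failure rate actually observed
    return 1.0 - actual_failure / allowed_failure


# Hypothetical numbers: 1,000,000 requests, 500 of them failed.
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
remaining = error_budget_remaining(sli, slo_target=0.999)
print(f"SLI = {sli:.4%}, error budget remaining = {remaining:.0%}")
```

With these made-up figures the observed failure rate is half of what the SLO allows, so roughly half of the error budget is still available.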

How do error budgets work?

The error budget is the tool an SRE team uses to automatically reconcile a company's service reliability with its pace of software development and innovation.

Suppose a company's SLA promises 99.99% uptime (a common availability target) per year. That means the monthly error budget—the total amount of downtime allowable without contractual consequence for any given month—is about 4 minutes and 23 seconds.
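
The arithmetic behind that figure can be reproduced with a few lines of Python. This is a minimal sketch that assumes an average year of 365.25 days and an even split of the yearly budget across 12 months; the function name downtime_budget is made up for this example.

```python
from datetime import timedelta

DAYS_PER_YEAR = 365.25  # assumption: average year length, leap years included


def downtime_budget(availability_target: float, period: timedelta) -> timedelta:
    """Downtime allowed over `period` for a given availability target (0..1)."""
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    return period * (1.0 - availability_target)


yearly = downtime_budget(0.9999, timedelta(days=DAYS_PER_YEAR))
monthly = yearly / 12  # assumption: budget spread evenly over the months

print(f"yearly budget:  {yearly.total_seconds() / 60:.1f} minutes")
print(f"monthly budget: {monthly.total_seconds():.0f} seconds")
```

The yearly budget works out to a little under 53 minutes, and one twelfth of that is roughly 263 seconds, which matches the figure of about 4 minutes and 23 seconds quoted above.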

Now let's say the development team wants to roll out some new features or improvements to the system. If the system is running under the error budget, the team can deliver the new features. If not, the team can't deliver the new features until they work with the operations team to get these errors or outages down to an acceptable level.
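
The release decision described here can be sketched as a simple gate. The ErrorBudget class, its field names and the can_release function are hypothetical, and a real team's policy would likely weigh more than a single yes-or-no check; the sketch only shows the basic comparison of observed downtime against the allowed amount.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class ErrorBudget:
    """Downtime allowed and downtime observed in the current period."""
    allowed_downtime: timedelta
    observed_downtime: timedelta

    @property
    def remaining(self) -> timedelta:
        """Budget left; negative once the allowed downtime is exceeded."""
        return self.allowed_downtime - self.observed_downtime


def can_release(budget: ErrorBudget) -> bool:
    """Allow new feature releases only while some error budget remains."""
    return budget.remaining > timedelta(0)


# Hypothetical month with a budget of 4 minutes 23 seconds.
allowed = timedelta(minutes=4, seconds=23)
print(can_release(ErrorBudget(allowed, observed_downtime=timedelta(minutes=1))))  # True
print(can_release(ErrorBudget(allowed, observed_downtime=timedelta(minutes=5))))  # False
```

In the second case the budget is overspent, so releases would wait until the teams have brought errors or outages back down to an acceptable level.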

In this way, error budgets help development teams and operations teams to:

Improve the stability and performance of services.
Make data-driven decisions about deploying new features or applications.
Maximize innovation by taking risks within acceptable limits.

SRE and DevOps

DevOps is a modern way to deliver higher-quality applications faster by automating the software delivery lifecycle and by giving development and operations teams more shared responsibility and more input into each other’s work.

Like SRE, DevOps makes a business more agile by balancing the need to deliver more applications and changes faster with the need to avoid 'breaking' the production environment. And like SRE, DevOps aims to achieve this balance by establishing an acceptable risk of errors. In fact, SRE and DevOps seem so similar that some experts say they're the same thing—but most see SRE practices as excellent ways to implement DevOps principles. For example:

DevOps principles: Reduce organizational silos, leverage tooling and automation.

SRE practice: Use the same tooling to automate and improve operations as developers use to develop and improve software.

DevOps principles: Accept failure as normal, implement gradual changes.

SRE practice: Use error budgets to continually deploy new features and functionality within acceptable levels.

DevOps principle: Measure everything.

SRE practice: Base decisions to release new software on SLA metrics.

Other SRE benefits

In addition to supporting DevOps success, site reliability engineering can help a company:

Gain greater visibility into service health by tracking metrics, logs and traces across all services in the organization and by providing context for identifying root causes in the event of an incident.
Quantify the cost of downtime by helping development and operations teams understand the cost of SLA violations, and helping management quantify the impact of system reliability on production, sales, marketing, customer service and other business functions.
Optimize incident response by building efficient on-call processes and streamlining alerting workflows.
Build a modern network operations center by combining in-depth understanding of IT operations with machine learning and automation, to send alerts directly to the person responsible for addressing the issue.

SRE, cloud and cloud-native development

Migration from traditional IT and on-premises data centers to hybrid cloud environments is one of the chief reasons that the average enterprise generates two to three times more operations data every year. Increasingly, SRE is seen as being critical for leveraging this data to automate systems administration, operations and incident response, and to improve enterprise reliability even as the IT environment becomes more complex.

A cloud-native development approach—specifically, building applications as microservices and deploying them in containers—can simplify application development, deployment and scalability. But cloud-native development also creates an increasingly distributed environment that complicates administration, operations and management. An SRE team can support the rapid pace of innovation enabled by a cloud-native approach and ensure or improve system reliability, without putting more operations pressure on DevOps teams.