In the traditional software development process, development and IT operations teams work in silos throughout the application development life cycle. For small-scale software projects, this separation may not cause significant problems. But as projects scale up and systems grow more complex, development and operations teams need to work together from the development phase through to operation in order to deliver a reliable product. Newer software development approaches such as DevOps and Site Reliability Engineering (SRE) have been introduced to remove the barriers between these teams. SRE creates a link between development and operations to improve the reliability of software applications.
The concept of SRE was first introduced in 2003 by Ben Treynor Sloss, VP of engineering at Google. When he joined Google, he was appointed to run a production team of seven engineers, and this team was responsible for ensuring that Google was always available, performant, and efficient.
Site reliability engineering is a discipline that applies software engineering principles to IT operations tasks such as technical management, change management, monitoring, emergency response, and capacity planning. In an interview, Ben Treynor described the function of SRE as follows:
Fundamentally, it's what happens when you ask a software engineer to design an operations function. When I came to Google, I was fortunate enough to be part of a team that was partially composed of folks who were software engineers and who were inclined to use the software as a way of solving problems that had historically been solved by hand. So when it was time to create a formal team to do this operational work, it was natural to take the "everything can be treated as a software problem" approach and run with it.
So SRE is fundamentally doing work that has historically been done by an operations team but using engineers with software expertise and banking on the fact that these engineers are inherently both predisposed to and have the ability to substitute automation for human labor.
Although Google originated the concept of SRE, other tech giants such as Amazon and Netflix later adopted it, and it has now become an industry standard. However, although other tech companies follow Google’s SRE guidelines, they have implemented their own SRE strategies based on the nature of their businesses.
Typically, development teams want to make new software applications or updates available for use as soon as possible. IT operations teams, on the other hand, want to ensure that new software or updates meet the service standards agreed in the Service-Level Agreement (SLA). This creates tension between the two teams, and the SRE approach seeks to shrink this gap. SRE tries to reduce toil by automating manual and repetitive operational tasks in areas such as capacity planning, performance planning, disaster response, and on-call monitoring. Automation improves the efficiency and reliability of such operational tasks. SRE uses several essential tools to ensure the reliability and availability of systems.
Service-Level Indicator (SLI): Availability is fundamental to any system, because an unavailable system cannot accomplish any of its intended tasks. An SLI is a quantitative measure of the level of service provided, such as availability or uptime.
Service-Level Objective (SLO): An SLO is the numerical target for system availability, which is set when the SRE terms for a service are defined.
Service-Level Agreement (SLA): An SLA is a business-level contract between the service provider and the customer that defines the consequences if the service fails to meet its SLOs.
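The relationship between these three terms can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `ServiceWindow` shape, the request counts, and both target values are hypothetical, and the sketch assumes that availability is measured as the fraction of successful requests and that the contractual SLA level is looser than the internal SLO.

```python
from dataclasses import dataclass


@dataclass
class ServiceWindow:
    """Request counts for one measurement window (hypothetical data shape)."""
    total_requests: int
    successful_requests: int


def availability_sli(window: ServiceWindow) -> float:
    """SLI: fraction of requests served successfully in the window."""
    if window.total_requests == 0:
        return 1.0  # assumption: a window with no traffic counts as available
    return window.successful_requests / window.total_requests


def meets_target(sli: float, target: float) -> bool:
    """True when the measured SLI is at or above the given target."""
    return sli >= target


# Illustrative targets only; not taken from any real contract.
SLO_TARGET = 0.999  # internal objective
SLA_TARGET = 0.995  # assumed looser level promised to customers

window = ServiceWindow(total_requests=1_000_000, successful_requests=998_000)
sli = availability_sli(window)
print(f"SLI = {sli:.4f}")
print("SLO met:", meets_target(sli, SLO_TARGET))
print("SLA met:", meets_target(sli, SLA_TARGET))
```

With these illustrative numbers the measured availability is 99.8%, which misses the 99.9% objective but stays above the assumed 99.5% contractual level, so the team would need to act before any SLA consequences apply.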
The SRE approach draws on both development and IT operations teams by removing the barriers between them. To do that, site reliability engineers must be knowledgeable about both software development and IT operations. According to Google’s SRE discipline, site reliability engineers should spend at least 50% of their time on development work. The remaining time should be spent on IT operations work such as on-call duties, documentation, and resolving support escalations. Based on SRE fundamentals such as SLIs, SLOs, and SLAs, site reliability engineers take preparatory actions before releasing new software or updates. Automation reduces the workload of all teams, so SREs also look for opportunities to automate repetitive manual tasks.
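One way to act on these fundamentals before a release is to check how much failure the SLO still tolerates. The sketch below assumes Google's "error budget" idea, a term not named in the text above: the budget is the number of requests allowed to fail under the SLO, that is, (1 - SLO) multiplied by the total request count. The function names, the request counts, and the 99.9% target are all hypothetical.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Return the unspent fraction of the error budget for one window.

    The budget is the number of failures the SLO tolerates:
    (1 - slo_target) * total. A negative result means it is overspent.
    """
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures <= 0:
        raise ValueError("SLO target must be below 1.0 and total must be positive")
    return 1.0 - failed / allowed_failures


def can_release(slo_target: float, total: int, failed: int) -> bool:
    """Hypothetical release gate: allow a release only while budget remains."""
    return error_budget_remaining(slo_target, total, failed) > 0.0


# Illustrative window: 1,000,000 requests against a 99.9% SLO
# tolerates roughly 1,000 failed requests.
print(can_release(0.999, 1_000_000, failed=400))    # budget left
print(can_release(0.999, 1_000_000, failed=1_500))  # budget overspent
```

A gate like this makes the release decision depend on measured reliability rather than on negotiation between teams. Real policies usually add grace periods and exceptions that this sketch leaves out.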
Site reliability engineers are involved in resolving support escalations. Since they have in-depth knowledge of the system, they can fix the issues themselves or redirect them to the relevant teams. Being available to respond to production incidents helps maintain the reliability and performance of services. To do that, SREs need to perform on-call duties such as diagnosing and mitigating incidents, repairing production problems, and automating operational tasks. By performing both development and IT operations tasks, SREs gain substantial knowledge of multiple domains, and it’s good practice to document this knowledge for future use.
Both the SRE and DevOps approaches try to solve a common set of problems: they remove the barriers between software development and IT operations and create a bridge to link them. Like SRE, DevOps is a set of guidelines designed to break down silos in the application development life cycle. Although both disciplines address a similar issue, DevOps doesn’t define solutions in detail. SRE, on the other hand, describes how to put the DevOps philosophies into practice. As Google puts it, if DevOps is like an interface in a programming language, then "class SRE implements DevOps." Google illustrates the five DevOps pillars and the corresponding SRE practices in the table below:
DevOps | SRE |
Reduce organizational silos | Share ownership with developers by using the same tools and techniques across the stack. |
Accept failure as normal | Have a formula for balancing accidents and failures against new releases |
Implement gradual change | Encourage moving quickly by reducing costs of failure. |
Leverage tooling & automation | Encourages "automating this year's job away" and minimizing manual systems work to focus on efforts that bring long-term value to the system |
Measure everything | Believes that operations is a software problem and defines prescriptive ways for measuring availability, uptime, outages, toil, etc. |
Source: https://cloud.google.com/blog/
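The "class SRE implements DevOps" analogy can be taken literally as a small Python sketch. The class and method names below are hypothetical and chosen only to mirror the table. The abstract class names the five pillars without saying how to achieve them, and the concrete class supplies the practice for each, shortened from the table.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The 'interface': names the five pillars but prescribes no method."""

    @abstractmethod
    def reduce_organizational_silos(self) -> str: ...

    @abstractmethod
    def accept_failure_as_normal(self) -> str: ...

    @abstractmethod
    def implement_gradual_change(self) -> str: ...

    @abstractmethod
    def leverage_tooling_and_automation(self) -> str: ...

    @abstractmethod
    def measure_everything(self) -> str: ...


class SRE(DevOps):
    """The 'implementation': one concrete practice per pillar."""

    def reduce_organizational_silos(self) -> str:
        return "Share ownership with developers using the same tools across the stack"

    def accept_failure_as_normal(self) -> str:
        return "Balance accidents and failures against new releases with a formula"

    def implement_gradual_change(self) -> str:
        return "Move quickly by reducing the cost of failure"

    def leverage_tooling_and_automation(self) -> str:
        return "Automate this year's job away and minimize manual systems work"

    def measure_everything(self) -> str:
        return "Define prescriptive measures for availability, uptime, outages, and toil"


sre = SRE()
for pillar in (
    sre.reduce_organizational_silos,
    sre.accept_failure_as_normal,
    sre.implement_gradual_change,
    sre.leverage_tooling_and_automation,
    sre.measure_everything,
):
    print(f"{pillar.__name__}: {pillar()}")
```

Instantiating `DevOps` directly raises a `TypeError`, which matches the point of the analogy: the philosophy alone cannot be "run" until something implements it.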
When an organization decides to adopt SRE as a new IT strategy, it should understand that doing so requires a significant change in its culture. If the organization has already implemented a DevOps solution, shifting to SRE may incur additional cost.
SRE benefits organizations in many ways, from performance improvement to revenue generation. These are some of the key benefits SRE delivers to organizations: