SRE (site reliability engineering) is a software engineering approach to IT operations. It takes tasks that operations teams have traditionally done manually and gives them to engineers or Ops teams who solve them with software and automation.
Definition of Site Reliability Engineering
The concept of SRE was first introduced in 2003. It came from the Google engineering team, and the concept itself is credited to Ben Treynor Sloss. Originally, it was a framework to support developers in building large-scale applications. Today, site reliability engineering is carried out by specialists who apply software engineering practices to common operations problems. These software and systems engineers have a wide range of responsibilities: writing the code, shipping it, and owning it in production.
Site reliability engineering teams use software as a tool to manage systems and automate tasks. SRE is a valuable practice when building scalable software systems because it lets you manage large systems through code. For admins who manage many machines, code scales better and is easier to sustain than manual work. Some engineers describe SRE as a more proactive form of QA. SRE focuses on improving software system reliability across several key categories: performance, availability, latency, capacity, efficiency, and incident response. Its most important objective is a highly reliable and scalable application or software system.
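To make two of these categories concrete, here is a minimal Python sketch of how availability and latency might be checked in code rather than by hand. The function names, the 99.9% availability target, and the 300 ms 95th-percentile latency target are illustrative assumptions, not values from any particular SRE team or tool.

```python
import math


def availability(successful: int, total: int) -> float:
    """Fraction of requests served successfully (1.0 when there was no traffic)."""
    if total == 0:
        return 1.0
    return successful / total


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_targets(
    successful: int,
    total: int,
    latencies_ms: list[float],
    min_availability: float = 0.999,  # assumed target, for illustration only
    max_p95_ms: float = 300.0,        # assumed target, for illustration only
) -> bool:
    """True when both the availability and the p95 latency targets are met."""
    return (
        availability(successful, total) >= min_availability
        and percentile(latencies_ms, 95) <= max_p95_ms
    )
```

A check like this could run on a schedule against real request counts and latency samples, so that a drop in reliability is detected automatically instead of through manual inspection.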
Why implement SRE?
First, site reliability engineering helps developers balance releasing new functionality with making sure that it is reliable for users. SRE also focuses on automation: one of its goals is to reduce duplicated or redundant effort as much as possible. Automating manual tasks lets you allocate resources more efficiently and gives developers more time to innovate. As a result, companies that implement SRE can deliver new features to production more quickly. The approach also prioritizes keeping the platform or service running, since one of SRE's goals is to find the best ways to prevent problems that cause downtime. Finally, SRE bridges the gap between Dev and Ops, because both teams share responsibility for detecting reliability and performance issues early in the life cycle.