In the book Site Reliability Engineering, contributor Benjamin Treynor Sloss, who coined the term “Site Reliability Engineering,” explains how SRE emerged at Google:
SRE is what happens when you ask a software engineer to design an operations team. When I joined Google in 2003 and was tasked with running a “Production Team” of seven engineers, my entire life up to that point had been software engineering. So I designed and managed the group the way I would want it to work if I worked as an SRE myself. That group has since matured to become Google’s present-day SRE team, which remains true to its origins as envisioned by a lifelong software engineer.
SRE arose partly in response to the division between product development and operations teams. Treynor Sloss describes this division in Site Reliability Engineering:
At their core, the development teams want to launch new features and see them adopted by users. At their core, the ops teams want to make sure the service doesn’t break while they are holding the pager. Because most outages are caused by some kind of change—a new configuration, a new feature launch, or a new type of user traffic—the two teams’ goals are fundamentally in tension.
But what would happen if these teams weren’t “fundamentally in tension”? How might that improve product development, operations, and the business itself? Treynor Sloss continues in Site Reliability Engineering:
Conflict isn’t an inevitable part of offering a software service. Google has chosen to run our systems with a different approach: our Site Reliability Engineering teams focus on hiring software engineers to run our products and to create systems to accomplish the work that would otherwise be performed, often manually, by sysadmins.