Site reliability engineering (SRE), platform engineering, and DevOps are all related concepts around optimizing and giving structure to different aspects of the software delivery and software deployment lifecycles. Some of these terms are used interchangeably, and in some engineering organizations you might even find hybrid roles (“DevOps platform engineer” and so on).
“A platform engineer is a software engineer who builds tools. A DevOps engineer is a title created by companies who don't understand DevOps.”— Bob Dobbs, Quora
What is site reliability engineering?
SRE is commonly thought of as taking a software engineering approach to IT operations. SRE teams create and maintain the systems that allow applications to run reliably in production. The development of SRE is attributed to Google (which has literally written the books on the discipline), and it can be thought of as an evolution of production engineering.
Cynics call this a rebranding of production engineering (similar to how some companies started hiring “DevOps engineers” for what were essentially sysadmin roles), but there is a distinction. Where production engineers were previously responsible for keeping the production environment stable and up to date, the SRE role focuses more on proactive architecture and automation of production processes to optimize the uptime, scalability, and performance of your application.
SREs are accountable for metrics that give a fuller picture of how healthy and reliable an application is. The role of an SRE usually covers these common themes:
- Availability: Service level objectives (SLOs) set the target for what counts as an acceptable level of availability, while error budgets quantify how much downtime your users will tolerate within that target. Error budgets help teams strike a balance between reliability and innovation.
- Performance: If your application is reliably available, your SRE team might work on performance next, including improving page load times or latency.
- Monitoring and observability: These practices measure the health of your systems and allow SREs to get proactive alerts about performance issues and to troubleshoot them quickly.
- Incident response: While SREs aim to put the guardrails in place to prevent them, incidents aren’t always avoidable, so in the event of an outage or attack, SREs are usually the ones on call to respond. They may also conduct post-incident reviews.
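To make the error budget idea concrete, here is a minimal Python sketch of the arithmetic. The class name `ErrorBudget`, its methods, the 99.9% target, and the 30-day window are all illustrative assumptions, not values from any particular SRE team or tool.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical helper: turns an availability SLO into a downtime budget."""

    slo_target: float  # assumed availability target, e.g. 0.999 for "three nines"
    window_days: int   # assumed rolling window the SLO is measured over

    @property
    def allowed_downtime_minutes(self) -> float:
        # The budget is the share of the window that is allowed to be "bad".
        return (1.0 - self.slo_target) * self.window_days * MINUTES_PER_DAY

    def remaining_minutes(self, downtime_minutes: float) -> float:
        # Negative values mean the budget has been overspent.
        return self.allowed_downtime_minutes - downtime_minutes

    def has_budget_left(self, downtime_minutes: float) -> bool:
        # A team might pause risky releases once this returns False.
        return self.remaining_minutes(downtime_minutes) > 0


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, window_days=30)
    print(f"Allowed downtime: {budget.allowed_downtime_minutes:.1f} minutes")
    print(f"Left after a 20-minute outage: {budget.remaining_minutes(20):.1f} minutes")
    print(f"Budget left after 50 minutes down? {budget.has_budget_left(50)}")
```

Under these assumptions, a 99.9% SLO over 30 days leaves roughly 43 minutes of tolerated downtime. While budget remains, a team can keep shipping changes; once it is spent, reliability work takes priority over new features.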
What is platform engineering?
Let’s start with what a platform is:
“A digital platform is a foundation of self-service APIs, tools, services, knowledge and support which are arranged as a compelling internal product. Autonomous delivery teams can make use of the platform to deliver product features at a higher pace, with reduced co-ordination.”— Evan Bottcher, What I Talk About When I Talk About Platforms
Whether you intend to or not, you probably already have a platform of some description—possibly as many as the number of product engineering teams you have.
With the proliferation of cloud-native technologies and microservices, developers are able to self-serve a lot of the elements that make up the build and running of their applications, where previously ops teams would have defined these. Some developers welcome this freedom, while others grumble about DevOps forcing them to do operations work too (more on that below).
Without guardrails, this proliferation of technologies introduces complexity and impacts your resource spend. Platform teams emerged as a way to standardize and centralize companies’ platforms into a single, internal developer platform.
By having one team focused on building and maintaining the platform that integrates all of these services and enables continuous delivery, your developers can leverage the services to deliver their software more quickly, without the burden of learning Kubernetes, for example.
You can read more about platform engineering and whether it's the right fit for your organization in our Guide to Platform Engineering: Everyone’s Doing It, Should You Be Too?
SRE vs. platform engineering
Site reliability engineers (SREs) and platform engineers have common goals:
- automating processes
- reducing time spent on manual tasks
- minimizing room for human error
However, their areas of focus are different.
While platform engineers are responsible for building and maintaining the platform that enables self-service for developers, SREs’ core responsibility is right there in their title: site reliability. SRE teams are commonly evaluated on the reliability, scalability, and availability of the application or product.
Bjorn Freeman-Benson and Richard Li of Ambassador Labs also note that a motivation for building these teams separately is the difference between the types of people who thrive as platform engineers versus SREs:
“While both SREs and platform engineers need strong systems engineering skills in addition to classic programming skills … SREs tend to enjoy crisis management and get an adrenaline rush out of troubleshooting an outage … On the other hand, platform engineers are more typical software engineers, preferring to work without interruption on big, complex problems.”— Bjorn Freeman-Benson and Richard Li, SRE vs. Platform Engineering
Platform engineering can support site reliability engineering by making it easier for developers to ship production-ready code while abstracting away the intricacies of software delivery.
What is DevOps?
DevOps is a set of practices that aim to bridge the divide between software development and operations teams.
Before DevOps, development (dev) and operations teams (ops) worked in silos: developers would write code without understanding the production environment where the code would eventually run, and operations teams had little insight into how the software actually worked, despite being responsible for its successful deployment.
This disconnect resulted in errors in production and inefficient debugging. Enter DevOps, with its focus on improving collaboration between the two disciplines, streamlining the software delivery lifecycle, and automating repetitive tasks to minimize room for human error. In many ways, the ideal outcomes of DevOps practices and platform engineering are the same.
Platform engineering vs. DevOps
DevOps predates platform engineering, having been around since the late 2000s, while platform engineering is still in the early stages of adoption, having just entered the Gartner® Hype Cycle™ for Software Engineering in 2022.
DevOps culture has often been poorly implemented, or given lip service without true organizational change, as the quote at the beginning of this post suggests.
We explore this more in our guide “What Is DevOps? (And How To Set Your Organization up for DevOps Success).”
It’s not surprising then that industry professionals are eager to find and adopt DevOps’ successor: the practice or methodology that will deliver the rewards that didn’t materialize from DevOps adoption. For many, platform engineering is promising: seen as an evolution or extension of DevOps that can help organizations achieve the reduced overhead, improved coordination, and faster cycle times they seek.
Platform engineering = DevOps in practice?
“As we know, DevOps uses tools to streamline deployment, management and monitoring using automation and visualization. Platform engineering takes these tools, processes and best practices and productizes them as reusable services and tools for use across the different engineering teams and use cases in the organization.”—Dotan Horovits, Platform Engineering: DevOps evolution or a fancy rename?
Platform engineering efforts can enable your developers to manage their applications in production without having to understand the entire ecosystem. By creating and maintaining the internal developer platform and automating away the burdensome tasks that development teams don’t want to take on, your platform team also frees up operations teams from having to put out fires or serve as internal support for developers whose code isn’t behaving in production.
A platform engineering team on its own isn’t going to create the cultural or organizational shifts proposed by DevOps; but it can support these efforts.
“The existence of a platform team does not inherently unlock higher evolution DevOps; however, great platform teams scale out the benefits of DevOps initiatives.”—2021 State of DevOps
SRE vs. DevOps?
“SRE works from Production backward. DevOps works from development forward. Somewhere in the middle, they meet.”— Gary Pochron
There is some overlap, but the two disciplines focus on systematizing and supporting different stages of the software development and software delivery lifecycles. DevOps teams and platform teams are concerned with enabling developers to focus on application development and increase their velocity, allowing them to get new features and functionality to production quickly and successfully. SREs are responsible for a healthy, reliable production environment for that new code to run in.
For more, check out a recent webinar below in which we spoke with author Gene Kim on the future of DevOps.