Site Reliability Engineering, Operations, and Production Engineering

CSC491, University of Toronto


We will talk about:


Site Reliability Engineering (SRE)


What is Site Reliability Engineering (SRE)?

SRE is what you get when you treat operations as if it’s a software problem.

This is a phrase that was coined by Google: https://landing.google.com/sre/


What is Site Reliability Engineering (SRE)?

“Site Reliability Engineering seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users’ overall happiness—with features, service, and performance—is optimized.”
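
One way to make "the risk of unavailability" concrete is to turn an availability target into the amount of downtime it tolerates. The sketch below is only an illustration: the function name `allowed_downtime_minutes`, the 30-day window, and the example targets are assumptions, not something stated in this lecture.

```python
def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability target tolerates over a window.

    availability_target is a fraction, e.g. 0.999 for "three nines".
    The 30-day default window is an illustrative assumption.
    """
    window_minutes = window_days * 24 * 60
    return (1.0 - availability_target) * window_minutes


if __name__ == "__main__":
    # Hypothetical targets, chosen only to show how quickly the budget shrinks.
    for target in (0.99, 0.999, 0.9999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.2%} over 30 days -> {minutes:.1f} minutes of downtime")
```

A 99.9% target over 30 days leaves roughly 43 minutes of downtime. Each extra nine cuts that by a factor of ten, which is why a stricter target trades directly against how quickly you can ship changes.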


What is Site Reliability Engineering (SRE)?

“Here’s what you do when someone breaks something or finds something very difficult to debug: You say thank you. Thank you for finding this edge case. Thank you for highlighting this overcomplicated part of our system. Thank you for pointing out this gap in our docs. And then you go make it so nobody can break it the same way again.”

Tanya Reilly, Google SRE 2005 - 2018


Production Engineering


What is Production Engineering?

Production engineering is a combination of manufacturing technology and engineering sciences with management science. A production engineer typically has a wide knowledge of engineering practices and is aware of the management challenges related to production.


Operations


What is Operations?

This is a more traditional model of running hardware and servers.

In this model, a person or a team is in charge of running the servers that host your software. Often they handle deploys as well.


What is Operations?

This generally scales poorly: you must employ more and more operations people to manage an increasing load, instead of automating the work with software.


My Opinion of Operations

Personally, I’m not a fan of operations-style management of production systems. It doesn’t scale very well, and it requires people to do something that servers do better and with fewer errors: click-and-type configuration.

This may be easier at the beginning, though. Hiring a contractor or part-time person whose job it is to keep the servers running is much easier. When you have few servers, this role is easy to isolate from the rest of your product team, and it lets you focus on your product.


Production Engineering vs. SRE


Production Engineering vs. SRE

Honestly, the systems are very similar. There are shared concepts and each system has pros and cons.


Production Engineering vs. SRE

I’m going to use an example from a Quora answer from a lead recruiter at Facebook. I’ll supplement it with my experience as a Production Engineer at Shopify.

The link to the Quora answer can be found here


The fundamental differences


The fundamental differences

Both are driven by engineering decisions, not hardware-related ones. A good example of this is a system that Facebook built to automate the work of 300 sysadmins: it finds, tests, and restarts servers in production autonomously.

Software is good at detecting repeatable patterns and performing tedious tasks, and it can do both 24/7.
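
To give a feel for what "find, test, and restart" can look like in code, here is a minimal sketch of a remediation loop. It is not Facebook's system: the `Server` class, `check_health`, `restart`, and the `max_restarts` limit are all hypothetical stand-ins, and the "fleet" is simulated in memory.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Server:
    """Hypothetical in-memory stand-in for a production server."""
    name: str
    healthy: bool = True
    restarts: int = 0


def check_health(server: Server) -> bool:
    """Stand-in for a real probe (HTTP check, heartbeat, and so on)."""
    return server.healthy


def restart(server: Server) -> None:
    """Stand-in for a real restart; here it simply marks the server healthy."""
    server.restarts += 1
    server.healthy = True


def remediate(fleet: List[Server], max_restarts: int = 3) -> List[str]:
    """Find unhealthy servers, restart them, and test them again.

    Returns the names of servers that still need a human, either because
    they already hit the restart limit or because the restart did not help.
    """
    needs_human = []
    for server in fleet:
        if check_health(server):
            continue  # nothing to do for healthy servers
        if server.restarts >= max_restarts:
            needs_human.append(server.name)  # stop restarting, escalate instead
            continue
        restart(server)
        if not check_health(server):
            needs_human.append(server.name)
    return needs_human


if __name__ == "__main__":
    fleet = [
        Server("web-1"),
        Server("web-2", healthy=False),
        Server("web-3", healthy=False, restarts=3),
    ]
    print("Escalate to a human:", remediate(fleet))
```

The restart limit is the important design choice in this sketch: automation handles the repeatable cases, and anything that keeps failing is handed to a person instead of being restarted forever.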


Tooling and Automation

SRE and Production Engineering both rely heavily on tooling and detection that run automatically and autonomously. You’ll find that both areas tend to invest heavily in orchestration software (Packer, Chef, Kubernetes, etc.), monitoring and alerting (observability), and all things tooling and automation.

When you can save everyone a second a day, and you have 1000 developers, you’ve actually saved about 17 minutes every day. Over a year of working days, that adds up to about 68 hours (more than 1.5 work weeks!) of developer time.
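
The arithmetic behind those numbers can be checked with a few lines of Python. The figure of 245 working days per year is an assumption made here so that the yearly total lines up with the roughly 68 hours quoted above; the function names are only illustrative.

```python
SECONDS_SAVED_PER_DEVELOPER_PER_DAY = 1
DEVELOPERS = 1000
WORKING_DAYS_PER_YEAR = 245  # assumption: roughly a year minus weekends and holidays


def daily_savings_minutes(developers: int, seconds_each: float) -> float:
    """Total time saved across all developers in one day, in minutes."""
    return developers * seconds_each / 60


def yearly_savings_hours(developers: int, seconds_each: float, working_days: int) -> float:
    """Total time saved across all developers in one working year, in hours."""
    return developers * seconds_each * working_days / 3600


if __name__ == "__main__":
    per_day = daily_savings_minutes(DEVELOPERS, SECONDS_SAVED_PER_DEVELOPER_PER_DAY)
    per_year = yearly_savings_hours(
        DEVELOPERS, SECONDS_SAVED_PER_DEVELOPER_PER_DAY, WORKING_DAYS_PER_YEAR
    )
    print(f"Saved per day:  {per_day:.1f} minutes")
    print(f"Saved per year: {per_year:.1f} hours")
```

The same functions make it easy to plug in a bigger saving, such as a test suite that finishes 30 seconds faster, and see how quickly small improvements pay for the tooling work.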


Teams and Focuses


Teams and Focuses

This section will focus on teams and skillsets I’ve experienced while working at Shopify, consulting, and working with various companies.


Teams


Teams cont.


Datastores


Networking


Cloud (AWS, Google Cloud, Azure)

Manages integration with cloud services, often using orchestration software.


Continuous Integration + Deployment

Scales the CI systems used for testing, decreases the time spent on test runs, and keeps deployment to production quick and easy.


Developer Tooling


Regulated Systems (PCI Compliance, SOX compliance, etc)

I’ve seen regulated systems (e.g. PCI compliance for credit card processing) managed by separate teams. This reduces the number of people who have access to sensitive systems.


Observability (Logging and Monitoring)

Logging and monitoring may not seem like a big problem to solve, but each request to a system will often emit dozens of different telemetry metrics and dozens of log lines. This means that the logging and monitoring systems themselves need to handle a large load.

This team ensures that logging and monitoring can keep up with the data thrown at them and can respond quickly to people who need to use that data.
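
A quick back-of-the-envelope estimate shows why this load adds up. The traffic figure and the per-request counts below are made-up illustrative numbers, not measurements from any real system, and the function name is hypothetical.

```python
def telemetry_events_per_second(
    requests_per_second: int,
    metrics_per_request: int,
    logs_per_request: int,
) -> int:
    """Metrics plus log lines the observability stack must ingest each second."""
    return requests_per_second * (metrics_per_request + logs_per_request)


if __name__ == "__main__":
    # Hypothetical workload: 10,000 requests/s, each emitting
    # two dozen metrics and two dozen log lines.
    events = telemetry_events_per_second(
        requests_per_second=10_000,
        metrics_per_request=24,
        logs_per_request=24,
    )
    print(f"{events:,} telemetry events per second")
    print(f"{events * 86_400:,} telemetry events per day")
```

Under these assumptions the observability systems see 48 times more events than the application sees requests, which is why they often need their own team and their own scaling plan.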


Language and Platform

Sometimes when your company gets big enough, it starts to drive the full feature development of a framework or language.

Both GitHub and Shopify employ core contributors to Ruby and Rails. This allows them to remove the pain points that reduce the productivity of their own engineers, and also to give those improvements back to the world through open source.

A recent example can be seen in Rails 6. GitHub has a multi-database configuration, which Rails did not support. GitHub’s engineer, Eileen, worked tirelessly with her team to integrate this functionality into the Rails framework. Now everyone gets to use this feature for free, and GitHub doesn’t have to maintain something that only they use.


Resources