
A primer on systems quality and monitoring

6 min read
Bruno Felix
Digital plumber, organizational archaeologist and occasional pixel pusher

An important part of the job of a software engineer is to ensure that our systems operate as expected. Establishing a feedback loop between the features a team develops and how they behave in production is one of the traits of high performing teams1, and the idea that "you build it, you operate it" goes a long way towards alleviating the toxic interaction patterns between traditional development and operations teams.

At the time of this writing, the excesses of a zero interest rate policy are being purged (often excessively so) from organizations. Throughout this industry-wide adjustment process, it is common to find teams of developers that are missing the critical know-how, mental models and hard-earned experience of senior devops practitioners. This situation is made worse by the fact that, by and large, practices around operations, monitoring and observability are taught on the job. So teams are left to sink or swim - they lack a basic theory on how to approach this problem.

In this post I will briefly introduce some of my ideas on this topic, hopefully serving as a primer for you to think about and dive deeper into this subject.

Operations are hard! Software systems, no matter how well structured, operate in a messy organizational context. Decisions that made sense two months ago can become viability-crushing because a competitor launched a new product, new regulation was passed, or a war started (see the concept of hyperliminal systems2). Systems also operate at capacity3 - new practices/technologies/methodologies open new perceived opportunities to create value, and systems and teams are expected to incorporate them (we're seeing this now with Generative AI). On top of this, the hard work of keeping systems up and running is often invisible. As with much of the infrastructure that makes modern life possible, we only become aware of its existence and its complexities, sometimes acutely so, when something breaks.

So, why do we need operations in the first place? It's hard, often unappreciated work, so what's the deal? I will argue that if you care about the "quality" of what you deliver - as any software engineer should - you definitely need to care about the operational aspects of the systems under your purview. This leads us to an interesting question: what is "quality"? If you ask your teams this question - teams that are usually under pressure to deliver the next shiny feature - you may get some blank stares, puzzled looks and mumbled, semi-articulated responses. The pragmatic definition I would like to offer in this context is that "quality" is the result of a system operating according to its specification (this includes system outputs, behavior and resource usage). For example, if you care about serving web pages, quality may be defined by a combination of response time (99% of requests are served under X milliseconds) and errors (99.99% of requests return an HTTP status 200 to the client).
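
To make that example concrete, here is a minimal sketch of how those two indicators could be computed over a batch of request records. The record fields (status_code, latency_ms) and the 150 ms threshold are illustrative assumptions, not part of the original example.

```python
# A minimal sketch: computing the two example indicators over a batch of
# request records. The Request fields and the threshold are illustrative
# assumptions, not a prescription.
from dataclasses import dataclass

@dataclass
class Request:
    status_code: int
    latency_ms: float

def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Fraction of requests served under the latency threshold."""
    if not requests:
        return 1.0
    return sum(r.latency_ms < threshold_ms for r in requests) / len(requests)

def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that returned HTTP 200 to the client."""
    if not requests:
        return 1.0
    return sum(r.status_code == 200 for r in requests) / len(requests)

sample = [Request(200, 42.0), Request(200, 180.0), Request(500, 95.0)]
print(f"latency SLI  (<150 ms): {latency_sli(sample, 150):.2%}")
print(f"availability SLI:       {availability_sli(sample):.2%}")
```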

As with a lot of other software engineering related activities, it is important to keep in mind that you will probably benefit from an iterative approach to understanding what "quality" entails in your specific context. This is an exercise in progressively getting a better understanding of your system (until you are fundamentally surprised and need to ditch some of your assumptions). Formulate an hypothesis on what makes sense to measure, implement it, and reflect on whether or not your expectations were met. This disciplined approach of creating feedback loops to make sense of your system and course-correct is really at the heart of a healthy software engineering practice.

Equipped with this definition and methodology, we can start probing in more concrete directions in order to paint a more complete picture of the quality aspects of a system. As per the definition of quality introduced above, there are two facets to monitor:

  • System outputs and behavior: is the system achieving its goal, i.e. satisfying its clients, with acceptable performance?

    • What indicators reflect the customer/stakeholder's experience of my system? Often you may have to resort to proxy indicators, and that's okay. Equipped with a careful selection of indicators, you can craft meaningful Service Level Objectives. For a deep dive into this topic I recommend this book4;
    • An often overlooked aspect is the quality of the data your system produces. Returning an HTTP status 200 100% of the time but producing bad/corrupted/inconsistent data 10% of the time is not great. Enforcing strict schemas is a great starting point - these act as contracts between the system and its clients - and in many cases that is all you need. However, it may be insufficient: responses are often complex, with optional fields and subtle dependencies between values that are hard to capture with tools like JSON Schema or interface definition languages like Avro/Protobuf (a small sketch of such a cross-field check follows this list). There are also interesting practices that can be adopted from the data engineering and data science worlds, like data quality scores5.
  • Resource usage: is the goal being achieved within the economic constraints of the organization?

    • This is typically where teams start their monitoring and observability journey, as these metrics are usually readily available. There are several good methodologies that inform how you can monitor resources (e.g. CPU, RAM) and how to use them when faced with an operational issue (e.g. the USE or the RED methods6); a sketch of RED-style counters also follows this list.
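
To illustrate the data-quality point above, here is a sketch of layering a simple cross-field consistency check on top of structural validation. The record shape (an order with line items and a total) and the field names are made-up examples; a real schema or IDL would take the place of the structural check.

```python
# A sketch of layering a cross-field consistency check on top of structural
# validation. The order/line-item shape and field names are made up for
# illustration; a real schema (JSON Schema, Avro, Protobuf) would replace
# structurally_valid().
from typing import Any

def structurally_valid(order: dict[str, Any]) -> bool:
    """Rough stand-in for what a schema or IDL would enforce."""
    return (
        isinstance(order.get("order_id"), str)
        and isinstance(order.get("items"), list)
        and isinstance(order.get("total_cents"), int)
    )

def semantically_consistent(order: dict[str, Any]) -> bool:
    """A dependency between values that a schema alone won't catch:
    the declared total must match the sum of the line items."""
    computed = sum(item["price_cents"] * item["qty"] for item in order["items"])
    return order["total_cents"] == computed

order = {
    "order_id": "A-123",
    "items": [{"price_cents": 500, "qty": 2}, {"price_cents": 250, "qty": 1}],
    "total_cents": 1000,  # inconsistent: the items add up to 1250
}

if structurally_valid(order) and not semantically_consistent(order):
    print("structurally fine, but the payload is inconsistent")  # this prints
```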
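
And to illustrate the RED method mentioned in the last bullet, here is a minimal, in-memory sketch of tracking Rate, Errors and Duration for a single endpoint. In a real system these numbers would be exported to a metrics backend; the class and method names here are assumptions made for the example.

```python
# A minimal, in-memory illustration of the RED method (Rate, Errors,
# Duration) for a single endpoint. In practice these would be exported to a
# metrics system; the names are made up for illustration.
from statistics import median

class RedTracker:
    def __init__(self) -> None:
        self.requests = 0                     # R: request count (rate = count / window)
        self.errors = 0                       # E: failed requests
        self.durations_ms: list[float] = []   # D: per-request latency samples

    def observe(self, duration_ms: float, ok: bool) -> None:
        self.requests += 1
        self.errors += 0 if ok else 1
        self.durations_ms.append(duration_ms)

    def summary(self, window_s: float) -> str:
        rate = self.requests / window_s
        error_ratio = self.errors / self.requests if self.requests else 0.0
        return (f"rate={rate:.1f} req/s, errors={error_ratio:.1%}, "
                f"median duration={median(self.durations_ms):.0f} ms")

tracker = RedTracker()
for duration, ok in [(40, True), (55, True), (900, False)]:
    tracker.observe(duration, ok)
print(tracker.summary(window_s=1.0))  # rate=3.0 req/s, errors=33.3%, median duration=55 ms
```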

The list above is obviously incomplete, and it should be viewed as a starting point for a conversation rather than an authoritative recipe on what to measure - remember: context is king, so your mileage will vary.

The most important takeaway is the lens with which teams can chip away at this problem. It boils down to a pragmatic use of the scientific method7: formulate an hypothesis regarding what indicators/facets of your system you need to monitor to achieve a certain objective, implement it, observe the results and reflect. Rinse and repeat.


Footnotes

  1. Accelerate

  2. Residues: Time, Change, and Uncertainty in Software Architecture

  3. Beyond Simon's Slice: Five Fundamental Trade-Offs that Bound the Performance of Macrocognitive Work Systems

  4. Implementing Service Level Objectives

  5. Data Quality Score: The next chapter of data quality at Airbnb

  6. Systems Performance: Enterprise and the Cloud - Chapter 2

  7. Strictly speaking, in this case we are not trying to disprove an hypothesis. It's more about empirically making sense of what works in your context, and experimenting in small steps.