Building your Observability plan
The key to running a successful business on the internet is knowing, in real time, how your business is doing. Knowing whether your key transactions are healthy and operating around the expected level, and getting fast, relevant notifications to the right people when they are not, can make the difference between succeeding and failing.
But how can you achieve this level of awareness? Observability.
# Definition
Observability: a measure of how well the internal states of a system can be determined from its external outputs.
Put in simpler terms, we want to be able to figure out what's going on in our system or application by interpreting the information gathered from the system itself, including measurements and failure-mode outputs.
In our IT context, that information comes from many different sources, such as infrastructure logs, application traces and metrics from the environment where the application runs, as well as from complementary systems such as third-party dependencies (e.g. email relays or external data processing pipelines).
# Elements of an Observability plan
Observability strategies are constructed on top of three core practices:
- Instrumentation: the process of adding code that writes trace information, so that a product's performance can be measured and its errors diagnosed.
- Monitoring: the process of gathering metrics about the operation of an IT environment's hardware and software to ensure everything functions as expected.
- Alerting: the process, inside monitoring tools, that generates alerts to notify your team of changes, high-risk actions and/or failures in the IT environment.
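To make the alerting practice more concrete, here is a minimal sketch of a threshold-based rule evaluator. Everything in it is an assumption made for illustration: the `AlertRule` type, the metric names, the reference ranges and the team names are hypothetical, and a real monitoring tool would provide its own rule format.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    metric: str      # hypothetical metric name
    minimum: float   # lowest value considered normal
    maximum: float   # highest value considered normal
    notify: str      # hypothetical team that should receive the alert


def evaluate(rules, samples):
    """Return (recipient, message) pairs for every rule that is violated.

    `samples` maps metric names to their latest observed value.
    A missing sample is reported too, since silence is often a failure.
    """
    alerts = []
    for rule in rules:
        value = samples.get(rule.metric)
        if value is None:
            alerts.append((rule.notify, f"{rule.metric}: no data"))
        elif not rule.minimum <= value <= rule.maximum:
            alerts.append(
                (rule.notify,
                 f"{rule.metric}={value} outside [{rule.minimum}, {rule.maximum}]")
            )
    return alerts
```

Instrumentation would be what produces the values in `samples`, monitoring what collects them, and alerting what decides who needs to know.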
# Planning your Observability Strategy
It's important to understand that this strategy is not only a technical responsibility. The primary goal of an observability strategy is business continuity, so the first item on the list should be to bring business and product people into the process.
A basic workflow to define this strategy could be:
- Identify your business critical flows
- Identify the systems and applications involved in each of the flows, including any third-party dependencies
- Define the different states of impairment you want to identify, e.g. HEALTHY, IMPAIRED and DOWN
- Map the impact of each system and application on the state of each flow
- Define the metrics, logs and checks, along with the reference values to compare against, and instrument them for each component involved
- Gather all the relevant data, set up alerting for all relevant scenarios, and assign those alerts to the right people in your response team: SREs, application developers/maintainers, business decision makers, etc.
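The mapping between component states and flow states can be sketched in a few lines. This is only an illustration under stated assumptions: the `CHECKOUT_FLOW` definition, the component names and the rule "a non-critical component that is DOWN only degrades the flow" are hypothetical choices, not a prescribed model.

```python
from enum import IntEnum


class State(IntEnum):
    HEALTHY = 0
    IMPAIRED = 1
    DOWN = 2


# Hypothetical flow definition: component name -> is the flow unusable without it?
CHECKOUT_FLOW = {
    "web-frontend": True,
    "payment-gateway": True,
    "email-relay": False,  # receipts can be delayed without blocking sales
}


def flow_state(flow, component_states):
    """Derive the state of a business flow from the states of its components."""
    worst = State.HEALTHY
    for component, critical in flow.items():
        # A component we have no data for is treated as DOWN (assumption).
        state = component_states.get(component, State.DOWN)
        # A non-critical component that is DOWN only degrades the flow.
        if not critical and state == State.DOWN:
            state = State.IMPAIRED
        worst = max(worst, state)
    return worst
```

With this mapping, an outage of the hypothetical email relay would report the checkout flow as IMPAIRED, while an outage of the payment gateway would report it as DOWN.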
# What to measure?
That's a complex question to answer, as each system requires its own specific set of measures/KPIs. My suggestion is to focus first on the most relevant and obvious things, then refine the list iteratively.
No matter what, all observability builds on top of three basic telemetry elements:
- Logs: Time-stamped records of events inside a system
- Metrics: Numeric values sampled over time. These can include many different measures, from Key Performance Indicators (KPIs) to infrastructure resource usage like CPU
- Traces: A grouped set of records describing a complete request or transaction flow end to end, from the user interface to the final code processing the request and back to the user. Each system or component involved adds its own data (a span) to the trace to generate a complete view of the process.
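As a rough illustration of how the three elements differ, the sketch below emits all of them from a single hypothetical `handle_registration` function using only the Python standard library. The in-memory `metrics` counter and `spans` list stand in for whatever backend you actually use; they are assumptions for the example, not a recommended design.

```python
import json
import time
import uuid
from collections import Counter
from contextlib import contextmanager

metrics = Counter()  # stand-in for a metrics backend
spans = []           # stand-in for a trace collector


def log_event(message, **fields):
    """Log: a time-stamped record of one event, written as a JSON line."""
    record = {"ts": time.time(), "message": message, **fields}
    print(json.dumps(record))
    return record


@contextmanager
def span(trace_id, name):
    """Trace: record how long one step of a request took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "duration_s": time.perf_counter() - start,
        })


def handle_registration(email):
    """Hypothetical request handler instrumented with all three elements."""
    trace_id = uuid.uuid4().hex
    with span(trace_id, "handle_registration"):
        with span(trace_id, "save_user"):
            pass  # the real work would happen here
        metrics["user_registrations"] += 1  # Metric: a value counted over time
        log_event("user registered", trace_id=trace_id,
                  email_domain=email.split("@")[-1])
    return trace_id
```

Sharing the same `trace_id` between the log record and the spans is what lets you jump from a single log line to the full picture of the request that produced it.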
It will be your job to define the right type of element and how to combine them to meet your observability needs. Here's a list with a few ideas you can use as a starting point:
- Infrastructure uptime and any other critical infrastructure value, e.g. concurrent connections, latencies or resource usage
- Request logs to track volumes and constraints, e.g. HTTP request logs or authentication request logs
- Key business transactions, e.g. number of sales or user registrations
- Business KPIs, e.g. conversion ratios or user session duration
- Full web request traces, e.g. to identify how much time each component contributes to the overall response time
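As an example of the last idea, the sketch below computes the share of the total response time contributed by each component of a trace. It assumes each span carries a hypothetical `component` name and a `self_ms` field holding the time spent in that component alone (excluding time spent waiting on other components); real tracing tools expose this data in their own formats.

```python
from collections import defaultdict


def time_share_by_component(spans):
    """Return the fraction of total time attributed to each component.

    Each span is a dict with "component" and "self_ms" keys (assumed shape).
    """
    totals = defaultdict(float)
    for s in spans:
        totals[s["component"]] += s["self_ms"]
    total = sum(totals.values())
    if total == 0:
        return {}  # nothing measured, nothing to attribute
    return {component: ms / total for component, ms in totals.items()}
```

Tracking these shares over time makes it easier to see which component is responsible when the overall response time degrades.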
# Is that all?
So you have your strategy defined and hopefully implemented... now what? Sadly, observability is a living practice that keeps changing and evolving. It will require minor adjustments here and there as you refine your monitoring needs, and some big revisions when features or business flows change.
One of the biggest drivers for change in this strategy is failing to react to an incident, and you had better assume this will happen no matter how much time and effort you've put into your observability plans. Something will be missing, overlooked or not accurate enough, but that's OK and an intrinsic part of our job as engineers. The insights into what happened will probably be crucial in defining new sets of logs, traces and metrics to prevent it from happening again.