Observability
We all want reliable production systems that can recover from problems fast. Observability is key to making it happen. When your systems are instrumented with monitoring, logging, and alerting tools, humans can see what’s going on, get notified if things go wrong, and take action.
Know the three pillars of observability
Monitoring tracks key metrics and data. Logging collects the details of events as they occur in your system. And DevOps teams set up alerting so they’re notified when something dodgy is happening and can jump in and investigate the potential issue. The difference between observability and plain monitoring may seem subtle, but it impacts how your instrumentation is designed and can help guide your choice of tools.
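To make the three pieces concrete, here is a minimal sketch using only the Python standard library. Everything in it is an illustrative assumption rather than a recommended setup: the `checkout` service name, the metric names, the `record_request` helper, and the 5% error-rate threshold are all hypothetical. A real system would use a dedicated metrics library and alerting service.

```python
import json
import logging
from collections import Counter

logger = logging.getLogger("checkout")  # hypothetical service name

# Monitoring: simple in-process counters (hypothetical metric names).
metrics = Counter()


def record_request(status_code: int, path: str) -> str:
    """Update the counters and emit one structured log event for a request."""
    metrics["requests_total"] += 1
    if status_code >= 500:
        metrics["errors_total"] += 1
    # Logging: one JSON object per event, so it can be searched later.
    event = json.dumps({"path": path, "status": status_code}, sort_keys=True)
    logger.info(event)
    return event


def error_rate() -> float:
    """Fraction of recorded requests that ended in a server error."""
    total = metrics["requests_total"]
    return metrics["errors_total"] / total if total else 0.0


def should_alert(threshold: float = 0.05) -> bool:
    """Alerting: signal when the error rate exceeds the assumed threshold."""
    return error_rate() > threshold
```

In this sketch the counters answer "how is the service doing overall?", the JSON events keep the per-request detail, and `should_alert` is the rule that decides when a human gets notified.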
Speed up the diagnosis of new problems
Observability is about focusing the design of your instrumentation on speeding up the diagnosis of new problems, instead of monitoring for things you’ve seen before. DevOps teams often fall into the trap of creating multiple dashboards that only show things that have already been instrumented.
You've probably seen a dashboard that shows memory usage and lists of errors, as well as performance data. This helps as a general overview of the health of your service, but during a call-out or investigation, engineers spend time browsing these dashboards for anomalies and only then go on to investigate by diving into logs. The result? Valuable minutes are lost. And, in order to create such a dashboard, specific instrumentation had to be built into your code and infrastructure ahead of time.
Choose tools that bring you built-in visibility
A design more focussed on observability would instead choose tools that provide more built-in visibility. The incident response workflow should go from receiving a call-out, to the triggering event, and then immediately to searching and correlating event logs from related systems relevant to the source of the call-out. The workflow focuses more on asking questions relevant to the situation than on browsing answers to questions you asked before.
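The "search and correlate" step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular tool works: the `LogEvent` fields, the shared `trace_id`, and the five-minute window are all hypothetical, and a real log store would run this as a query rather than over an in-memory list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class LogEvent:
    """One log event; the field names are assumptions for this sketch."""
    timestamp: datetime
    service: str
    trace_id: str
    message: str


def correlate(events, trigger, window=timedelta(minutes=5)):
    """Return events that share the trigger's trace_id and fall within
    `window` of the triggering event, ordered by time."""
    related = [
        e for e in events
        if e.trace_id == trigger.trace_id
        and abs(e.timestamp - trigger.timestamp) <= window
    ]
    return sorted(related, key=lambda e: e.timestamp)
```

Given the event that triggered the call-out, `correlate(all_events, trigger)` returns the related events from every service in time order, which is the question an engineer actually wants answered during an incident.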
To quickly diagnose issues, you need to be able to see interactions between processes and services easily. Service meshes are your friend here. And you need to store your telemetry so interactive searches can be super-fast. Tools like Prometheus do this for metrics, while logs need a search-oriented log store alongside it.
The takeaway? Designing for observability, rather than monitoring, helps reduce time to recovery during an incident. And it focuses instrumentation on finding the cause of a problem and fixing it.