System Observability: A Production Engineer's Guide

It is 2 a.m. Your p99 on a critical endpoint just tripled, support is lighting up, and your alerts are quiet. Dashboards look green, CPUs are fine, error rates are flat. The system is breaking and your monitoring has nothing useful to say. This is where system observability earns its place: it would have shown the causal path before the pages started firing.

Monitoring tells you that something is wrong by checking a few known symptoms. Observability lets you ask why from any angle, using the data your system already emits, without predicting the failure mode in advance. That gap is what this guide addresses.

At imlucas.dev, we write from the pager, not the ivory tower. This guide covers the core signals to collect, how to instrument from day one, and the practical telemetry pipeline that takes you from raw data to lower MTTR and higher reliability.

1. System observability vs. monitoring: why the distinction matters

The failure modes traditional monitoring misses

Monitoring excels at known failure patterns: CPU thresholds, disk alerts, uptime checks, and static rules on dashboards. That model works for simple, predictable stacks. In distributed systems with microservices, Kubernetes, and async queues, many outages show up as novel interactions across components that no one wrote a rule for. Static checks have no answer for that.

What APM gets right (and where it stops)

Application Performance Monitoring gives you application-layer metrics like request latency, error rate, and throughput, an improvement over host metrics alone. It often stops at the service boundary, though. When the slowdown comes from a downstream queue that only backs up under a rare traffic mix, APM without distributed tracing points to the symptom, not the root cause. For a concise comparison of these approaches, see APM vs. Observability.

System observability defined plainly

Think of observability as a property, not a product. A system is observable if you can infer its internal state from its external outputs. The practical move is to design for rich, structured telemetry so you can answer new questions after the fact, instead of guessing every dashboard and alert ahead of time.

2. The three telemetry signals and what each one tells you

Logs: the event record

Logs are discrete, timestamped records of what happened: an exception thrown, a request received, a retry attempted. They carry rich context and are the most granular signal you have. Structured logs win at production scale because key-value fields make them filterable and joinable across queries, while unstructured text turns into expensive noise.
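To make "structured" concrete, here is a minimal sketch using only Python's standard logging module. The logger name, the field names, and the convention of passing fields through extra={"fields": {...}} are assumptions made for illustration, not a standard.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Assumed convention: callers pass extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str = "checkout") -> logging.Logger:
    """Return a logger that writes one JSON line per event to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info("retry attempted", extra={"fields": {"attempt": 2, "upstream": "payments"}})
```

Every field becomes a key you can filter or group on in the log backend, which is the property free-text lines lack.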

Metrics: the system’s vital signs

Metrics are numeric measurements over time like request rate, error percentage, p99 latency, CPU utilization, and queue depth. They are cheap to retain, fast to query, and ideal for dashboards and alerting. For services, use the RED method (Rate, Errors, Duration), a pattern popularized by Weaveworks and widely documented in monitoring best-practice literature. Let metrics tell you that something is off before you go hunting for details.
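As a rough illustration of what RED boils down to, the sketch below computes the three numbers from a window of request records. The Request type, the field names, and the nearest-rank percentile are assumptions chosen for brevity; a real metrics backend would typically derive these from counters and histograms instead.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Request:
    duration_ms: float
    is_error: bool


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    if not values:
        raise ValueError("percentile of an empty sequence is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(requests: Sequence[Request], window_seconds: float) -> Dict[str, float]:
    """Rate, Errors, Duration for one service over one time window."""
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p99_ms": 0.0}
    errors = sum(1 for r in requests if r.is_error)
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p99_ms": percentile([r.duration_ms for r in requests], 99),
    }
```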

Traces: following a request through the system

Distributed tracing follows a single request through services as a set of spans, each with timing and metadata. Traces make microservices debuggable because they show where time was spent along the path. They are the highest-value signal for root cause analysis in distributed architectures, and they only work if you instrument consistently and propagate context.
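One way to see how spans show where time was spent is to compute each span's self time, meaning its duration minus the time covered by its direct children. The sketch below uses a hypothetical Span type and assumes sibling spans run sequentially; real traces with parallel children need interval merging instead of a plain sum.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def self_times(spans: Sequence[Span]) -> Dict[str, float]:
    """Map span_id to duration minus the summed duration of direct children."""
    child_total: Dict[str, float] = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = child_total.get(span.parent_id, 0.0) + span.duration_ms
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0.0) for s in spans}


def hottest_span(spans: Sequence[Span]) -> Span:
    """Return the span with the largest self time."""
    times = self_times(spans)
    return max(spans, key=lambda s: times[s.span_id])


if __name__ == "__main__":
    trace = [
        Span("a", None, "GET /checkout", 0, 500),
        Span("b", "a", "auth", 10, 60),
        Span("c", "a", "reserve-inventory", 60, 460),
        Span("d", "c", "db.query", 100, 150),
    ]
    print(hottest_span(trace).name)
```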

How the three signals fit together

Metrics surface the anomaly, traces isolate the slow path, and logs explain the why with concrete context. Correlate all three with a shared trace ID in your logs so you can jump from a red dashboard to the exact trace, then to the specific error line in seconds. Teams that build this cross-signal correlation into their incident workflow consistently report meaningful MTTR reductions; the exact gains depend on your stack and process maturity, but the direction is reliable. For a complementary perspective on pillars of observability, see Three Pillars of Observability.
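A minimal sketch of putting a shared trace ID on every log line, using only the standard library. The ContextVar is a stand-in: in an OpenTelemetry setup the ID would come from the active span context, and the logger name and log format here are assumptions.

```python
import contextvars
import logging

# Hypothetical stand-in for "the trace ID of the request being handled".
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every record that passes through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


def build_logger(name: str = "orders") -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.addFilter(TraceIdFilter())
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s trace_id=%(trace_id)s %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    token = current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
    try:
        log.error("payment declined")
    finally:
        current_trace_id.reset(token)
```

With the ID on the line, the jump from a trace view to the matching logs is a single equality filter on trace_id.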

3. How to instrument your services from day one

Start with OpenTelemetry, not a vendor SDK

OpenTelemetry is the open standard for instrumentation across languages like Java, Python, Go, and Node.js. It auto-instruments popular frameworks and clients, and the Collector handles batching, retries, and routing to any backend. Adopt it early and uniformly to avoid re-instrumenting if you later switch your observability platform. Following instrumentation best practices from the start (consistent span naming, meaningful attributes, and semantic conventions) pays dividends the first time you need to debug a production incident you did not anticipate. If you need deeper architecture context alongside instrumentation, see System Design: The Complete Engineer’s Guide, imlucas.dev.

Prioritize where you instrument first

Do not boil the ocean. Start at your API gateway or ingress to capture every external request and establish a root span for end-to-end traces. Then cover your highest-volume and most business-critical services, followed by any component that pages people often, and then the message queues and async pipelines where latent bugs hide. This order yields outsized coverage with minimal initial effort. For patterns on scaling services and handling high traffic, see How to Design Systems That Handle Millions of Users, imlucas.dev.

Context propagation and sampling in practice

Propagation is what connects spans across services. Forward W3C Trace Context headers, traceparent and tracestate, on every downstream call, including queue messages. Skip one hop and your trace graph fragments. (The W3C Trace Context spec and OpenTelemetry’s propagation documentation cover the mechanics for both synchronous and async transports in detail.)
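To show what forwarding the headers involves, here is a small sketch that parses an incoming traceparent and builds the headers for a downstream call. It handles only the version 00 layout (version, 32-hex trace ID, 16-hex parent ID, 2-hex flags) and assumes lower-cased header names; the function names are hypothetical, and in practice an OpenTelemetry propagator does this for you.

```python
import re
import secrets
from typing import Dict, NamedTuple, Optional

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    flags: str


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid per the W3C Trace Context spec.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, flags)


def outgoing_headers(incoming: Dict[str, str]) -> Dict[str, str]:
    """Build headers for a downstream call, keeping the same trace ID."""
    ctx = parse_traceparent(incoming.get("traceparent", ""))
    span_id = secrets.token_hex(8)  # this hop's own span ID
    if ctx is None:
        # No usable parent: start a new trace (marked sampled, an assumption)
        # and do not forward tracestate from a broken context.
        return {"traceparent": f"00-{secrets.token_hex(16)}-{span_id}-01"}
    headers = {"traceparent": f"00-{ctx.trace_id}-{span_id}-{ctx.flags}"}
    if "tracestate" in incoming:
        headers["tracestate"] = incoming["tracestate"]
    return headers
```

For queue messages the same two values travel as message attributes or headers instead of HTTP headers; the parsing logic does not change.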

Tracing every request is rarely affordable, so use tail-based sampling at the collector to retain error traces and slow traces at a high capture rate while dropping low-value happy paths. Because the keep-or-drop decision is made only after a trace completes, the collector has to buffer spans until then, every span of a trace has to reach the same collector instance (sticky routing by trace ID), and the collectors need enough memory to hold the in-flight window. The OpenTelemetry documentation on tail-based sampling walks through the trade-offs. Pair this with conservative head sampling in high-traffic edge services to control overhead.
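The decision logic behind tail-based sampling can be sketched in a few lines. This is not the OpenTelemetry Collector's implementation, only an illustration of the policy: the TraceSummary type, the 1000 ms slow threshold, and the 5 percent baseline are assumptions.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceSummary:
    trace_id: str
    duration_ms: float
    has_error: bool


def keep_trace(trace: TraceSummary, slow_ms: float = 1000.0, baseline_percent: int = 5) -> bool:
    """Decide, after the trace has completed, whether to store it."""
    # Always keep the traces you will want during an incident.
    if trace.has_error or trace.duration_ms >= slow_ms:
        return True
    # Keep a small, deterministic share of healthy traces as a baseline.
    # Hashing the trace ID means every collector replica agrees on the result.
    bucket = zlib.crc32(trace.trace_id.encode("ascii")) % 100
    return bucket < baseline_percent
```

The function only works on complete traces, which is why the buffering and routing requirements exist in the first place.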

4. Building a telemetry pipeline that actually scales

The tiered storage model for cost control

Keeping all telemetry hot is a tax you will pay forever. A tiered model balances speed and cost by matching retention and performance to how the data is used. Start simple, then add tiers as volume grows. The windows below are common starting points; your compliance requirements and query patterns will push them in either direction.

Hot: 24 to 48 hours on fast SSD-backed stores for real-time dashboards and active incident response.
Warm: 7 to 30 days on a columnar store like ClickHouse for trend analysis and postmortems.
Cold: 30+ days on object storage like S3 for compliance, audits, and long-range capacity planning.
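The tiers above amount to a lookup on data age. A minimal sketch, assuming the upper end of each window (48 hours for hot, 30 days for warm) as the cutoffs; both constants are starting points to tune, not recommendations.

```python
from datetime import timedelta

# Assumed cutoffs taken from the upper end of the windows above.
HOT_MAX_AGE = timedelta(hours=48)
WARM_MAX_AGE = timedelta(days=30)


def tier_for(age: timedelta) -> str:
    """Return the storage tier for telemetry of the given age."""
    if age < timedelta(0):
        raise ValueError("age must not be negative")
    if age <= HOT_MAX_AGE:
        return "hot"
    if age <= WARM_MAX_AGE:
        return "warm"
    return "cold"
```

A lifecycle job could run this over partitions on a schedule and move anything whose tier has changed.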

For practical guides on storing OpenTelemetry collector data in columnar backends, see resources from ClickHouse that outline common patterns and trade-offs: Best resources for storing OpenTelemetry Collector data.

Open-source vs. commercial tooling

The open-source stack is proven: OpenTelemetry Collector for ingestion and shaping, Prometheus for metrics, Loki or ELK for logs, Jaeger or Grafana Tempo for traces, and ClickHouse as a unified analytic backend. Commercial platforms offer faster setup with built-in cross-signal correlation and workflow polish. The decision is about operational capacity and budget tolerance, not ideology, and the collection layer should stay OpenTelemetry so you can switch backends without touching application code.

Controlling telemetry volume before it controls your budget

The cheapest byte is the one you never ingest. Filter low-value logs at the collector (health checks, routine polling, chatty debug lines), and you can cut ingest volume significantly with no meaningful loss of signal. The reduction varies widely depending on application noisiness and logging verbosity, but teams with verbose debug logging commonly see 50 to 90 percent drops after a targeted cleanup pass.
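As an illustration of the kind of rule involved, the sketch below drops debug lines and health-check traffic but never drops warnings or errors. The paths, level names, and record shape are hypothetical; in a real pipeline this logic would live in collector configuration and is written here as a plain function only to make the rule explicit.

```python
from typing import Dict, Iterable, List

# Hypothetical noise rules; tune them to your own services.
DROP_PATHS = frozenset({"/healthz", "/readyz", "/metrics"})
DROP_LEVELS = frozenset({"DEBUG", "TRACE"})
ALWAYS_KEEP_LEVELS = frozenset({"WARNING", "ERROR", "CRITICAL"})


def should_ingest(record: Dict[str, str]) -> bool:
    """Return False for records that match a known low-value pattern."""
    level = record.get("level", "").upper()
    if level in ALWAYS_KEEP_LEVELS:
        return True  # a failing health check is signal, not noise
    if level in DROP_LEVELS:
        return False
    if record.get("path") in DROP_PATHS:
        return False
    return True


def filter_batch(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the records worth paying to ingest."""
    return [r for r in records if should_ingest(r)]
```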

Beyond log filtering: cap metric cardinality, apply adaptive and tail-based sampling for traces, compress aggressively, and set lifecycle policies that move data to colder tiers automatically. Reduce volume at the edge, not after you have already paid to store it.
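Capping metric cardinality usually means bounding how many distinct values a label may take. One simple approach, sketched here with a hypothetical LabelLimiter class, admits the first N values seen per label and folds everything after that into "other". It is a first-come policy, so it is an assumption that early values are the ones worth keeping.

```python
from typing import Dict, Set


class LabelLimiter:
    """Allow at most max_values distinct values per label name."""

    def __init__(self, max_values: int) -> None:
        if max_values < 0:
            raise ValueError("max_values must not be negative")
        self.max_values = max_values
        self._seen: Dict[str, Set[str]] = {}

    def clamp(self, label: str, value: str) -> str:
        """Return the value unchanged if admitted, otherwise 'other'."""
        seen = self._seen.setdefault(label, set())
        if value in seen:
            return value
        if len(seen) < self.max_values:
            seen.add(value)
            return value
        return "other"
```

Applied to a label like a raw URL path or user ID, this keeps the number of time series bounded no matter what traffic arrives.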

5. System observability in production: from raw telemetry to faster incident resolution

An investigation workflow that actually works

Incidents resolve faster when you narrow the search space step by step. Start wide with aggregate signals, then follow a single request to the failing line of code. Make this process your team’s default and bake it into runbooks.

Scope with metrics: which service, which endpoint, which time window, and how bad based on SLOs.
Pivot to traces: find slow or error traces on the affected path and inspect span timings and attributes.
Correlate logs: jump from the trace ID to logs for the exact stack trace, config change, or payload involved.
Mitigate and verify: roll back, feature flag, or scale, then watch metrics and traces confirm recovery.
Capture learning: tag the trace and relevant logs, then link them into the postmortem for future searchability.

With your pipeline ready and this workflow in place, the next step is wiring alerting to reliability signals your users actually feel, which is where SLOs come in.

How to shift alerts to SLOs

Thresholds on raw metrics either spam you or miss user pain. SLOs align alerting with the reliability users feel, which is the only frame that matters. Tie every page to error budget burn so you respond to user impact, not server trivia.

Define service-level objectives, like 99.5 percent of requests under 500 ms and error rate under 0.5 percent.
Track error budget and alert on burn rate, for example, paging when you spend a day’s budget in an hour.
Route pages to owners with clear runbooks mapped to traces and logs, not just to a noisy CPU alarm.
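Burn rate is the observed error ratio divided by the error budget, so a burn rate of 1 spends the budget exactly over the SLO period, and spending a day's budget in an hour corresponds to a burn rate of about 24. A minimal sketch, assuming a single evaluation window and the 99.5 percent target from the example above; production setups commonly combine a long and a short window to reduce flapping.

```python
def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the allowed error ratio."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (bad / total) / error_budget


def should_page(bad: int, total: int, slo_target: float = 0.995, threshold: float = 24.0) -> bool:
    """Page when the budget is burning at or above the threshold rate."""
    return burn_rate(bad, total, slo_target) >= threshold


if __name__ == "__main__":
    # 150 failed out of 1000 requests in the last hour against a 99.5% SLO.
    print(round(burn_rate(150, 1000, 0.995), 1), should_page(150, 1000))
```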

Make the workflow muscle memory. Drill on recent incidents by replaying metrics to traces to logs, then measuring how long each hop takes. The outcome to watch is MTTR trending down without alert fatigue trending up, the hallmark of effective observability engineering.

Conclusion

System observability is not a bolt-on feature; it is a property you engineer from the start. Logs, metrics, and traces only deliver when they are correlated end to end, stored with a sane tiered strategy, and wired into alerting that reflects SLOs instead of arbitrary thresholds.

The practical path is clear: adopt OpenTelemetry early, follow instrumentation best practices on the critical paths first, propagate context everywhere, and shape data at the edge to keep costs predictable. Anchor your incident workflow on metrics to traces to logs and you will see MTTR move in the right direction.

The next articles in this series go deeper on pipeline architecture, sampling strategies, and postmortem templates, built from the same production incidents this guide draws from. Read more on the blog, including Hello World: Why I Started This Blog, imlucas.dev.