Why Logging, OpenTelemetry, ELK, Grafana & Prometheus Matter in the Distributed World
I’ve been building backend systems since the early nineties. Back then, you had a server. You logged in, you looked at the logs, you ran a profiler. When something broke, you could usually point at one process, one machine, one clear failure mode. It wasn’t easy, but it was contained.
Today, a single user request might hop through five microservices, hit three databases, and call two external APIs—all in under a second. When something breaks, you’re not debugging one application anymore. You’re piecing together a story across an entire ecosystem. And if you don’t have the right tools, you’re doing it blind.
That’s what observability is. It’s not a buzzword. It’s the difference between “everything’s fine” and actually knowing what’s happening inside your system. After thirty years, I’ve learned that the teams that invest in observability early sleep better at night. The ones that don’t end up paying for it later, in 3 a.m. pages and war-room weekends.
Let me walk you through what I’ve seen work.
The Problem: You Can’t SSH Into a Request
In the old days, you had one server. Something went wrong? You’d SSH in, check the logs, maybe attach a debugger. The process was right there. The state was right there. Simple.
With microservices, that request is gone before you can blink. It touched Service A, which called Service B, which talked to a message queue, which triggered Service C. Where did it fail? Which service was slow? Was it a timeout? A bad config? A cascading failure? Good luck figuring that out with just a prayer and a tail -f.
I’ve sat in incident rooms where we had logs from six different services and no way to correlate them. We knew something was slow. We didn’t know what. We added logging. We added more logging. We still didn’t know. The request ID wasn’t propagated. The timestamps were in different timezones. By the time we’d pieced it together, the outage had been going on for two hours.
You need visibility. And that visibility comes from three places: logs, metrics, and traces. Each serves a different purpose. Together, they give you a picture you can’t get from any one of them alone.
Logs: Your First Line of Defense
Logs are the breadcrumbs. Every time something happens—a request comes in, a DB query runs, an error gets thrown—you write it down. In a monolithic world, that might be enough. In a distributed world, you need logs that are:
- Structured: JSON, not free-form text, so you can search and filter. “Show me all errors from service X in the last hour” should be a trivial query. With unstructured logs, it’s a grep nightmare.
- Correlated: the same request ID across every service. When a request goes from the API Gateway to the User Service to the Order Service, they all log the same trace ID or correlation ID. Now you can follow the journey. I’ve seen teams add this one change and cut debugging time in half.
- Centralized: all in one place. You’re not SSH-ing into 50 boxes. You’re not copying log files around. You’re querying.
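A minimal, stdlib-only sketch of the first two properties: every log line is one JSON object, and every line carries a correlation ID. The field names here are my own convention for illustration, not a schema any of these tools mandate.

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""
    def format(self, record):
        entry = {
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)

logger = logging.getLogger("order-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Mint one correlation ID at the edge, then pass it to every service
# the request touches, so their log lines can be joined later.
cid = str(uuid.uuid4())
logger.info("payment authorized",
            extra={"service": "order-service", "correlation_id": cid})
```

Once lines like these land in Elasticsearch, the “all errors from service X in the last hour” query really is trivial: filter on `level` and `service`, then pivot on `correlation_id`.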
That’s where the ELK Stack shines. Elasticsearch stores and indexes your logs. Logstash (or Beats) ingests and processes them. Kibana lets you search, visualize, and build dashboards. It’s the industry standard for a reason. Once you’ve got your logs flowing into Elasticsearch with a consistent structure, you can find anything. The Elasticsearch documentation and Kibana guide are solid starting points if you’re new to the stack.
One lesson from the trenches: log at the right level. Too verbose, and you drown in noise. Too sparse, and you miss the critical moment. I aim for INFO in normal operation, DEBUG when I’m actively debugging, and structured ERROR with enough context to reproduce. And always include that correlation ID.
Metrics: The Numbers That Tell a Story
Logs tell you what happened. Metrics tell you how much and how fast.
How many requests per second? What’s the error rate? What’s the p95 latency? What’s the p99? These numbers let you spot trends, set alerts, and catch problems before users do. I’ve seen teams discover a gradual memory leak because their heap usage metric was slowly climbing. I’ve seen others catch a deployment that increased latency by 40% because they had a dashboard and looked at it.
Prometheus is the go-to for metrics in the cloud-native world. It scrapes your services on an interval, stores time-series data, and gives you PromQL—a query language that’s surprisingly powerful once you get the hang of it. Counters, gauges, histograms—it’s all there. The Prometheus data model and metric types docs explain the fundamentals clearly.
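To build intuition for the histogram type in particular, here is a stdlib-only sketch of how cumulative buckets record latencies and how a quantile like p95 is estimated from them, with linear interpolation inside the crossing bucket, roughly the way PromQL’s `histogram_quantile` does it. The bucket bounds are an illustrative subset, not Prometheus’s actual defaults.

```python
import bisect

# Bucket upper bounds in seconds. Real setups tune these per endpoint;
# a quantile estimate is only as good as the bucket layout.
BOUNDS = [0.05, 0.1, 0.25, 0.5, 1.0, 2.5, float("inf")]

def observe(counts, seconds):
    """Record one latency in the first bucket that can hold it."""
    counts[bisect.bisect_left(BOUNDS, seconds)] += 1

def quantile(counts, q):
    """Find the bucket where the cumulative count crosses q * total,
    then interpolate linearly inside that bucket."""
    total = sum(counts)
    rank = q * total
    cum = 0
    for i, c in enumerate(counts):
        if c and cum + c >= rank:
            lo = BOUNDS[i - 1] if i > 0 else 0.0
            hi = BOUNDS[i]
            if hi == float("inf"):
                return lo  # can't interpolate into an unbounded bucket
            return lo + (hi - lo) * ((rank - cum) / c)
        cum += c
    return BOUNDS[-2]

counts = [0] * len(BOUNDS)
for s in [0.03, 0.07, 0.07, 0.2, 0.2, 0.2, 0.4, 0.4, 0.9, 2.0]:
    observe(counts, s)
p95 = quantile(counts, 0.95)  # estimated, not exact: buckets lose precision
```

The takeaway: a histogram is cheap to scrape (a handful of counters) and aggregates across instances, but the percentile you read off a dashboard is an estimate shaped by your bucket boundaries.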
Prometheus is also the de facto metrics backend in the Kubernetes world. If you’re running containers, you’re probably already in Prometheus territory. The ecosystem—Grafana for dashboards, Alertmanager for alerting—is mature. You’re not building from scratch.
A word on what to measure: start with the golden signals. Latency, traffic, errors, saturation. For a web API, that might mean request duration, request count, error count, and CPU or memory utilization. Add more as you learn what matters for your system. But don’t measure everything. I’ve seen teams export hundreds of metrics and never look at most of them. Focus on what you’ll actually use.
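As a concrete starting point, the four golden signals map onto PromQL along these lines. The metric names follow common Prometheus client-library conventions (`http_requests_total`, `http_request_duration_seconds_bucket`), and the `status` label is an assumption about how your service is instrumented.

```promql
# Traffic: requests per second, averaged over the last 5 minutes
sum(rate(http_requests_total[5m]))

# Errors: fraction of requests returning 5xx
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Latency: p95 request duration, estimated from histogram buckets
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Saturation: resident memory per process (a gauge)
process_resident_memory_bytes
```

Four queries like these, on one dashboard, cover most of “is my system healthy?” for a typical web API.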
Traces: Following a Request Across the Map
This is where it gets interesting. A trace is the path a request took through your system. One trace ID, multiple spans—each span is one operation in one service. You can see the full journey: “Request hit API Gateway → User Service → Order Service → Payment Service → Database.” You can see which span was slow, which one failed, and how they’re connected.
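The structure is worth making concrete. Below is a toy, in-memory model of a trace: one trace ID, several spans, each span timing one operation and pointing at its parent. The field names are illustrative, not the OpenTelemetry wire format.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """One timed operation in one service, linked to its parent."""
    name: str
    start_ms: float
    end_ms: float
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms

# An 8-second request: gateway -> order service -> one database query.
trace_id = uuid.uuid4().hex
gateway = Span("api-gateway", 0, 8000, trace_id)
orders = Span("order-service", 10, 7990, trace_id, parent_id=gateway.span_id)
query = Span("db.query", 200, 7700, trace_id, parent_id=orders.span_id)

# The slowest *leaf* span is where the request actually waited:
# parent spans include their children's time, so look at the leaves.
spans = [gateway, orders, query]
parents = {s.parent_id for s in spans}
leaves = [s for s in spans if s.span_id not in parents]
slowest = max(leaves, key=lambda s: s.duration_ms)
```

That leaf-span observation is exactly the 8-second story below: the trace made the one slow database query impossible to miss.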
The first time I used distributed tracing, it felt like someone had turned on the lights. We had a request that was taking 8 seconds. We had no idea why. We turned on tracing. The trace showed 7.5 seconds in a single database query we’d never suspected. One query. We’d been guessing for days.
OpenTelemetry is the open standard for this. It’s vendor-neutral, supported by all the big players, and it gives you a single way to instrument your apps. Traces, metrics, and logs—all from one SDK. The observability primer is a great read if you want to understand the “why” behind it. For traces specifically, check out the traces documentation.
The beauty of OpenTelemetry is that you instrument once and can send data to Prometheus, Grafana, Datadog, Jaeger, or whoever. No lock-in. The industry has converged on this, and I think that’s a good thing. We spent too many years with proprietary agents and vendor-specific formats.
Grafana: Where It All Comes Together
You’ve got logs in ELK, metrics in Prometheus, traces in Jaeger or Tempo. Now you need to see them. Grafana is the visualization layer. Dashboards, alerts, correlation—you can pull in data from Prometheus, Elasticsearch, Loki (Grafana’s own log aggregator, lighter-weight than ELK for some use cases), and trace backends like Tempo or Jaeger.
Their observability overview explains the big picture well. Grafana doesn’t replace your data stores; it sits on top and makes them useful. A good dashboard answers the question “is my system healthy?” at a glance. A great one lets you drill from “something’s wrong” to “here’s the exact trace” in a few clicks.
How They Work Together
In practice, you’ll use them in combination. Here’s the flow I’ve seen work:
- Prometheus scrapes metrics. You set alerts for when the error rate spikes or latency degrades. The alert fires at 2 a.m.
- ELK (or Loki) has the logs. You search by time range and service. You find the error. You see the stack trace. You see the request ID.
- OpenTelemetry produced the traces. You take that request ID and look up the trace. You see the full path. You see which span timed out. You see it was a database call in the Order Service. Now you know where to look.
- Grafana ties it together. One dashboard with metrics, log snippets, and trace links. You don’t context-switch between five tools. You follow the thread.
It sounds like a lot. And it is, at first. But once it’s in place, debugging a production incident goes from “we have no idea” to “here’s the exact span that timed out, here are the logs from that moment, and here’s the metric that spiked.” The difference in mean time to resolution is dramatic. I’ve seen it cut from hours to minutes.
Start Small
You don’t need everything on day one. I’d start with structured logging and a correlation ID. That alone will help. Add Prometheus metrics for your critical paths—the endpoints that matter, the dependencies that can fail. Introduce tracing when you have more than a couple of services talking to each other. Each piece makes the next one more valuable.
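If that first step is structured logging plus a correlation ID, here is roughly what it looks like at the edge, sketched as a stdlib WSGI middleware. The `X-Correlation-ID` header name is a common convention I’m assuming, not a standard, and real frameworks have their own hooks for this.

```python
import uuid

class CorrelationIdMiddleware:
    """Accept an inbound correlation ID or mint one, expose it to the
    app for logging, and echo it back to the caller."""
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        cid = environ.get("HTTP_X_CORRELATION_ID") or str(uuid.uuid4())
        environ["correlation.id"] = cid  # downstream handlers log this

        def start_with_cid(status, headers, exc_info=None):
            return start_response(
                status, headers + [("X-Correlation-ID", cid)], exc_info)

        return self.app(environ, start_with_cid)

def app(environ, start_response):
    # A stand-in handler that just reflects the ID it was given.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [environ["correlation.id"].encode()]

wrapped = CorrelationIdMiddleware(app)
```

Forward that same header on every outbound call, log it on every line, and you’ve bought the correlation property before touching ELK, Prometheus, or tracing at all.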
The distributed world is complex. These tools are how you make it manageable. They’re not optional anymore. They’re the difference between a team that can operate with confidence and one that’s always one incident away from chaos.
Invest early. Your future self will thank you.