Prometheus: How I Actually Started Understanding My Systems

Before I understood Prometheus, I thought of monitoring as something that collected logs and surfaced them somewhere. You store events, you search them, you find the problem. That mental model is not wrong exactly, but it leads you to build the wrong things.

Prometheus is not a log system. It does not care about individual events. It cares about the current state of your system, sampled over time. That distinction sounds subtle, but it completely changes how you think about observability.

The Pull Model

Most monitoring tools I had used before were push-based: your application sends data to some collector. Prometheus flips this. It pulls. On a fixed schedule (the scrape interval, commonly somewhere between 15 seconds and a minute), Prometheus makes an HTTP request to your application's /metrics endpoint and reads whatever is there.

The first time I saw this I thought it was backwards. Why would you want the monitoring system to be in charge of timing? But it makes sense once you think about it. If an application stops responding, Prometheus notices at the next scrape because the scrape fails, and the up metric it records for that target drops to 0. With push-based systems, you have to separately monitor whether the agent is still sending. With Prometheus, silence is itself a signal.

It also means your application does not need to know anything about where it is being monitored from. You expose /metrics, and whatever is scraping it is someone else's problem. Clean separation.
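To make the pull model concrete, here is a minimal sketch using only the Python standard library. It is not how Prometheus or any client library is implemented; the metric name demo_requests_total, the hardcoded value, and the scrape() helper are all made up for illustration. The point is only that the application serves plain text on /metrics, and the scraper learns something even when the request fails.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process state; a real app would update this as it does work.
REQUESTS_TOTAL = 42


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current state as text on /metrics and nothing else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # demo_requests_total is a made-up metric name for this sketch.
        body = (
            "# TYPE demo_requests_total counter\n"
            f"demo_requests_total {REQUESTS_TOTAL}\n"
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def scrape(url, timeout=2.0):
    """Return (up, body). up is 0 when the target cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 1, resp.read().decode("utf-8")
    except OSError:
        # Connection refused, timeout, HTTP error: the failure is the signal.
        return 0, ""


def demo():
    """Scrape a live target, stop it, then scrape again."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0: any free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{port}/metrics"

    alive = scrape(url)
    server.shutdown()
    server.server_close()
    dead = scrape(url)
    return alive, dead
```

Calling demo() returns one successful scrape and one failed one. Notice that the handler knows nothing about who is scraping it, and the scraper needs no cooperation from the application to detect that it has gone away.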

The Data Model

Every time series in Prometheus is identified by a metric name and a set of labels. Labels are key-value pairs that let you slice the same metric in different ways.

Take HTTP request duration. You could have one metric called http_request_duration_seconds with labels for method, route, and status_code. That single metric, with the right labels, lets you answer: how long do GET requests take, how long do POST requests to /orders take, how long do requests that return 500 take. All from one thing.

I got this wrong early on. I was creating separate metrics for each route, which made the PromQL queries messy and the Grafana dashboards brittle. Labels are the right tool. One metric with good labels beats ten metrics with hardcoded names.
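Here is a toy illustration of why one labeled metric beats many hardcoded ones. The LabeledCounter class is hypothetical and only mimics the idea of "one name, one series per label set"; real client libraries have their own APIs. I use a request counter rather than the duration histogram to keep it short.

```python
from collections import defaultdict


class LabeledCounter:
    """Toy counter: one metric name, one series per distinct label set."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}")
        key = tuple(labels[n] for n in self.label_names)
        self._values[key] += amount

    def select(self, **matchers):
        """Sum every series whose labels equal the given matchers."""
        total = 0.0
        for key, value in self._values.items():
            series = dict(zip(self.label_names, key))
            if all(series[k] == v for k, v in matchers.items()):
                total += value
        return total

    def render(self):
        """Render the series in the text format a /metrics endpoint serves."""
        lines = [f"# TYPE {self.name} counter"]
        for key in sorted(self._values):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {self._values[key]:g}")
        return "\n".join(lines) + "\n"


# Label names follow the ones used in the text; the routes are made up.
http_requests = LabeledCounter("http_requests_total", ["method", "route", "status_code"])
http_requests.inc(method="GET", route="/orders", status_code="200")
http_requests.inc(method="GET", route="/orders", status_code="200")
http_requests.inc(method="POST", route="/orders", status_code="500")
http_requests.inc(method="GET", route="/health", status_code="200")

get_total = http_requests.select(method="GET")
orders_total = http_requests.select(route="/orders")
errors_total = http_requests.select(status_code="500")
```

The same four increments answer three different questions without any new metric being defined. With a metric per route, each of those questions would need its own list of metric names.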

Counters, Gauges, Histograms

Prometheus has four metric types: counters, gauges, histograms, and summaries. I use the first three regularly.

Counters only go up. Total requests processed, total errors, total bytes sent. They reset when the process restarts, which is fine because you query them as rates, not absolute values. rate(http_requests_total[5m]) gives you the average requests per second over the last five minutes, and it compensates for counter resets, so a restart does not show up as a sudden drop.
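A simplified sketch of what a rate over a counter has to do may help. This is not Prometheus's actual rate() implementation, which also extrapolates to the edges of the window; simple_rate and the sample data are made up. It only shows the reset handling: when the value drops, assume the counter restarted from zero and keep adding.

```python
def simple_rate(samples):
    """Per-second increase across samples, tolerating counter resets.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns None when there are fewer than two samples.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:
            # The counter went down, so the process restarted.
            # Everything counted since the restart is new increase.
            increase += value
        else:
            increase += value - prev
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# One scrape per minute for five minutes; the process restarts before t=180.
samples = [(0, 100), (60, 160), (120, 220), (180, 10), (240, 70), (300, 130)]

reset_aware = simple_rate(samples)                    # 250 / 300
naive = (samples[-1][1] - samples[0][1]) / 300        # 30 / 300, badly wrong
```

The naive subtraction reports a tenth of a request per second because the restart wiped out most of the visible increase. The reset-aware version recovers roughly the true rate, which is why the raw counter value is rarely what you want to graph.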

Gauges go up and down. Current memory usage, number of active connections, queue depth. You query these as-is because the current value is what matters.

Histograms are the most useful and the most misunderstood. They count observations into buckets, so you get the distribution of values, not just a total. For latency this is critical. An average tells you very little. A histogram lets you estimate the 95th or 99th percentile with histogram_quantile(), which is where real user pain lives. The http_request_duration_seconds metric that prometheus-net creates for you automatically is a histogram. Treat it accordingly.
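To see why the average hides what the percentiles show, here is a toy histogram with cumulative buckets and a quantile estimate by linear interpolation inside the bucket. It is similar in spirit to what histogram_quantile() does, but it is a sketch, not the real algorithm; ToyHistogram, the bucket bounds, and the latencies are all invented for the example.

```python
import bisect
import math


class ToyHistogram:
    """Counts observations into buckets defined by upper bounds."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.count = 0
        self.sum = 0.0

    def observe(self, value):
        # First bucket whose upper bound is >= value.
        idx = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[idx] += 1
        self.count += 1
        self.sum += value

    def cumulative(self):
        """(upper_bound, observations <= upper_bound) pairs."""
        total, out = 0, []
        for bound, n in zip(self.upper_bounds, self.bucket_counts):
            total += n
            out.append((bound, total))
        return out

    def quantile(self, q):
        """Estimate the q-quantile, assuming values are spread evenly in a bucket."""
        if self.count == 0:
            return math.nan
        rank = q * self.count
        lower_bound, lower_count = 0.0, 0
        for bound, cum in self.cumulative():
            if cum >= rank:
                in_bucket = cum - lower_count
                if math.isinf(bound) or in_bucket == 0:
                    # Beyond the last finite bound we can only report that bound.
                    return lower_bound
                fraction = (rank - lower_count) / in_bucket
                return lower_bound + (bound - lower_bound) * fraction
            lower_bound, lower_count = bound, cum
        return lower_bound


latency = ToyHistogram([0.1, 0.25, 0.5, 1.0])
# Made-up traffic: mostly fast, a slow tail, one very slow request.
for value, times in [(0.05, 90), (0.2, 5), (0.4, 4), (3.0, 1)]:
    for _ in range(times):
        latency.observe(value)

mean = latency.sum / latency.count
p95 = latency.quantile(0.95)
p99 = latency.quantile(0.99)
```

With this data the mean comes out around 0.1 seconds while the estimated 99th percentile is around 0.5 seconds. The estimate is only as precise as the bucket boundaries allow, which is why the choice of buckets matters more than people expect.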

PromQL Is Worth Learning Properly

PromQL is Prometheus's query language and it is genuinely different from SQL. The learning curve is real. But it is worth the investment because once you understand it, you can answer almost any question about your system directly from the data you are already collecting.

The things that tripped me up early: you almost always want rate() on counters, not the raw value. The time range in square brackets ([5m]) is a lookback window, not a grouping. And the by clause in aggregations works like GROUP BY in SQL: you list the labels you want to keep, and everything else is collapsed. Its inverse, without, lists the labels you want to collapse and keeps the rest.
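If the aggregation semantics still feel slippery, it can help to see them as plain data manipulation. The functions below are hypothetical stand-ins that mimic the shape of sum by (...) and sum without (...) over a handful of series; they are not PromQL and the sample values are invented.

```python
from collections import defaultdict


def sum_by(series, keep):
    """Mimic `sum by (keep...)`: group on the listed labels, drop the rest."""
    out = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((name, labels.get(name, "")) for name in keep))
        out[key] += value
    return dict(out)


def sum_without(series, drop):
    """Mimic `sum without (drop...)`: drop the listed labels, group on the rest."""
    out = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        out[key] += value
    return dict(out)


# Pretend these are per-second request rates, one entry per series.
series = [
    ({"method": "GET", "route": "/orders"}, 4.0),
    ({"method": "GET", "route": "/health"}, 1.0),
    ({"method": "POST", "route": "/orders"}, 0.5),
]

by_method = sum_by(series, ["method"])
without_route = sum_without(series, ["route"])
```

Both calls produce the same two groups here, one per method, because the series only have two labels. On real data with many labels the two forms diverge, and by is usually the safer one since new labels cannot silently change the grouping.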

Spend an afternoon in the Prometheus expression browser just writing queries against your own data. It clicks faster than reading documentation.

Where Grafana Comes In

Prometheus has its own UI. It is fine for writing and testing queries, but it is not where you want to spend time looking at production data. That is Grafana's job.

Grafana talks to Prometheus as a data source and turns PromQL queries into panels. The connection is straightforward. What matters is that Grafana does not store any metric data itself in this setup. It keeps its own dashboards and settings, but all the metrics live in Prometheus. Grafana is just a rendering layer. You can change your dashboards, delete them, rebuild them from scratch, and your underlying metrics data is untouched.

This is a good way to think about it: Prometheus is the database. Grafana is the reporting tool. They do different jobs and you should configure them separately.

Retention and Storage

By default Prometheus keeps data for 15 days. That is enough for operational dashboards but not for longer-term trends. If you want months of history you have two options: increase the retention (the --storage.tsdb.retention.time flag) and give Prometheus more disk, or ship the data to long-term storage with something like Thanos or Cortex.
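For a rough idea of what "more disk" means, the Prometheus storage documentation gives a rule of thumb: retention time in seconds, times ingested samples per second, times bytes per sample, with samples averaging around 1 to 2 bytes after compression. The sketch below just does that arithmetic. The function name and the example workload (100,000 active series scraped every 15 seconds) are assumptions for illustration, not measurements.

```python
def estimate_disk_bytes(active_series, scrape_interval_seconds,
                        retention_days, bytes_per_sample=2.0):
    """Back-of-envelope disk estimate; treat the result as an order of magnitude."""
    samples_per_second = active_series / scrape_interval_seconds
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical workload: 100,000 series, 15-second scrape interval.
default_retention = estimate_disk_bytes(100_000, 15, retention_days=15)
one_month = estimate_disk_bytes(100_000, 15, retention_days=30)

default_gb = default_retention / 1e9   # about 17 GB
one_month_gb = one_month / 1e9         # about 35 GB
```

Disk grows linearly with both retention and series count, so doubling retention is cheap when the series count is modest and expensive when labels have exploded it.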

For most projects I have worked on, 30 days of local retention covers 90% of the questions anyone actually asks. The conversations about "what did memory usage look like six months ago" happen less often than you think, and when they do, you can usually answer them from deployment notes and incident records rather than raw metrics.

What I Would Tell Myself Earlier

Start with the defaults. The prometheus-net library gives you HTTP metrics, runtime metrics, and process metrics out of the box. That alone is more useful than most custom instrumentation I have seen people build from scratch.

Add custom metrics only when you have a specific question the defaults cannot answer. "How many orders are in the pending state right now" is a good reason to add a gauge. "I should track everything just in case" is not.

And if you are already using Grafana, the Prometheus integration is the one that will make it feel like a genuinely useful tool rather than a nice-to-have. The two are designed to work together, and it shows.