Observability
Every microservice must have at minimum a Holistic View dashboard. The four dashboard types below are the standard set.
Dashboards are where observability becomes usable during real incidents. An alert may tell you that something is wrong, but the dashboard is what helps you understand scope, direction, and likely cause quickly. This guide focuses on building dashboards that support fast decisions instead of collecting decorative charts.
Quick Definitions
- Holistic View: the primary dashboard that answers whether the service is healthy right now
- APM: Application Performance Monitoring, typically used for latency, throughput, tracing, and error analysis
- Threshold line: a visual marker showing when a metric is approaching or crossing an alert boundary
- Deployment marker: a timestamp annotation showing when a new version was released
- Panel: one chart or widget on a dashboard
Why Dashboards Matter
Alerts tell you something is wrong. Dashboards tell you what and why. A good dashboard lets an on-call engineer determine in under 30 seconds whether a service is healthy, without querying logs or running commands.
Key takeaway: Dashboards are operational tools, not status posters. Their job is to speed up diagnosis.
Dashboard Design Principles
These principles keep dashboards readable under pressure, when responders have very little attention to spare.
| Principle | What it means |
|---|---|
| Answer a question | Every dashboard should have a title that states the question it answers: “Is the API healthy?” not “API Metrics” |
| Top-to-bottom flow | Put the most critical signals at the top (error rate, latency, throughput). Drill-down metrics go below |
| Consistent time ranges | All panels on a dashboard should use the same time window. Mixed windows cause false correlations |
| Minimal, not maximal | 8 well-chosen metrics > 40 metrics nobody reads. If it’s never been looked at during an incident, remove it |
| Threshold lines | Draw the alert threshold as a reference line on every gauge panel. Engineers instantly see proximity to alert |
| Deployment markers | Overlay deployments on time-series charts. A large share of incidents follow a deploy, so the marker makes that correlation visible at a glance |
| Actionable panels only | If seeing a metric doesn’t lead to an action, it doesn’t belong on the on-call dashboard |
Key takeaway: Dashboard design is mostly about editing down to what helps action, not adding every available metric.
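Several of these principles are mechanical enough to check automatically. The sketch below is one possible lint pass, not a feature of any dashboard tool: the `Panel` and `Dashboard` classes, the `lint_dashboard` function, and the panel budget of 8 are all hypothetical names and values chosen for illustration, and it assumes every panel is a gauge-style panel that should carry a threshold line.

```python
from dataclasses import dataclass, field


@dataclass
class Panel:
    title: str
    time_window: str                  # e.g. "1h"; hypothetical field
    has_threshold_line: bool = False  # hypothetical field


@dataclass
class Dashboard:
    title: str
    panels: list[Panel] = field(default_factory=list)


def lint_dashboard(dashboard: Dashboard, max_panels: int = 8) -> list[str]:
    """Return human-readable findings; an empty list means no issue was found."""
    findings = []

    # "Answer a question": the title should be phrased as a question.
    if not dashboard.title.strip().endswith("?"):
        findings.append("title does not state a question")

    # "Consistent time ranges": all panels should share one window.
    windows = {p.time_window for p in dashboard.panels}
    if len(windows) > 1:
        findings.append(f"mixed time windows: {sorted(windows)}")

    # "Minimal, not maximal": keep the panel count within a small budget.
    if len(dashboard.panels) > max_panels:
        findings.append(
            f"{len(dashboard.panels)} panels exceeds budget of {max_panels}"
        )

    # "Threshold lines": flag panels that lack a reference line.
    missing = [p.title for p in dashboard.panels if not p.has_threshold_line]
    if missing:
        findings.append(f"panels without threshold line: {missing}")

    return findings
```

A check like this could run in CI against exported dashboard definitions, but it cannot judge whether a panel is actionable; that still needs a human review.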
Dashboard Capabilities
Not every dashboard serves the same audience. Mixing business reporting, engineering debugging, and infrastructure triage into one page usually makes all three worse.
| Capability | Description | Data Source | Tool |
|---|---|---|---|
| Reporting (Business Metrics) | Daily/weekly reports, e.g. order counts and order states from transactional data, aggregated over time. Focus: past and present trends. | Postgres (read replica) | Looker Studio / Metabase |
| BI | Business intelligence technologies and processes for enterprise data analysis. Focus: future trends. | BigQuery | Looker Studio |
| Analytics / Data Science / ML | Discover hidden patterns using math and statistics. Focus: past and present. | BigQuery | — |
| APM / Infra / Log-based Metrics | Throughput, latency, 4XX/5XX errors, logs-based custom metrics | Datadog | Datadog |
| Billing Reports | GCP resource usage costs | BigQuery | Looker Studio |
| Code Quality Reports | Test coverage, duplication, security vulnerabilities, tech debt | SonarQube | SonarQube |
| DORA Metrics (4 Key Metrics) | Deployment frequency, lead time for changes, mean time to restore (MTTR), change failure rate | Custom | Custom |
Key takeaway: Choose dashboard tooling based on the question being answered, not on the tool your team happens to know best.
Metrics Tooling Matrix
This matrix is useful as a default ownership and tool-selection reference.
| Metric Domain | Tool |
|---|---|
| Infra | Datadog / Google Console |
| JVM | Datadog |
| APM | Datadog |
| Functional Metrics | Datadog (based on logs and metrics) |
| Postgres Metrics | Datadog / Google Console |
| Redis Metrics | Datadog / Google Console |
| Log Monitoring | Datadog |
| Pub/Sub Monitoring | Datadog |
| Error / DLQ Monitoring | Datadog / LCT |
| APIGEE Monitoring | Datadog |
| Daily Reports from DB | Metabase |
| Daily Reports from Datalake | Looker Studio |
Key takeaway: Standardizing tooling by metric domain reduces confusion during incidents and handoffs.
Sensible Defaults – Required Dashboards Per Service
Not every service needs every possible dashboard on day one, but every service needs a minimum observability footprint.
The four dashboards below are the standard set for a microservice; the Holistic View is the mandatory one:
| Dashboard | Audience | Refresh Rate |
|---|---|---|
| 1. Holistic View – Mandatory | On-call engineer, team leads | 30 seconds |
| 2. Infrastructure View | SRE, platform team | 1 minute |
| 3. DB View | DBA, backend engineers | 1 minute |
| 4. JVM Detailed View | Java engineers, performance debugging | 1 minute |
Key takeaway: Start with a mandatory minimum set, then add specialized views only when they support a clear need.
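The table above can be kept as data and used to audit a service. This is a minimal sketch under stated assumptions: `REQUIRED_DASHBOARDS`, `MANDATORY_DASHBOARDS`, and `audit_service` are hypothetical names, refresh rates are expressed in seconds, and only the Holistic View is treated as a hard requirement, following the "Mandatory" label in the table.

```python
# Standard dashboards and their refresh rates in seconds, taken from the table.
REQUIRED_DASHBOARDS = {
    "Holistic View": 30,
    "Infrastructure View": 60,
    "DB View": 60,
    "JVM Detailed View": 60,
}

# Assumption: only the dashboard labelled "Mandatory" blocks a service.
MANDATORY_DASHBOARDS = {"Holistic View"}


def audit_service(existing: dict[str, int]) -> dict[str, list[str]]:
    """Compare a service's dashboards (name -> refresh seconds) to the standard set."""
    missing = [name for name in REQUIRED_DASHBOARDS if name not in existing]
    return {
        "missing_mandatory": [n for n in missing if n in MANDATORY_DASHBOARDS],
        "missing_standard": [n for n in missing if n not in MANDATORY_DASHBOARDS],
        # A dashboard that refreshes less often than the standard is flagged.
        "refresh_too_slow": [
            name
            for name, seconds in existing.items()
            if name in REQUIRED_DASHBOARDS and seconds > REQUIRED_DASHBOARDS[name]
        ],
    }
```

For example, `audit_service({"Holistic View": 60, "DB View": 60})` reports no missing mandatory dashboard, two missing standard dashboards, and a Holistic View that refreshes too slowly.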
Holistic View Layout
The holistic view is the fastest way to decide whether the service problem is in the app, a dependency, the database, the queue, or the runtime.
The holistic view is the single pane of glass for a service. Every metric marked ✅ is mandatory.
| Category | Metric | Mandatory | Why |
|---|---|---|---|
| App Metrics | Throughput | ✅ | Baseline => are we getting requests? |
| App Metrics | Latency 95th percentile | ✅ | User experience => are requests fast? |
| App Metrics | Overall Error Rate | ✅ | Health signal => are requests succeeding? |
| App Metrics | 4XX Errors | ✅ | Client errors => are clients sending bad requests? |
| App Metrics | 5XX Errors | ✅ | Server errors => is our code broken? |
| App Metrics | Custom Errors | - | Domain-specific signals |
| External Calls | External Call Latency 95th percentile | ✅ | Dependency health => are upstreams slow? |
| External Calls | External Call 4XX, 5XX Error | ✅ | Dependency failures => are upstreams broken? |
| DB Calls | Custom DB Call Latency 95th percentile | ✅ | DB health proxy => slow queries surface here |
| DB Calls | Custom DB Errors | - | Domain-specific query failures |
| Pub/Sub | Oldest Unacked Age | ✅ | Consumer lag => are we keeping up with the queue? |
| Pub/Sub | Unacked Message Count | ✅ | Queue depth => are messages accumulating? |
| Pub/Sub | Subscriber Latency | ✅ | Processing time per message |
| Redis | Memory Size (total vs used) | ✅ | Cache saturation => eviction risk? |
| Postgres | Transactions per second | ✅ | DB throughput baseline |
| Postgres | Postgres connections | ✅ | Pool saturation => heading toward max_connections? |
| JVM | Thread states | - | JVM deep-dive (separate dashboard) |
| JVM | JVM Threads | - | JVM deep-dive (separate dashboard) |
| JVM | JVM Memory | - | JVM deep-dive (separate dashboard) |
| JVM | GC Pause Time | - | JVM deep-dive (separate dashboard) |
| Hikari Connection Pool | Total connections (idle, pending, active) | ✅ | Pool health => pending > 0 for > 30s = problem |
| Hikari Connection Pool | Min / Max connection | ✅ | Config sanity check |
| Node | Event loop lag | ✅ | Node.js I/O health |
| Node | Active handlers | ✅ | Concurrent operation count |
| Node | Memory Usage | ✅ | Node memory baseline |
| Node | Heap available | ✅ | GC pressure indicator |
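The Hikari row gives a concrete rule: pending connections above zero for more than 30 seconds signal a problem. A minimal sketch of that rule follows, assuming the pool metric arrives as time-ordered `(timestamp_seconds, pending_count)` samples; the function name `pending_sustained` and the sample format are illustrative, not part of HikariCP or any monitoring tool.

```python
def pending_sustained(
    samples: list[tuple[float, int]], min_duration_s: float = 30.0
) -> bool:
    """Return True if pending > 0 held continuously for more than min_duration_s.

    samples must be sorted by timestamp (seconds).
    """
    run_start = None  # timestamp where the current run of pending > 0 began
    for timestamp, pending in samples:
        if pending > 0:
            if run_start is None:
                run_start = timestamp
            if timestamp - run_start > min_duration_s:
                return True
        else:
            # A sample with no pending threads ends the current run.
            run_start = None
    return False
```

Requiring a sustained run rather than a single sample above zero avoids flagging the brief spikes that a healthy pool shows under bursty load.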
Reading Order
Use the dashboard in this order during an incident:
- Check throughput, latency, and overall error rate.
- Check dependency panels such as external calls, DB calls, and queue lag.
- Check pool and runtime panels for saturation signals.
- Use deployment markers to see whether a release lines up with the behavior change.
Key takeaway: A good holistic dashboard supports a repeatable investigation flow instead of forcing engineers to hunt randomly.
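The reading order can be written down as an ordered list of stages, so every responder reports findings in the same sequence. This is only a sketch: the signal names in `TRIAGE_ORDER` are hypothetical identifiers for the panels described above, and `triage` assumes something else has already decided which signals are breaching their thresholds.

```python
# Stages in the order the dashboard should be read; signal names are illustrative.
TRIAGE_ORDER = [
    ("traffic", ["throughput", "latency_p95", "error_rate"]),
    ("dependencies", [
        "external_call_latency_p95",
        "db_call_latency_p95",
        "pubsub_oldest_unacked_age",
    ]),
    ("saturation", [
        "hikari_pending_connections",
        "postgres_connections",
        "redis_memory_used",
    ]),
]


def triage(breaching: set[str], deploy_near_change: bool) -> list[str]:
    """List breaching signals stage by stage, then note a coinciding deploy."""
    findings = []
    for stage, signals in TRIAGE_ORDER:
        hits = [s for s in signals if s in breaching]
        if hits:
            findings.append(f"{stage}: {', '.join(hits)}")
    # The deployment marker is checked last, and only matters if something is off.
    if findings and deploy_near_change:
        findings.append("deploy: a release lines up with the behavior change")
    return findings
```

Calling `triage({"error_rate", "db_call_latency_p95"}, True)` yields a traffic finding, then a dependency finding, then the deploy note, which mirrors the four steps above.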
On-Call Dashboard Readiness Checklist
Before declaring a dashboard “on-call ready”, verify that it helps responders who did not build it:
- ☐ Every mandatory metric is present
- ☐ Alert thresholds are drawn as reference lines on error rate and latency charts
- ☐ Deployment markers are overlaid on time-series panels
- ☐ Dashboard time range defaults to “last 1 hour” (not “last 3 months”)
- ☐ Dashboard is accessible without admin/edit permissions to on-call engineers
- ☐ Dashboard link is in the service README and runbook
- ☐ All panel titles state what they measure (not just metric names)
- ☐ No panels show data older than 30 days by default (causes chart compression)
Key takeaway: If a dashboard is hard to access, hard to read, or missing thresholds, it is not on-call ready no matter how many panels it has.
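Part of the checklist can be expressed as a function that returns the gaps still open. The sketch below covers only the items that are easy to represent as data; `DashboardMeta`, its fields, and `readiness_gaps` are hypothetical names, and items such as panel-title quality still need a human reviewer.

```python
from dataclasses import dataclass, field


@dataclass
class DashboardMeta:
    default_range_hours: float
    viewable_without_edit_rights: bool
    linked_in_readme: bool
    linked_in_runbook: bool
    present_metrics: set[str] = field(default_factory=set)


def readiness_gaps(meta: DashboardMeta, mandatory_metrics: set[str]) -> list[str]:
    """Return the checklist items that still fail; empty means on-call ready."""
    gaps = []
    missing = sorted(mandatory_metrics - meta.present_metrics)
    if missing:
        gaps.append(f"missing mandatory metrics: {missing}")
    if meta.default_range_hours != 1:
        gaps.append("default time range is not 'last 1 hour'")
    if not meta.viewable_without_edit_rights:
        gaps.append("on-call engineers need admin/edit rights to view it")
    if not (meta.linked_in_readme and meta.linked_in_runbook):
        gaps.append("link missing from the service README or runbook")
    return gaps
```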
Dashboard Anti-Patterns
Danger: ANTI-PATTERN
Putting 40+ panels on one dashboard.
No one reads them all. On-call engineers scan the top 6 panels. Move deep-dive metrics to a separate “Debug” dashboard and link to it from the holistic view.
Danger: ANTI-PATTERN
Panels without threshold lines.
A 4% error rate on a chart looks fine until you realize your SLO threshold is 1%. Always draw the threshold.
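The arithmetic behind that example is simple: 4% against a 1% threshold is four times over the limit. A small sketch of the comparison a threshold line makes visible follows, assuming a metric where higher is worse and a positive threshold; `threshold_status` and the 80% warning ratio are illustrative choices, not a standard.

```python
def threshold_status(value: float, threshold: float, warn_ratio: float = 0.8) -> str:
    """Classify a higher-is-worse metric relative to its alert threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    ratio = value / threshold
    if ratio >= 1:
        return f"BREACH ({ratio:.1f}x threshold)"
    if ratio >= warn_ratio:
        return f"APPROACHING ({ratio:.0%} of threshold)"
    return "OK"
```

With the numbers from the example, `threshold_status(4.0, 1.0)` returns `"BREACH (4.0x threshold)"`, which is the fact the bare chart hides.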
Tip
Make the holistic dashboard the first one on-call engineers open. It should answer “is this service healthy right now?” without requiring navigation between multiple dashboards.
Final Takeaways
- Build dashboards around questions responders need answered quickly.
- Keep the holistic view small, obvious, and ordered by triage value.
- Separate operational dashboards from BI and reporting use cases.
- Standardize thresholds, deployment markers, and tool choices so dashboards behave predictably.
- Remove panels that do not influence action during incidents.