Observability Dashboards: Best Practices

Every microservice must have at minimum a Holistic View dashboard. The four dashboard types below are the standard set.

Dashboards are where observability becomes usable during real incidents. An alert may tell you that something is wrong, but the dashboard is what helps you understand scope, direction, and likely cause quickly. This guide focuses on building dashboards that support fast decisions instead of collecting decorative charts.

Quick Definitions

Holistic View: the primary dashboard that answers whether the service is healthy right now
APM: Application Performance Monitoring, typically used for latency, throughput, tracing, and error analysis
Threshold line: a visual marker showing when a metric is approaching or crossing an alert boundary
Deployment marker: a timestamp annotation showing when a new version was released
Panel: one chart or widget on a dashboard

Why Dashboards Matter

Alerts tell you something is wrong. Dashboards tell you what and why. A good dashboard lets an on-call engineer determine in under 30 seconds whether a service is healthy, without querying logs or running commands.

Key takeaway: dashboards are operational tools, not status posters. Their job is to speed up diagnosis.

Dashboard Design Principles

These principles keep dashboards readable under pressure, when responders have very little attention to spare.

Principle	What it means
Answer a question	Every dashboard should have a title that states the question it answers: “Is the API healthy?” not “API Metrics”
Top-to-bottom flow	Put the most critical signals at the top (error rate, latency, throughput). Drill-down metrics go below
Consistent time ranges	All panels on a dashboard should use the same time window. Mixed windows cause false correlations
Minimal, not maximal	8 well-chosen metrics > 40 metrics nobody reads. If it’s never been looked at during an incident, remove it
Threshold lines	Draw the alert threshold as a reference line on every gauge panel. Engineers instantly see proximity to alert
Deployment markers	Overlay deployments on time-series charts. Many incidents follow a deploy, so the marker is often the fastest clue
Actionable panels only	If seeing a metric doesn’t lead to an action, it doesn’t belong on the on-call dashboard

Key takeaway: Dashboard design is mostly about editing down to what helps action, not adding every available metric.
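
Several of these principles can be checked mechanically. The following is a minimal sketch, assuming a hypothetical in-memory representation of a dashboard (the `Dashboard` and `Panel` classes and the `time_window` field are illustrative, not the schema of any particular tool). It flags a title that is not phrased as a question, mixed time windows, and a panel count above a chosen budget.

```python
from dataclasses import dataclass, field


@dataclass
class Panel:
    title: str
    time_window: str  # hypothetical field, e.g. "1h"


@dataclass
class Dashboard:
    title: str
    panels: list[Panel] = field(default_factory=list)


def lint_dashboard(dashboard: Dashboard, max_panels: int = 8) -> list[str]:
    """Return human-readable findings; an empty list means nothing was flagged."""
    findings = []
    # "Answer a question": the title should read as a question.
    if not dashboard.title.strip().endswith("?"):
        findings.append("title does not state a question")
    # "Consistent time ranges": all panels should share one window.
    windows = {p.time_window for p in dashboard.panels}
    if len(windows) > 1:
        findings.append(f"mixed time windows: {sorted(windows)}")
    # "Minimal, not maximal": flag dashboards above the panel budget.
    if len(dashboard.panels) > max_panels:
        findings.append(
            f"{len(dashboard.panels)} panels exceeds budget of {max_panels}"
        )
    return findings
```

The default budget of 8 panels mirrors the "8 well-chosen metrics" guidance in the table; teams may reasonably pick a different number.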

Dashboard Capabilities

Not every dashboard serves the same audience. Mixing business reporting, engineering debugging, and infrastructure triage into one page usually makes all three worse.

Capability	Description	Data Source	Tool
Reporting (Business Metrics)	Daily/weekly reports e.g. order counts and states from transactional data, aggregated over time. Focus: past and present trends.	Postgres (read replica)	Looker Studio / Metabase
BI	Business intelligence technologies and processes for enterprise data analysis. Focus: future trends.	BigQuery	Looker Studio
Analytics / Data Science / ML	Discover hidden patterns using math and statistics. Focus: past and present.	BigQuery	—
APM / Infra / Log-based Metrics	Throughput, latency, 4XX/5XX errors, logs-based custom metrics	Datadog	Datadog
Billing Reports	GCP resource usage costs	BigQuery	Looker Studio
Code Quality Reports	Test coverage, duplication, security vulnerabilities, tech debt	SonarQube	SonarQube
DORA Metrics (4 Key Metrics)	Deployment frequency, lead time for changes, MTTR, change failure rate	Custom	Custom

Key takeaway: Choose dashboard tooling based on the question being answered, not on the tool your team happens to know best.

Metrics Tooling Matrix

This matrix is useful as a default ownership and tool-selection reference.

Metric Domain	Tool
Infra	Datadog / Google Console
JVM	Datadog
APM	Datadog
Functional Metrics	Datadog (based on logs and metrics)
Postgres Metrics	Datadog / Google Console
Redis Metrics	Datadog / Google Console
Log Monitoring	Datadog
Pub/Sub Monitoring	Datadog
Error / DLQ Monitoring	Datadog / LCT
APIGEE Monitoring	Datadog
Daily Reports from DB	Metabase
Daily Reports from Datalake	Looker Studio

Key takeaway: Standardizing tooling by metric domain reduces confusion during incidents and handoffs.

Sensible Defaults – Required Dashboards Per Service

Not every service needs every possible dashboard on day one, but each one does need a minimum observability footprint.

The standard set for each microservice is the four dashboards below. Only the Holistic View is mandatory from day one:

Dashboard	Audience	Refresh Rate
1. Holistic View – Mandatory	On-call engineer, team leads	30 seconds
2. Infrastructure View	SRE, platform team	1 minute
3. DB View	DBA, backend engineers	1 minute
4. JVM Detailed View	Java engineers, performance debugging	1 minute

Key takeaway: Start with a mandatory minimum set, then add specialized views only when they support a clear need.

Holistic View Layout

The holistic view is the fastest way to decide whether the service problem is in the app, a dependency, the database, the queue, or the runtime.

The holistic view is the single pane of glass for a service. Every metric marked ✅ is mandatory.

Category	Metric	Mandatory	Why
App Metrics	Throughput	✅	Baseline => are we getting requests?
	Latency 95th percentile	✅	User experience => are requests fast?
	Overall Error Rate	✅	Health signal => are requests succeeding?
	4XX Errors	✅	Client errors => are clients sending bad requests?
	5XX Errors	✅	Server errors => is our code broken?
	Custom Errors	—	Domain-specific signals
External Calls	External Call Latency 95th percentile	✅	Dependency health => are upstreams slow?
	External Call 4XX, 5XX Errors	✅	Dependency failures => are upstreams broken?
DB Calls	Custom DB Call Latency 95th percentile	✅	DB health proxy => slow queries surface here
	Custom DB Errors	—	Domain-specific query failures
Pub/Sub	Oldest Unacked Age	✅	Consumer lag => are we keeping up with the queue?
	Unacked Message Count	✅	Queue depth => are messages accumulating?
	Subscriber Latency	✅	Processing time per message
Redis	Memory Size (total vs used)	✅	Cache saturation => eviction risk?
Postgres	Transactions per second	✅	DB throughput baseline
	Postgres connections	✅	Pool saturation => heading toward max_connections?
JVM	Thread states	—	JVM deep-dive (separate dashboard)
	JVM Threads	—
	JVM Memory	—
	GC Pause Time	—
Hikari Connection Pool	Total connections (idle, pending, active)	✅	Pool health => pending > 0 for more than 30s signals a problem
	Min / Max connection	✅	Config sanity check
Node	Event loop lag	✅	Node.js I/O health
	Active handlers	✅	Concurrent operation count
	Memory Usage	✅	Node memory baseline
	Heap available	✅	GC pressure indicator
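
The Hikari rule in the table (pending connections above zero for more than 30 seconds) is easy to misread as "any pending connection is a problem". A minimal sketch of the sustained-condition check follows, assuming samples arrive as `(timestamp_seconds, pending_count)` pairs sorted by time. The function name and input shape are illustrative and do not come from any monitoring tool.

```python
def pending_sustained(samples, min_duration_s=30.0):
    """Return True if pending stayed above zero for longer than min_duration_s.

    samples: iterable of (timestamp_seconds, pending_count), sorted by time.
    """
    streak_start = None
    for ts, pending in samples:
        if pending > 0:
            # Remember when the current run of non-zero samples began.
            if streak_start is None:
                streak_start = ts
            if ts - streak_start > min_duration_s:
                return True
        else:
            # Any zero sample resets the streak.
            streak_start = None
    return False
```

A brief spike that clears within the window returns `False`, which matches the intent of the rule: short bursts of pending requests are normal, sustained waiting is not.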

Reading Order

Use the dashboard in this order during an incident:

Check throughput, latency, and overall error rate.
Check dependency panels such as external calls, DB calls, and queue lag.
Check pool and runtime panels for saturation signals.
Use deployment markers to see whether a release lines up with the behavior change.
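
The reading order above can be sketched as a small triage helper. This is an illustration under stated assumptions: the panel keys (`throughput`, `queue_lag`, and so on), the `panel_ok` mapping, and the 15-minute default window are hypothetical, and a real dashboard would supply its own health signals.

```python
# Stages in the same order as the reading order above.
READING_ORDER = [
    ("app", ["throughput", "latency_p95", "error_rate"]),
    ("dependencies", ["external_calls", "db_calls", "queue_lag"]),
    ("saturation", ["connection_pool", "runtime"]),
]


def triage(panel_ok):
    """Return [(stage, failing_panels)] in reading order.

    panel_ok maps a panel key to True (healthy) or False (unhealthy).
    Panels missing from the mapping are treated as healthy.
    """
    findings = []
    for stage, panels in READING_ORDER:
        failing = [p for p in panels if not panel_ok.get(p, True)]
        if failing:
            findings.append((stage, failing))
    return findings


def deploys_before(change_ts, deploy_timestamps, window_s=900):
    """Return deploys that happened at most window_s seconds before change_ts."""
    return [d for d in deploy_timestamps if 0 <= change_ts - d <= window_s]
```

For example, `triage({"latency_p95": False, "db_calls": False})` reports the app stage first and the dependency stage second, which points the responder at the database as a likely cause of the latency.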

Key takeaway: A good holistic dashboard supports a repeatable investigation flow instead of forcing engineers to hunt randomly.

On-Call Dashboard Readiness Checklist

Before declaring a dashboard “on-call ready”, verify that it helps responders who did not build it:

☐ Every mandatory metric is present
☐ Alert thresholds are drawn as reference lines on error rate and latency charts
☐ Deployment markers are overlaid on time-series panels
☐ Dashboard time range defaults to “last 1 hour” (not “last 3 months”)
☐ Dashboard is accessible to on-call engineers without admin/edit permissions
☐ Dashboard link is in the service README and runbook
☐ All panel titles state what they measure (not just metric names)
☐ No panels show data older than 30 days by default (causes chart compression)
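
Parts of this checklist can be automated as a pre-release check. The sketch below assumes a hypothetical dashboard config expressed as a plain dictionary; the keys (`metrics`, `default_time_range`, `threshold_lines`, `deployment_markers`) and the metric names are illustrative and cover only a subset of the mandatory metrics.

```python
# Illustrative subset of the mandatory holistic-view metrics (hypothetical names).
MANDATORY_METRICS = {
    "throughput", "latency_p95", "error_rate", "errors_4xx", "errors_5xx",
}


def readiness_gaps(config):
    """Return a list of checklist items the dashboard config does not satisfy."""
    gaps = []
    missing = MANDATORY_METRICS - set(config.get("metrics", []))
    if missing:
        gaps.append(f"missing mandatory metrics: {sorted(missing)}")
    # The checklist asks for a "last 1 hour" default range.
    if config.get("default_time_range") != "1h":
        gaps.append("default time range is not last 1 hour")
    # Thresholds must be drawn on error rate and latency charts.
    for chart in ("error_rate", "latency_p95"):
        if chart not in config.get("threshold_lines", []):
            gaps.append(f"no threshold line on {chart}")
    if not config.get("deployment_markers", False):
        gaps.append("deployment markers not overlaid")
    return gaps
```

Items such as access permissions and README links depend on systems outside the dashboard definition, so they are left out of this sketch and still need a manual check.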

Key takeaway: if a dashboard is hard to access, hard to read, or missing thresholds, it is not on-call ready no matter how many panels it has.

Dashboard Anti-Patterns

Danger: ANTI-PATTERN

Putting 40+ panels on one dashboard.

No one reads them all. On-call engineers scan the top 6 panels. Move deep-dive metrics to a separate “Debug” dashboard and link to it from the holistic view.

Danger: ANTI-PATTERN

Panels without threshold lines.

A 4% error rate on a chart looks fine until you realize your SLO threshold is 1%. Always draw the threshold.

Tip

Make the holistic dashboard the first one on-call engineers open. It should answer “is this service healthy right now?” without requiring navigation between multiple dashboards.

Final Takeaways

Build dashboards around questions responders need answered quickly.
Keep the holistic view small, obvious, and ordered by triage value.
Separate operational dashboards from BI and reporting use cases.
Standardize thresholds, deployment markers, and tool choices so dashboards behave predictably.
Remove panels that do not influence action during incidents.

Written by

Balaji G
