Observability Dashboards: Best Practices

Every microservice must have at minimum a Holistic View dashboard. The four dashboard types below are the standard set.

Dashboards are where observability becomes usable during real incidents. An alert may tell you that something is wrong, but the dashboard is what helps you understand scope, direction, and likely cause quickly. This guide focuses on building dashboards that support fast decisions instead of collecting decorative charts.

Quick Definitions

  • Holistic View: the primary dashboard that answers whether the service is healthy right now
  • APM: Application Performance Monitoring, typically used for latency, throughput, tracing, and error analysis
  • Threshold line: a visual marker showing when a metric is approaching or crossing an alert boundary
  • Deployment marker: a timestamp annotation showing when a new version was released
  • Panel: one chart or widget on a dashboard

Why Dashboards Matter

Alerts tell you something is wrong. Dashboards tell you what and why. A good dashboard lets an on-call engineer determine in under 30 seconds whether a service is healthy, without querying logs or running commands.

Key takeaway: Dashboards are operational tools, not status posters. Their job is to speed up diagnosis.

Dashboard Design Principles

These principles keep dashboards readable under pressure, when responders have very little attention to spare.

Principle | What it means
Answer a question | Every dashboard should have a title that states the question it answers: “Is the API healthy?” not “API Metrics”
Top-to-bottom flow | Put the most critical signals at the top (error rate, latency, throughput). Drill-down metrics go below
Consistent time ranges | All panels on a dashboard should use the same time window. Mixed windows cause false correlations
Minimal, not maximal | 8 well-chosen metrics > 40 metrics nobody reads. If it’s never been looked at during an incident, remove it
Threshold lines | Draw the alert threshold as a reference line on every gauge panel. Engineers instantly see proximity to alert
Deployment markers | Overlay deployments on time-series charts. 80% of incidents follow a deploy
Actionable panels only | If seeing a metric doesn’t lead to an action, it doesn’t belong on the on-call dashboard

Key takeaway: Dashboard design is mostly about editing down to what helps action, not adding every available metric.
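
Several of these principles are mechanical enough to check automatically. The sketch below is one possible lint over a dashboard definition. It is not tied to any real dashboard tool: the `Panel` and `Dashboard` classes, their field names, and the `MAX_PANELS` limit of 8 (borrowed from the "8 well-chosen metrics" guideline) are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Panel:
    title: str
    time_window: str  # e.g. "1h"; hypothetical field, not a real tool's schema


@dataclass
class Dashboard:
    title: str
    panels: list[Panel] = field(default_factory=list)


MAX_PANELS = 8  # assumption based on the "minimal, not maximal" guideline


def lint_dashboard(dashboard: Dashboard) -> list[str]:
    """Return human-readable findings; an empty list means nothing was flagged."""
    findings = []
    # "Answer a question": the title should be phrased as a question.
    if not dashboard.title.strip().endswith("?"):
        findings.append("title should state the question the dashboard answers")
    # "Minimal, not maximal": flag dashboards that have grown too large.
    if len(dashboard.panels) > MAX_PANELS:
        findings.append(f"{len(dashboard.panels)} panels; consider trimming to {MAX_PANELS}")
    # "Consistent time ranges": every panel should share one window.
    windows = {panel.time_window for panel in dashboard.panels}
    if len(windows) > 1:
        findings.append(f"mixed time windows: {sorted(windows)}")
    return findings


api = Dashboard("API Metrics", [Panel("Error rate", "1h"), Panel("Latency p95", "24h")])
for finding in lint_dashboard(api):
    print(finding)
```

A check like this could run in CI against exported dashboard definitions, but the judgment calls (is a panel actionable?) still need a human review.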

Dashboard Capabilities

Not every dashboard serves the same audience. Mixing business reporting, engineering debugging, and infrastructure triage into one page usually makes all three worse.

Capability | Description | Data Source | Tool
Reporting (Business Metrics) | Daily/weekly reports, e.g. order counts and states from transactional data, aggregated over time. Focus: past and present trends. | Postgres (read replica) | Looker Studio / Metabase
BI | Business intelligence technologies and processes for enterprise data analysis. Focus: future trends. | BigQuery | Looker Studio
Analytics / Data Science / ML | Discover hidden patterns using math and statistics. Focus: past and present. | BigQuery |
APM / Infra / Log-based Metrics | Throughput, latency, 4XX/5XX errors, logs-based custom metrics | Datadog | Datadog
Billing Reports | GCP resource usage costs | BigQuery | Looker Studio
Code Quality Reports | Test coverage, duplication, security vulnerabilities, tech debt | SonarQube | SonarQube
DORA Metrics (4 Key Metrics) | Deployment frequency, lead time to change, MTTR, change failure rate | Custom | Custom

Key takeaway: Choose dashboard tooling based on the question being answered, not on the tool your team happens to know best.

Metrics Tooling Matrix

This matrix is useful as a default ownership and tool-selection reference.

Metric Domain | Tool
Infra | Datadog / Google Cloud Console
JVM | Datadog
APM | Datadog
Functional Metrics | Datadog (based on logs and metrics)
Postgres Metrics | Datadog / Google Cloud Console
Redis Metrics | Datadog / Google Cloud Console
Log Monitoring | Datadog
Pub/Sub Monitoring | Datadog
Error / DLQ Monitoring | Datadog / LCT
Apigee Monitoring | Datadog
Daily Reports from DB | Metabase
Daily Reports from Datalake | Looker Studio (formerly Google Data Studio)

Key takeaway: Standardizing tooling by metric domain reduces confusion during incidents and handoffs.

Sensible Defaults – Required Dashboards Per Service

Not every service needs every possible dashboard on day one, but every service needs a minimum observability footprint.

The standard set for a microservice is the four dashboards below. The Holistic View is the one that is mandatory from the start:

Dashboard | Audience | Refresh Rate
1. Holistic View (mandatory) | On-call engineer, team leads | 30 seconds
2. Infrastructure View | SRE, platform team | 1 minute
3. DB View | DBA, backend engineers | 1 minute
4. JVM Detailed View | Java engineers, performance debugging | 1 minute

Key takeaway: Start with a mandatory minimum set, then add specialized views only when they support a clear need.
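
One way to keep this minimum set visible is to encode it as data and compare it against what a service actually has. The sketch below assumes the dashboard names and refresh rates from the table above and treats only the Holistic View as blocking. The `coverage_gaps` function and the "blocking" / "recommended" labels are hypothetical, not part of any tool.

```python
# name -> (mandatory, refresh interval in seconds), taken from the table above
REQUIRED_DASHBOARDS = {
    "Holistic View": (True, 30),
    "Infrastructure View": (False, 60),
    "DB View": (False, 60),
    "JVM Detailed View": (False, 60),
}


def coverage_gaps(existing: set[str]) -> dict[str, list[str]]:
    """Split missing dashboards into blocking (mandatory) and recommended."""
    missing = [name for name in REQUIRED_DASHBOARDS if name not in existing]
    return {
        "blocking": [n for n in missing if REQUIRED_DASHBOARDS[n][0]],
        "recommended": [n for n in missing if not REQUIRED_DASHBOARDS[n][0]],
    }


# Example: a service that has only a holistic and a DB dashboard so far.
print(coverage_gaps({"Holistic View", "DB View"}))
```

A non-empty "blocking" list could fail a service-readiness review, while "recommended" gaps could simply be tracked as follow-up work.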

Holistic View Layout

The holistic view is the single pane of glass for a service. It is the fastest way to decide whether a problem lies in the app, a dependency, the database, the queue, or the runtime. Every metric marked is mandatory.

Category | Metric | Mandatory | Why
App Metrics | Throughput | | Baseline => are we getting requests?
App Metrics | Latency 95th percentile | | User experience => are requests fast?
App Metrics | Overall Error Rate | | Health signal => are requests succeeding?
App Metrics | 4XX Errors | | Client errors => are clients sending bad requests?
App Metrics | 5XX Errors | | Server errors => is our code broken?
App Metrics | Custom Errors | | Domain-specific signals
External Calls | External Call Latency 95th percentile | | Dependency health => are upstreams slow?
External Calls | External Call 4XX, 5XX Error | | Dependency failures => are upstreams broken?
DB Calls | Custom DB Call Latency 95th percentile | | DB health proxy => slow queries surface here
DB Calls | Custom DB Errors | | Domain-specific query failures
Pub/Sub | Oldest Unacked Age | | Consumer lag => are we keeping up with the queue?
Pub/Sub | Unacked Message Count | | Queue depth => are messages accumulating?
Pub/Sub | Subscriber Latency | | Processing time per message
Redis | Memory Size (total vs used) | | Cache saturation => eviction risk?
Postgres | Transactions per second | | DB throughput baseline
Postgres | Postgres connections | | Pool saturation => heading toward max_connections?
JVM | Thread states | | JVM deep-dive (separate dashboard)
JVM | JVM Threads | |
JVM | JVM Memory | |
JVM | GC Pause Time | |
Hikari Connection Pool | Total connections (idle, pending, active) | | Pool health => pending > 0 for > 30s = problem
Hikari Connection Pool | Min / Max connection | | Config sanity check
Node | Event loop lag | | Node.js I/O health
Node | Active handlers | | Concurrent operation count
Node | Memory Usage | | Node memory baseline
Node | Heap available | | GC pressure indicator
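
The Hikari rule in the table ("pending > 0 for > 30s = problem") is a sustained-condition check rather than a simple threshold. A minimal sketch of that logic follows, assuming pending-connection samples arrive as time-ordered `(timestamp_seconds, pending)` pairs. The `pending_sustained` function and the sample format are illustrative, not a HikariCP or Datadog API.

```python
from typing import Iterable, Optional, Tuple


def pending_sustained(samples: Iterable[Tuple[float, int]],
                      limit_seconds: float = 30.0) -> bool:
    """Return True if pending connections stayed above zero for longer
    than limit_seconds. Samples must be ordered by timestamp."""
    streak_start: Optional[float] = None
    for timestamp, pending in samples:
        if pending > 0:
            if streak_start is None:
                streak_start = timestamp  # a new streak of waiting threads begins
            if timestamp - streak_start > limit_seconds:
                return True
        else:
            streak_start = None  # pool recovered; reset the streak
    return False


# Pending threads from t=10s through t=50s: a 40-second streak.
samples = [(0, 0), (10, 2), (20, 1), (30, 3), (40, 1), (50, 4)]
print(pending_sustained(samples))
```

Brief spikes of pending connections are normal under bursty load. Requiring the condition to hold across a window is what separates saturation from noise.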

Reading Order

Use the dashboard in this order during an incident:

  1. Check throughput, latency, and overall error rate.
  2. Check dependency panels such as external calls, DB calls, and queue lag.
  3. Check pool and runtime panels for saturation signals.
  4. Use deployment markers to see whether a release lines up with the behavior change.

Key takeaway: A good holistic dashboard supports a repeatable investigation flow instead of forcing engineers to hunt randomly.
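
The reading order above can be sketched as a small triage routine that walks the panel groups in the same sequence and then checks for a recent deploy. The panel names, the `triage` function, and the 30-minute lookback window are assumptions for illustration. A real implementation would read panel state from the monitoring tool.

```python
from datetime import datetime, timedelta
from typing import List, Mapping, Sequence

# Steps 1-3 of the reading order: (stage, panels checked at that stage)
TRIAGE_ORDER = [
    ("app", ["throughput", "latency_p95", "error_rate"]),
    ("dependencies", ["external_calls", "db_calls", "queue_lag"]),
    ("saturation", ["connection_pool", "runtime"]),
]


def triage(panel_ok: Mapping[str, bool],
           change_started: datetime,
           deploys: Sequence[datetime],
           lookback: timedelta = timedelta(minutes=30)) -> List[str]:
    """Return findings in reading order. Panels absent from panel_ok are
    treated as healthy."""
    findings = []
    for stage, panels in TRIAGE_ORDER:
        failing = [p for p in panels if not panel_ok.get(p, True)]
        if failing:
            findings.append(f"{stage}: {', '.join(failing)}")
    # Step 4: does a deploy fall shortly before the behavior change?
    recent = [d for d in deploys if timedelta(0) <= change_started - d <= lookback]
    if recent:
        findings.append(f"deploy at {max(recent).isoformat()} precedes the change")
    return findings


started = datetime(2024, 1, 1, 12, 0)
deploys = [datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 11, 50)]
for line in triage({"error_rate": False, "db_calls": False}, started, deploys):
    print(line)
```

The routine reports every stage instead of stopping at the first failure, because app-level symptoms usually need to be read alongside the dependency and saturation panels to find a cause.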

On-Call Dashboard Readiness Checklist

Before declaring a dashboard “on-call ready”, verify that it helps responders who did not build it:

  • ☐ Every mandatory metric is present
  • ☐ Alert thresholds are drawn as reference lines on error rate and latency charts
  • ☐ Deployment markers are overlaid on time-series panels
  • ☐ Dashboard time range defaults to “last 1 hour” (not “last 3 months”)
  • ☐ Dashboard is accessible without admin/edit permissions to on-call engineers
  • ☐ Dashboard link is in the service README and runbook
  • ☐ All panel titles state what they measure (not just metric names)
  • ☐ No panels show data older than 30 days by default (causes chart compression)

Key takeaway: If a dashboard is hard to access, hard to read, or missing thresholds, it is not on-call ready no matter how many panels it has.
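
Parts of the checklist can be turned into an automated gate. The following sketch assumes a dashboard is described by a plain dictionary. Every key name (`default_window`, `viewable_by_on_call`, `linked_in_readme_and_runbook`, `kind`, `threshold_line`, `deploy_markers`, `window`) and the `1h` / `30d` window notation are hypothetical, and all panels are assumed to be time-series panels.

```python
import re
from datetime import timedelta

_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_window(text: str) -> timedelta:
    """Parse a window such as '1h' or '30d' (hypothetical notation)."""
    match = re.fullmatch(r"(\d+)([mhd])", text)
    if match is None:
        raise ValueError(f"unsupported window: {text!r}")
    return timedelta(**{_UNITS[match.group(2)]: int(match.group(1))})


def readiness_failures(dashboard: dict) -> list:
    """Return the checklist items this dashboard definition fails."""
    failures = []
    default = dashboard["default_window"]
    if parse_window(default) != timedelta(hours=1):
        failures.append("default time range is not 'last 1 hour'")
    if not dashboard.get("viewable_by_on_call"):
        failures.append("on-call engineers cannot view the dashboard")
    if not dashboard.get("linked_in_readme_and_runbook"):
        failures.append("link missing from README or runbook")
    for panel in dashboard["panels"]:
        title = panel["title"]
        # Thresholds are required on error rate and latency charts.
        if panel["kind"] in {"error_rate", "latency"} and not panel.get("threshold_line"):
            failures.append(f"{title}: alert threshold not drawn")
        # Assumes every panel is a time-series panel.
        if not panel.get("deploy_markers"):
            failures.append(f"{title}: deployment markers missing")
        # Panels without their own window inherit the dashboard default.
        if parse_window(panel.get("window", default)) > timedelta(days=30):
            failures.append(f"{title}: shows more than 30 days by default")
    return failures
```

Items such as "all panel titles state what they measure" are not covered here, since they need human judgment rather than a field check.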

Dashboard Anti-Patterns

Danger: ANTI-PATTERN

Putting 40+ panels on one dashboard.

No one reads them all. On-call engineers scan the top 6 panels. Move deep-dive metrics to a separate “Debug” dashboard and link to it from the holistic view.

Danger: ANTI-PATTERN

Panels without threshold lines.

A 4% error rate on a chart looks fine until you realize your SLO threshold is 1%. Always draw the threshold.
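
To make the arithmetic concrete, the sketch below classifies an error rate relative to its threshold, which is the same judgment a threshold line lets an engineer make at a glance. The `error_rate_status` function and the 80% "approaching" band are assumptions. The 1% default mirrors the SLO figure in the example above.

```python
def error_rate_status(errors: int, total: int,
                      threshold: float = 0.01,
                      warn_fraction: float = 0.8) -> str:
    """Classify an error rate relative to an alert threshold."""
    if total == 0:
        return "no traffic"
    rate = errors / total
    if rate >= threshold:
        return "breach"
    if rate >= warn_fraction * threshold:
        return "approaching"  # within 20% of the threshold
    return "ok"


# 40 errors in 1000 requests is 4%, four times a 1% threshold.
print(error_rate_status(40, 1000))
```

Without the threshold on the chart, the 4% line carries no indication that it is already in breach.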

Tip

Make the holistic dashboard the first one on-call engineers open. It should answer “is this service healthy right now?” without requiring navigation between multiple dashboards.

Final Takeaways

  • Build dashboards around questions responders need answered quickly.
  • Keep the holistic view small, obvious, and ordered by triage value.
  • Separate operational dashboards from BI and reporting use cases.
  • Standardize thresholds, deployment markers, and tool choices so dashboards behave predictably.
  • Remove panels that do not influence action during incidents.

Written by Balaji G
