Capacity Planning for Microservices

Capacity planning is where observability meets engineering economics. Undersized services fail under stress, while oversized services hide inefficiencies and inflate cost. The goal is not maximum headroom everywhere. The goal is to know how your system behaves under load and size it deliberately.

These are starting-point defaults. Always validate and tune them with performance testing against your specific workload, because every service has a different memory, CPU, and connection profile.

Quick Definitions

Request: the minimum CPU or memory Kubernetes reserves for a pod
Limit: the maximum CPU or memory a pod is allowed to consume
P99 latency: the response time within which 99% of requests complete; only the slowest 1% take longer
Connection pool: a managed set of reusable database or cache connections shared by application threads
Headroom: spare capacity available to absorb spikes without immediate failure
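
To make the P99 definition concrete, here is a minimal sketch that computes a percentile from raw latency samples using the nearest-rank method. The function name and the sample data are hypothetical, and monitoring tools often interpolate or use histograms instead, so their numbers may differ slightly.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct% of all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical load-test samples: 1 ms, 2 ms, ..., 100 ms
latencies_ms = list(range(1, 101))
print(percentile(latencies_ms, 99))  # prints 99
```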

Why Get Capacity Right

Under-provisioned services cause cascading failures: one slow pod can bring down the entire request path. Over-provisioned services waste money and mask problems (a service shouldn’t need 8 GB of RAM for CRUD operations).

The goal is a right-sized service: enough headroom to handle traffic spikes without burning cloud budget.

Key takeaway: Capacity planning is a balancing exercise between resilience, performance, and cost.

Capacity Planning Workflow

			
BASELINE   → Run load test at expected peak traffic
MEASURE    → Record CPU%, memory%, connection pool usage, GC pause time
TUNE       → Adjust limits until p99 latency is stable and CPU < 60%
VALIDATE   → Re-run load test at 2× peak (test the headroom)
DEPLOY     → Set requests/limits based on validated numbers
MONITOR    → Watch for drift over the first two weeks post-deploy

		

Key takeaway: Sizing should come from measurement and retesting, not from copying values blindly between services.

Kubernetes (Infra)

Kubernetes settings determine whether pods can start reliably, scale safely, and survive real traffic instead of only synthetic happy paths.

Config	Production	Why
Minimum pods	3	Handles loss of 1 pod without impact; enables zero-downtime rolling deploys
Maximum pods	10 (tune per requirements)	Sets upper bound on cost and connection pool usage
`resources.requests.memory`	2 GB	Scheduler uses this for pod placement; set to typical working set
`resources.requests.cpu`	1 CPU	Scheduler uses this; set to steady-state usage, not peak
`resources.limits.memory`	3 GB	OOM kill threshold; 1.5× requests gives headroom for spikes
`resources.limits.cpu`	2 CPU	CPU is throttled (not killed) at this limit; allow 2× for burst
Autoscale CPU threshold	70%	Add pods at 70% CPU. Don’t wait until 80%, because scaling takes 1–2 min
Autoscale Memory threshold	90%	Memory autoscale is a safety net; ideally tune requests to avoid hitting it
JVM `-Xms`	1.5 GB	Pre-allocate heap to avoid GC pressure during warm-up
JVM `-Xmx`	1.5 GB	Same as Xms to prevent heap resizing pauses in production
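
The pod bounds and the CPU threshold in the table interact through the autoscaler. The sketch below applies the replica formula documented for the Kubernetes Horizontal Pod Autoscaler, ceil(currentReplicas × currentMetric / targetMetric), and clamps the result to the minimum and maximum pods above. It is a simplification: the real controller also applies a tolerance band and stabilization windows, which are ignored here, and the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas, current_cpu_pct,
                     target_cpu_pct=70, min_pods=3, max_pods=10):
    """Approximate HPA decision for a CPU utilization target.

    Defaults mirror the table above (70% threshold, 3 to 10 pods).
    Tolerance and stabilization behaviour are deliberately left out.
    """
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_pods, min(max_pods, raw))


print(desired_replicas(3, 90))    # 3 pods at 90% CPU -> 4
print(desired_replicas(8, 140))   # would need 16, capped at max_pods -> 10
```

The cap in the second call is why the maximum pod count matters for cost and for downstream connection limits: once it is reached, extra load shows up as latency rather than as more pods.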

Why requests ≠ limits?

requests = what the scheduler reserves. limits = the hard cap.

Setting them equal eliminates bursting: under sudden load, the pod’s CPU is throttled at its request value, which is also its limit. Setting limits to 1.5–2× requests lets the pod absorb bursts.
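
As a small worked example of the burst rule, the sketch below derives limits from requests using the multipliers in the table (2× for CPU, 1.5× for memory). The function name and the dictionary keys are hypothetical; the multipliers are starting points to validate under load, not fixed rules.

```python
def derive_limits(request_cpu, request_memory_gb,
                  cpu_burst=2.0, memory_burst=1.5):
    """Derive resource limits from requests using burst multipliers."""
    if cpu_burst < 1 or memory_burst < 1:
        raise ValueError("limits must not be lower than requests")
    return {
        "cpu": request_cpu * cpu_burst,
        "memory_gb": request_memory_gb * memory_burst,
    }


# Requests from the table: 1 CPU and 2 GB
print(derive_limits(1, 2))  # {'cpu': 2.0, 'memory_gb': 3.0}
```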

Tip: JVM Memory Alternative

If -Xms/-Xmx are not set, use container-aware flags instead:
-XX:+UseContainerSupport -XX:MaxRAMPercentage=70.0
This caps the maximum heap at 70% of the container memory limit, so the JVM automatically respects Kubernetes limits.
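
To see what the percentage means in practice, here is a rough estimate of the heap ceiling for the 3 GB memory limit from the table. The function name is hypothetical, and the JVM’s own rounding and alignment may produce a slightly different figure, so treat the result as an approximation.

```python
def heap_budget_mib(container_limit_mib, max_ram_percentage=70.0):
    """Estimate the max heap and the remainder left for non-heap memory
    (metaspace, thread stacks, direct buffers) inside the container."""
    if not 0 < max_ram_percentage <= 100:
        raise ValueError("max_ram_percentage must be in (0, 100]")
    heap = int(container_limit_mib * max_ram_percentage / 100)
    return heap, container_limit_mib - heap


heap, non_heap = heap_budget_mib(3072)  # 3 GB limit expressed in MiB
print(heap, non_heap)  # 2150 922
```

With these inputs the heap ceiling is noticeably higher than the fixed 1.5 GB `-Xmx` in the table, which leaves less room for non-heap memory. Either approach can work, but the remainder should be checked against actual non-heap usage during load testing.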

Warning

Keep Xms = Xmx in production. Letting the JVM start with a small heap and grow to Xmx causes GC pressure during ramp-up (your first traffic spike after a deploy will trigger Full GCs).

Common Pitfall

Teams often copy memory and CPU limits from another service because the tech stack is similar. That usually ignores actual differences in payload size, concurrency, caching behaviour, and background processing.

Key takeaway: Kubernetes defaults are starting points only. Validate them under realistic concurrency and traffic mix.

Postgres

Database capacity is often the real ceiling for a service, even when pods still look healthy.

Config	Production	Notes
CPU	4–8 cores	4 cores handles most OLTP workloads; 8 cores for high write throughput
Memory	8 GB – 16 GB	More RAM = larger shared_buffers = fewer disk reads
SSD	300 GB	Includes data, WAL, and temporary sort files
`max_connections`	300 (based on RAM)	Each connection uses ~5–10 MB RAM. Don’t exceed 500 without pgBouncer

Why Not Set max_connections Higher?

Each Postgres connection holds memory for its query context, sort buffers, and transaction state. Setting max_connections = 1000 without a connection pooler means 1000 × 10 MB = 10 GB of RAM just for connections, leaving nothing for actual query execution.
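
The arithmetic above generalizes into a quick worst-case estimate. The sketch below uses the ~5–10 MB per-connection figure quoted in this section; actual usage depends on settings such as work_mem and on the queries being run, so the numbers are planning estimates only. The function name is hypothetical.

```python
def connection_memory_mb(max_connections, mb_per_connection):
    """Worst-case RAM consumed if every allowed connection is in use."""
    return max_connections * mb_per_connection


# 1000 connections at 10 MB each: about 10 GB, as in the text
print(connection_memory_mb(1000, 10))  # 10000

# The table default of 300 connections at 5 to 10 MB each
print(connection_memory_mb(300, 5), connection_memory_mb(300, 10))  # 1500 3000
```

On an 8 GB instance, 300 fully used connections could therefore take roughly 1.5–3 GB, which is a useful sanity check before raising the setting.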

Use HikariCP (application-level pooling) and set max_connections conservatively.

Use PGTune to derive Postgres settings such as shared_buffers and effective_cache_size from the instance’s actual CPU and memory. Do not rely on the default configuration.

Real-World Scenario

A team may respond to “too many connections” errors by simply increasing max_connections. That can postpone the symptom briefly while making the database slower overall, because RAM is now tied up by idle and competing sessions.

Key takeaway: More connections are not automatically more capacity. Database throughput often improves when connection counts are controlled.

HikariCP Connection Pool

Application pool sizing must reflect both database limits and the maximum number of pods that can exist at the same time.

Property	Value	Notes
`connection-timeout`	20,000 ms	Time a thread waits for a connection. Requests fail after this with `ConnectionTimeout`
`minimum-idle`	5–10	Connections kept warm. Low = slower first requests after idle period
`maximum-pool-size`	10–20	See formula below
`idle-timeout`	10,000 ms	Connections idle longer than this are retired from the pool
`max-lifetime`	1,800,000 ms	Max connection age. Retire before Postgres `tcp_keepalives_idle` to avoid stale connections
`keepaliveTime`	30,000 ms	Sends lightweight queries to keep connections alive through load balancers

Pool Size Formula

			
maximum-pool-size = floor(postgres_max_connections / max_pods)
Example:
  postgres max_connections = 600
  max_pods = 40
  maximum-pool-size = floor(600 / 40) = 15

		

Warning

Always leave 10–20 connections reserved for admin operations (pg_dump, direct psql access, monitoring queries). Don’t assign 100% of max_connections to the pool.

Adjusted formula:
maximum-pool-size = floor((postgres_max_connections - 20) / max_pods)
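
The adjusted formula can be sketched as a small helper. The function name is hypothetical, and the reserve of 20 is the upper end of the 10–20 range suggested above; the guard against a pool size below one is an added assumption, not part of the original formula.

```python
def max_pool_size(postgres_max_connections, max_pods, reserved=20):
    """Per-pod HikariCP maximum-pool-size that keeps the fleet within
    Postgres max_connections, minus a reserve for admin access."""
    if max_pods <= 0:
        raise ValueError("max_pods must be positive")
    usable = postgres_max_connections - reserved
    if usable < max_pods:
        raise ValueError("not enough connections for one per pod")
    return usable // max_pods  # floor division


print(max_pool_size(600, 40))              # (600 - 20) // 40 -> 14
print(max_pool_size(600, 40, reserved=0))  # unadjusted formula -> 15
```

Note that reserving admin connections lowers the example pool size from 15 to 14. The calculation must use the maximum pod count, not the current one, because autoscaling can reach that ceiling during exactly the traffic spikes when the database is busiest.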

Common Pitfall

If each pod is given an oversized pool “just in case,” autoscaling can multiply that decision into hundreds of possible connections that the database cannot actually support.

Key takeaway: Pool sizing is a system-wide calculation, not a per-service guess.

Redis Connection Pool

Redis is fast, but poor timeout and pool settings can still turn it into a bottleneck under concurrency.

Property	Value	Notes
`connectTimeout`	5,000 ms	Time to establish TCP connection to Redis
`soTimeout`	1,000 ms	Time to wait for a response after sending a command
`redis-connection-max-total`	15	For 40 pods max → 600 total connections
`redis-connection-max-idle`	15	Max idle connections to keep in pool
`redis-connection-min-idle`	10	Connections kept warm; avoids connection ramp-up under load
`maxWaitMillis`	1,000 ms	Time to wait for pool to return a connection. Should equal `soTimeout`
`min-evictable-idle-time-millis`	60 sec – 30 min	Higher value = lower connection ramp-up. Match to your traffic pattern
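
The 15 × 40 = 600 figure in the table can be checked against the server-side client limit with a sketch like the one below. The function name is hypothetical. The default of 10,000 is the stock Redis `maxclients` value; managed or tuned deployments may use a different limit, so read the real value from your server rather than assuming it.

```python
def redis_client_budget(max_total_per_pod, max_pods, maxclients=10_000):
    """Return the worst-case client count across all pods and whether
    it fits under the Redis maxclients setting.

    maxclients defaults to the stock Redis value (an assumption here;
    verify it on the actual server).
    """
    total = max_total_per_pod * max_pods
    return total, total <= maxclients


print(redis_client_budget(15, 40))                  # (600, True)
print(redis_client_budget(15, 40, maxclients=500))  # (600, False)
```

Other services sharing the same Redis instance count against the same limit, so their worst-case totals should be added before comparing.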

Key takeaway: Redis pool settings should absorb burst traffic without creating more client pressure than the cache can handle.

Signs Your Capacity Is Wrong

These symptoms are often easier to observe than the original sizing mistake, so they are useful during tuning and early production monitoring.

Symptom	Likely Cause	Fix
High GC pause time (>100ms)	Heap too small or Xms < Xmx	Increase Xmx or set Xms = Xmx
`ConnectionTimeout` errors	Pool exhausted	Increase `maximum-pool-size`, or reduce the share of `max_connections` consumed by other services
Pod OOMKilled	Memory limit too low	Increase `resources.limits.memory` or fix memory leak
HPA oscillating (scaling up and down rapidly)	Autoscale threshold too sensitive or cooldown too short	Raise CPU threshold to 70%, set stabilization window to 5 min
Postgres `too many connections`	Connection pool misconfigured	Verify formula; add pgBouncer if > 300 connections needed
Redis `ERR max number of clients reached`	Redis `maxclients` too low or pool too large	Check `redis-connection-max-total × max_pods` vs Redis `maxclients` config

Warning

After any capacity change, monitor connection pool metrics (idle, pending, active) for at least one full traffic cycle before considering the sizing stable.

Final Takeaways

Start with defaults, but treat them as hypotheses to validate with load testing.
Size Kubernetes pods, databases, and connection pools together rather than independently.
Leave headroom for spikes, admin access, and autoscaling side effects.
Watch for second-order symptoms such as GC pauses, queueing, and connection exhaustion.
Recheck capacity after major feature, traffic, or workload shape changes.

Written by

Balaji G
