🩺 PostgreSQL Diagnostics

Stop Guessing.
Start Diagnosing.

40 copy-paste SQL queries covering every PostgreSQL bottleneck — plus a 0–100 health score, a shareable HTML report, and optional AI analysis. Run them in psql, the CLI, or Docker. No agents. No install. Just SQL.

pgvitals.github.io/diagnostics
Query Behavior #01
Slow / Expensive Queries
WhatTop queries by total CPU time
Look formean_exec_ms > 100
ActionEXPLAIN ANALYZE top offenders
SQL Copy
SELECT round(total_exec_time, 2) AS total_ms,
    calls, round(mean_exec_time, 2) AS mean_ms,
    left(query, 120) AS query_snippet
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 25;

The vitals every database exposes

PostgreSQL keeps extraordinarily detailed telemetry in its system catalog. These nine signals predict 95% of production incidents — before they become incidents.

Query Performance
The most direct measure of database health. A query that ran in 5 ms last week and now takes 800 ms is telling you something changed — an index was dropped, a table grew past a plan flip threshold, stats went stale.
Ignored → slow UX, wasted CPU, cascading timeouts
Tracked via: pg_stat_statements
🗂
Index Health
Indexes that are never scanned waste write throughput on every INSERT/UPDATE/DELETE. Duplicate indexes compound the problem. Invalid indexes — left behind by a failed CREATE CONCURRENTLY — are invisible to the planner and invisible to the eye.
Ignored → 2–10× write amplification, missed scan optimizations
Tracked via: pg_stat_user_indexes, pg_index
🗄
Table Bloat & Vacuum
PostgreSQL's MVCC model never overwrites rows in place — every UPDATE creates a new row version. Autovacuum reclaims dead tuples, but it can fall behind under heavy write load. Once dead tuples exceed ~20% of a table, sequential scans slow down and the table's on-disk size balloons.
Ignored → wasted disk, slower scans, autovacuum runaway
Tracked via: pg_stat_user_tables, pg_class
🔌
Connection Pressure
PostgreSQL spawns a separate OS process per connection. Beyond max_connections every new connection attempt fails with a hard error. Unlike most resources, connection capacity is binary: at 99% you're fine; at 101% the application is down. Sessions stuck in idle in transaction silently hold locks and eat capacity without doing any work.
Ignored → application outage, lock accumulation
Tracked via: pg_stat_activity
🔒
Lock Contention
A single session holding a heavy lock — an ALTER TABLE, a long transaction, a forgotten cursor — can block every subsequent write to that table. PostgreSQL queues them up silently. In minutes you can go from one stuck session to 50 waiting sessions and a fully unresponsive application.
Ignored → cascading blockage, full table unavailability
Tracked via: pg_locks, pg_blocking_pids()
🔄
Replication Lag
Streaming replication keeps standbys in sync by shipping WAL. When a standby falls behind, stale reads go unnoticed and your failover RTO grows with every second of lag. Replication slots are even riskier — a paused consumer holds WAL on disk indefinitely, eventually filling the volume.
Ignored → stale reads, disk fill, high-RTO failover
Tracked via: pg_stat_replication, pg_replication_slots
⚠️
XID Wraparound
Every transaction in PostgreSQL consumes a 32-bit transaction ID. The counter wraps at ~2 billion. If a database reaches the safety limit without running VACUUM FREEZE, PostgreSQL performs an emergency shutdown to protect data integrity — all connections are refused until the database is brought up in single-user mode. This is the only self-inflicted PostgreSQL failure mode with no fast recovery path.
Ignored → forced database shutdown, manual recovery only
Tracked via: age(datfrozenxid) in pg_database
💾
Buffer Cache Hit Ratio
PostgreSQL's shared_buffers cache is the fastest path to data. A cache hit costs microseconds; a cache miss hits disk and costs milliseconds. A healthy OLTP database should be above 95–99%. A sustained drop signals that your working set has grown past your buffer pool, or a runaway query is evicting hot pages.
Ignored → disk I/O bottleneck, 100–1000× query slowdown
Tracked via: pg_stat_database blks_hit / blks_read
📍
Checkpoint Pressure
Checkpoints flush dirty pages from shared buffers to disk on a schedule. When writes outpace the schedule, PostgreSQL forces extra checkpoints — causing I/O spikes that appear as latency spikes in applications. Backends writing directly to disk (buffers_backend) bypass the checkpoint process entirely and indicate a severely undersized buffer pool.
Ignored → periodic I/O storms, unpredictable p99 latency
Tracked via: pg_stat_checkpointer (PG17+), pg_stat_bgwriter
40
Diagnostic sections
0
Dependencies to install
PG14+
PostgreSQL compatible
MIT
Open source license

Guided Symptom Triage

Choose a common symptom below to view a step-by-step diagnostic path with copy-pasteable SQL. Powered by pgvitals diagnostics.

🐌

Queries suddenly slow

Response times jumped; users reporting slowness...

Every bottleneck, one place.

8 diagnostic areas, 40 sections, each with a clear threshold and action step so you always know what to do next.

Sections 01–05
Query Behavior
Slow queries, temp file spill, sequential scan storms, N+1 patterns, JIT overhead
🗂
Sections 06–10
Index Health
Unused, duplicate, invalid, and missing FK indexes — plus index bloat estimation
🗄
Sections 11–14, 34
Tables & Storage
Table and TOAST bloat, size rankings, heap vs index access patterns, partition health
🧹
Sections 15–18
Vacuum & Statistics
Autovacuum progress, dead tuple backlog, stale stats, long-running transactions blocking vacuum
🔗
Sections 19–22
Connections & Locks
Connection saturation, idle-in-transaction, full lock wait trees, wait event breakdown
🔄
Sections 23–25
Replication
Streaming lag per standby, logical slot consumer lag, WAL retention by all slots
⚠️
Sections 26–28, 35
Risk Signals
XID and MultiXact wraparound risk, sequence exhaustion, prepared transactions — the silent killers
⚙️
Sections 29–32
Config & Health
Key GUC settings review, buffer cache hit ratio, checkpoint pressure, DB summary

Many ways to use pgvitals.

Pick the workflow that fits — a production incident, a routine health check, a load test, a shareable report, or AI-assisted triage.

1
Ad-hoc investigation
Browse to the area matching your symptom — slow queries, locks, bloat. Open the section, read the threshold, run the SQL.
psql -d mydb -f sql/21_lock_wait_tree.sql
2
Full diagnostic sweep
Run all 40 sections in sequence before a release or after a migration. The master.sql file combines everything.
psql -d mydb -f master.sql
3
Health score
Get a single 0–100 score with a per-check breakdown — XID risk, dead tuples, connections, lock waits, cache hit ratio, and more.
psql -d mydb -f health_score.sql
4
Load test monitoring
Use the snapshot framework to capture metrics at intervals during a stress test, then diff baseline vs peak.
SELECT perf_monitor
  .capture_snapshot('peak');
5
Zero-install Docker
No Python, no psql setup — run the full diagnostic runner from a container against any reachable database.
docker run --rm ghcr.io
  /pgvitals/pgvitals --host db -d prod
6
Shareable HTML + AI
Generate a self-contained HTML report with a health gauge — and add --explain for a prioritized, plain-English fix list from Claude.
pgvitals -d prod --format html
  --explain -o report.html

Built-in snapshot framework.

Capture a point-in-time snapshot of all metrics with one function call. Compare baseline vs peak. Find what changed.

1
Set up once
Run schema.sql and capture_snapshot.sql to create the monitoring schema and function.
2
Capture labelled snapshots
Call capture_snapshot('baseline') before the test, 'peak' at max load, 'cooldown' after.
3
Analyse the delta
Run trend_queries.sql to see connection growth, lock wait spikes, dead tuple accumulation, cache hit ratio trends.
psql — mydb
-- setup (run once)
mydb=# \i monitoring/schema.sql
mydb=# \i monitoring/capture_snapshot.sql
 
-- before load test
mydb=# SELECT perf_monitor
  .capture_snapshot('baseline');
 capture_snapshot
─────────────────
 1
 
-- at peak load
mydb=# SELECT perf_monitor
  .capture_snapshot('peak');
 capture_snapshot
─────────────────
 2
 
-- analyse trends
mydb=# \i monitoring/trend_queries.sql

Diagnostic Runner & Live Reports

pgvitals includes a Python-based diagnostic runner that connects to any live database, runs all 40 sections, and generates a Markdown or self-contained HTML report — with an optional AI analysis that prioritizes the fixes.

pip install pgvitals
Install the pgvitals CLI with pip. Downloads the SQL library once with init, then run full diagnostics against any database — no config files needed for a quick check.
$ pip install pgvitals
$ pgvitals init# download SQL files once

# Run full diagnostics against any database
$ pgvitals --host db.example.com --database prod

# Shareable HTML report + AI analysis
$ pgvitals -d prod --format html --explain -o report.html
Layered Configuration System
Configure the runner to suit your deployment. Parameters are evaluated in sequence: default config → local pgvitals.conf → named connection profiles → environment variables (like PGPASSWORD) → command-line flags.

🩺 Diagnostic Report

Database: pfmegrnargs on hh-pgsql-public.ebi.ac.uk
PostgreSQL Version: 16.11 (RNAcentral DB)
Connected As: reader
✅ Clear
8
📊 Data
19
⚠️ Error
10
Sample Executive Summary
# Section Status Findings
01 Slow Queries Data 5 rows returned
06 Unused Indexes Data 36 rows returned
08 Invalid Indexes Clear No issues
15 Autovacuum Activity Error Permission Denied
Key Findings Highlighted
🔴 36 Duplicate Indexes detected on active tables, consuming write capacity with no query benefit.
🔴 30 Missing Foreign Key Indexes leading to potential full sequential scan storms on joins.
🟡 Index Bloat up to 78% on primary data tables like rnc_reference_map.
View full live example report → RNAcentral · PostgreSQL 16 · 638 GB

Your database is telling you something.

40 sections. Every bottleneck covered. Copy, paste, act.