Name: pgvitals
Author: pgvitals

Why monitor PostgreSQL

The vitals every database exposes

PostgreSQL keeps extraordinarily detailed telemetry in its system catalog. These nine signals predict 95% of production incidents — before they become incidents.

⚡

Query Performance

The most direct measure of database health. A query that ran in 5 ms last week and now takes 800 ms is telling you something changed — an index was dropped, a table grew past a plan flip threshold, stats went stale.

Ignored → slow UX, wasted CPU, cascading timeouts

Tracked via: pg_stat_statements

🗂

Index Health

Indexes that are never scanned waste write throughput on every INSERT/UPDATE/DELETE. Duplicate indexes compound the problem. Invalid indexes — left behind by a failed CREATE CONCURRENTLY — are invisible to the planner and invisible to the eye.

Ignored → 2–10× write amplification, missed scan optimizations

Tracked via: pg_stat_user_indexes, pg_index

🗄

Table Bloat & Vacuum

PostgreSQL's MVCC model never overwrites rows in place — every UPDATE creates a new row version. Autovacuum reclaims dead tuples, but it can fall behind under heavy write load. Once dead tuples exceed ~20% of a table, sequential scans slow down and the table's on-disk size balloons.

Ignored → wasted disk, slower scans, autovacuum runaway

Tracked via: pg_stat_user_tables, pg_class

🔌

Connection Pressure

PostgreSQL spawns a separate OS process per connection. Beyond max_connections every new connection attempt fails with a hard error. Unlike most resources, connection capacity is binary: at 99% you're fine; at 101% the application is down. Sessions stuck in idle in transaction silently hold locks and eat capacity without doing any work.

Ignored → application outage, lock accumulation

Tracked via: pg_stat_activity

🔒

Lock Contention

A single session holding a heavy lock — an ALTER TABLE, a long transaction, a forgotten cursor — can block every subsequent write to that table. PostgreSQL queues them up silently. In minutes you can go from one stuck session to 50 waiting sessions and a fully unresponsive application.

Ignored → cascading blockage, full table unavailability

Tracked via: pg_locks, pg_blocking_pids()

🔄

Replication Lag

Streaming replication keeps standbys in sync by shipping WAL. When a standby falls behind, stale reads go unnoticed and your failover RTO grows with every second of lag. Replication slots are even riskier — a paused consumer holds WAL on disk indefinitely, eventually filling the volume.

Ignored → stale reads, disk fill, high-RTO failover

Tracked via: pg_stat_replication, pg_replication_slots

⚠️

XID Wraparound

Every transaction in PostgreSQL consumes a 32-bit transaction ID. The counter wraps at ~2 billion. If a database reaches the safety limit without running VACUUM FREEZE, PostgreSQL performs an emergency shutdown to protect data integrity — all connections are refused until the database is brought up in single-user mode. This is the only self-inflicted PostgreSQL failure mode with no fast recovery path.

Ignored → forced database shutdown, manual recovery only

Tracked via: age(datfrozenxid) in pg_database

💾

Buffer Cache Hit Ratio

PostgreSQL's shared_buffers cache is the fastest path to data. A cache hit costs microseconds; a cache miss hits disk and costs milliseconds. A healthy OLTP database should be above 95–99%. A sustained drop signals that your working set has grown past your buffer pool, or a runaway query is evicting hot pages.

Ignored → disk I/O bottleneck, 100–1000× query slowdown

Tracked via: pg_stat_database blks_hit / blks_read

📍

Checkpoint Pressure

Checkpoints flush dirty pages from shared buffers to disk on a schedule. When writes outpace the schedule, PostgreSQL forces extra checkpoints — causing I/O spikes that appear as latency spikes in applications. Backends writing directly to disk (buffers_backend) bypass the checkpoint process entirely and indicate a severely undersized buffer pool.

Ignored → periodic I/O storms, unpredictable p99 latency

Tracked via: pg_stat_checkpointer (PG17+), pg_stat_bgwriter

Triage

Guided Symptom Triage

Choose a common symptom below to view a step-by-step diagnostic path with copy-pasteable SQL. Powered by pgvitals diagnostics.

🐌

Queries suddenly slow

Response times jumped; users reporting slowness...

Coverage

Every bottleneck, one place.

8 diagnostic areas, 40 sections, each with a clear threshold and action step so you always know what to do next.

⚡

Sections 01–05

Query Behavior

Slow queries, temp file spill, sequential scan storms, N+1 patterns, JIT overhead

🗂

Sections 06–10

Index Health

Unused, duplicate, invalid, and missing FK indexes — plus index bloat estimation

🗄

Sections 11–14, 34

Tables & Storage

Table and TOAST bloat, size rankings, heap vs index access patterns, partition health

🧹

Sections 15–18

Vacuum & Statistics

Autovacuum progress, dead tuple backlog, stale stats, long-running transactions blocking vacuum

🔗

Sections 19–22

Connections & Locks

Connection saturation, idle-in-transaction, full lock wait trees, wait event breakdown

🔄

Sections 23–25

Replication

Streaming lag per standby, logical slot consumer lag, WAL retention by all slots

⚠️

Sections 26–28, 35

Risk Signals

XID and MultiXact wraparound risk, sequence exhaustion, prepared transactions — the silent killers

⚙️

Sections 29–32

Config & Health

Key GUC settings review, buffer cache hit ratio, checkpoint pressure, DB summary

Usage

Many ways to use pgvitals.

Pick the workflow that fits — a production incident, a routine health check, a load test, a shareable report, or AI-assisted triage.

Ad-hoc investigation

Browse to the area matching your symptom — slow queries, locks, bloat. Open the section, read the threshold, run the SQL.

psql -d mydb -f sql/21_lock_wait_tree.sql

Full diagnostic sweep

Run all 40 sections in sequence before a release or after a migration. The master.sql file combines everything.

psql -d mydb -f master.sql

Health score

Get a single 0–100 score with a per-check breakdown — XID risk, dead tuples, connections, lock waits, cache hit ratio, and more.

psql -d mydb -f health_score.sql

Load test monitoring

Use the snapshot framework to capture metrics at intervals during a stress test, then diff baseline vs peak.

SELECT perf_monitor
.capture_snapshot('peak');

Zero-install Docker

No Python, no psql setup — run the full diagnostic runner from a container against any reachable database.

docker run --rm ghcr.io
/pgvitals/pgvitals --host db -d prod

Shareable HTML + AI

Generate a self-contained HTML report with a health gauge — and add --explain for a prioritized, plain-English fix list from Claude.

pgvitals -d prod --format html
--explain -o report.html

Load Testing

Built-in snapshot framework.

Capture a point-in-time snapshot of all metrics with one function call. Compare baseline vs peak. Find what changed.

Set up once

Run schema.sql and capture_snapshot.sql to create the monitoring schema and function.

Capture labelled snapshots

Call capture_snapshot('baseline') before the test, 'peak' at max load, 'cooldown' after.

Analyse the delta

Run trend_queries.sql to see connection growth, lock wait spikes, dead tuple accumulation, cache hit ratio trends.

psql — mydb

-- setup (run once)

mydb=# \i monitoring/schema.sql

mydb=# \i monitoring/capture_snapshot.sql

-- before load test

mydb=# SELECT perf_monitor

.capture_snapshot('baseline');

capture_snapshot

─────────────────

-- at peak load

mydb=# SELECT perf_monitor

.capture_snapshot('peak');

capture_snapshot

─────────────────

-- analyse trends

mydb=# \i monitoring/trend_queries.sql

Automation

Diagnostic Runner & Live Reports

pgvitals includes a Python-based diagnostic runner that connects to any live database, runs all 40 sections, and generates a Markdown or self-contained HTML report — with an optional AI analysis that prioritizes the fixes.

pip install pgvitals

Install the pgvitals CLI with pip. Downloads the SQL library once with init, then run full diagnostics against any database — no config files needed for a quick check.

$ pip install pgvitals
$ pgvitals init# download SQL files once

# Run full diagnostics against any database
$ pgvitals --host db.example.com --database prod

# Shareable HTML report + AI analysis
$ pgvitals -d prod --format html --explain -o report.html

Layered Configuration System

Configure the runner to suit your deployment. Parameters are evaluated in sequence: default config → local pgvitals.conf → named connection profiles → environment variables (like PGPASSWORD) → command-line flags.

🩺 Diagnostic Report

Database: pfmegrnargs on hh-pgsql-public.ebi.ac.uk
PostgreSQL Version: 16.11 (RNAcentral DB)
Connected As: reader

✅ Clear

📊 Data

⚠️ Error

Sample Executive Summary

#	Section	Status	Findings
01	Slow Queries	Data	5 rows returned
06	Unused Indexes	Data	36 rows returned
08	Invalid Indexes	Clear	No issues
15	Autovacuum Activity	Error	Permission Denied

Key Findings Highlighted

🔴 36 Duplicate Indexes detected on active tables, consuming write capacity with no query benefit.

🔴 30 Missing Foreign Key Indexes leading to potential full sequential scan storms on joins.

🟡 Index Bloat up to 78% on primary data tables like rnc_reference_map.

View full live example report → RNAcentral · PostgreSQL 16 · 638 GB

Stop Guessing.
Start Diagnosing.

The vitals every database exposes

Guided Symptom Triage

Queries suddenly slow

Every bottleneck, one place.

Many ways to use pgvitals.

Built-in snapshot framework.

Diagnostic Runner & Live Reports

🩺 Diagnostic Report

Your database is telling you something.

Stop Guessing.Start Diagnosing.

The vitals every database exposes

Guided Symptom Triage

Queries suddenly slow

Every bottleneck, one place.

Many ways to use pgvitals.

Built-in snapshot framework.

Diagnostic Runner & Live Reports

🩺 Diagnostic Report

Your database is telling you something.

Stop Guessing.
Start Diagnosing.