Postgres Volume IO severely degraded — recurring production outages

Question

Hey Team,

I have recurring outages in my Postgres instance: 93cef097-29d1-4570-a5cf-3fbe32f016e0

This causes my Django app to crash and restart constantly. It has been happening for the last two days, with crashes at random intervals of roughly half an hour.

I analyzed the problem with Claude Code and Codex because I don't understand anything about infrastructure.

This is the report Claude Code generated:

**Setup:** PostgreSQL 16, EU West (Amsterdam), 80 GB Volume, container running since Jan 7. 32 vCPU / 32 GB RAM limit, but actual usage is ~2–15 GB RAM / ~0% CPU — resources are not the bottleneck.

**Problem:** Recurring disk IO saturation on the attached Volume. Two incidents in the last hour alone:

**Incident 1:**
- Single-row UPDATEs (by primary key) wait **30–60+ seconds** on `DataFileRead`
- WAL writes stall for **20–45 seconds**
- Checkpoint total time up to **297 seconds**, single fsync: **54.5 seconds**
- Railway's own DataUI times out on `SELECT table_name FROM information_schema.tables`
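
For anyone who wants to double-check the checkpoint figures, here is a minimal sketch of how they could be pulled out of the Postgres logs. It assumes `log_checkpoints` is enabled and that the log lines follow the usual `checkpoint complete: ... write=... s, sync=... s, total=... s; sync files=..., longest=... s` shape. The function names and the thresholds are hypothetical and chosen only for illustration.

```python
import re

# Timing fields of a "checkpoint complete" line, e.g. "total=1.6 s" or
# "longest=0.2 s". The "sync files=5" counter has no "=" directly after
# "sync", so it is not picked up.
_FIELD = re.compile(r"\b(write|sync|total|longest|average)=([0-9.]+) s")


def parse_checkpoint_line(line):
    """Return the timing fields (in seconds) of a checkpoint line, else None."""
    if "checkpoint complete" not in line:
        return None
    return {name: float(value) for name, value in _FIELD.findall(line)}


def slow_checkpoints(lines, max_total_s=60.0, max_fsync_s=5.0):
    """Yield-free helper: list checkpoints exceeding either threshold.

    Both thresholds are arbitrary assumptions, not Postgres defaults.
    """
    slow = []
    for line in lines:
        fields = parse_checkpoint_line(line)
        if fields is None:
            continue
        if (fields.get("total", 0.0) > max_total_s
                or fields.get("longest", 0.0) > max_fsync_s):
            slow.append(fields)
    return slow
```

Feeding the instance's log lines into `slow_checkpoints()` should surface entries like the 297 second checkpoint and the 54.5 second fsync mentioned above, if they are present in the logs.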

**Incident 2:**
- Simple `SELECT ... WHERE email = '...' LIMIT 1` (indexed column) stuck for **63+ seconds** on `DataFileRead`
- Up to **9 processes** blocked simultaneously waiting on disk
- Gunicorn worker killed due to timeout while running `handle_bounce()` in SES webhook processing
- Lock monitor completely empty — **zero lock contention**, pure IO stall
- Resolved itself after ~64 seconds (IO freed up)
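
The wait-event observations in both incidents can be reproduced with a small sampler along these lines. This is a sketch under assumptions, not the exact method used for the report: it queries the standard `pg_stat_activity` view through any DB-API connection, and the helper names and the 5 second cutoff are hypothetical.

```python
from collections import Counter

# Active backends and what each one is currently waiting on.
ACTIVE_WAITS_SQL = """
SELECT pid,
       wait_event_type,
       wait_event,
       EXTRACT(EPOCH FROM (now() - query_start)) AS running_s,
       left(query, 80) AS query_head
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()
"""


def fetch_active_waits(conn):
    """Run the query on a DB-API connection and return all rows."""
    cur = conn.cursor()
    try:
        cur.execute(ACTIVE_WAITS_SQL)
        return cur.fetchall()
    finally:
        cur.close()


def summarize_waits(rows, slow_after_s=5.0):
    """Count wait events among queries running longer than slow_after_s."""
    counts = Counter()
    for _pid, wait_type, wait_event, running_s, _query in rows:
        if running_s is None or float(running_s) < slow_after_s:
            continue
        # wait_event_type is NULL when the backend is not waiting on anything.
        key = f"{wait_type}:{wait_event}" if wait_type else "not waiting"
        counts[key] += 1
    return counts
```

In a Django shell, `django.db.connection` should work as `conn`. During an incident, a result dominated by `IO:DataFileRead` with no `Lock:` entries would match the pattern described above.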

**Impact:** Workers time out and crash repeatedly. After a crash, the shared_buffers cache is cold, so everything must be read from the degraded disk, which results in a crash loop.

**What we ruled out:** Not a query issue (single-row PK/indexed lookups), not missing indices (verified), not lock contention (blocking monitor empty), not CPU/RAM (both near-idle). Railway's DataUI also failing confirms it's below the application layer.

Again:

I don't know much about infra (this is why I'm on Railway, btw).

What do I have to do to get this running smoothly again?