2 months ago
Hey Team,
I have recurring outages in my Postgres instance: 93cef097-29d1-4570-a5cf-3fbe32f016e0
This causes my Django app to crash and restart constantly. It has been happening for the last two days, at random, roughly every half hour...
I analyzed the problem with Claude Code and Codex because I don't understand anything about infrastructure.
This is the report Claude Code generated:
Setup: PostgreSQL 16, EU West (Amsterdam), 80 GB Volume, container running since Jan 7. 32 vCPU / 32 GB RAM limit, but actual usage is ~2–15 GB RAM / ~0% CPU — resources are not the bottleneck.
Problem: Recurring disk IO saturation on the attached Volume. Two incidents in the last hour alone:
Incident 1:
- Single-row UPDATEs (by primary key) wait 30–60+ seconds on DataFileRead
- WAL writes stall for 20–45 seconds
- Checkpoint total time up to 297 seconds, single fsync: 54.5 seconds
- Railway's own DataUI times out on SELECT table_name FROM information_schema.tables
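In case it helps whoever looks at this, here is a minimal Python sketch for cross-checking the checkpoint timings against the database's own cumulative statistics. It assumes PostgreSQL 16, where the pg_stat_bgwriter view still carries the checkpoint counters (checkpoint_write_time and checkpoint_sync_time are in milliseconds), and any DB-API cursor. The function names are made up for illustration. Note that these are averages since the last stats reset, so they will not show a single 297-second checkpoint directly.

```python
# Sketch only: average checkpoint cost from pg_stat_bgwriter (PostgreSQL 16 and earlier).
# CHECKPOINT_SQL, summarize_checkpoints and fetch_checkpoint_summary are hypothetical names.

CHECKPOINT_SQL = """
SELECT checkpoints_timed, checkpoints_req,
       checkpoint_write_time, checkpoint_sync_time
FROM pg_stat_bgwriter
"""


def summarize_checkpoints(timed, requested, write_ms, sync_ms):
    """Turn the cumulative counters into per-checkpoint averages in seconds."""
    total = timed + requested
    if total == 0:
        return {"checkpoints": 0, "avg_write_s": 0.0,
                "avg_sync_s": 0.0, "requested_share": 0.0}
    return {
        "checkpoints": total,
        # Time spent writing dirty buffers, averaged per checkpoint.
        "avg_write_s": write_ms / total / 1000.0,
        # Time spent in fsync, averaged per checkpoint.
        "avg_sync_s": sync_ms / total / 1000.0,
        # Share of checkpoints that were requested rather than scheduled.
        "requested_share": requested / total,
    }


def fetch_checkpoint_summary(cursor):
    """Run the query on an open DB-API cursor and summarize the single result row."""
    cursor.execute(CHECKPOINT_SQL)
    return summarize_checkpoints(*cursor.fetchone())
```

From a Django shell this could be called as fetch_checkpoint_summary(cursor) inside a "with connection.cursor() as cursor:" block, using django.db.connection.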
Incident 2:
- Simple SELECT ... WHERE email = '...' LIMIT 1 (indexed column) stuck for 63+ seconds on DataFileRead
- Up to 9 processes blocked simultaneously waiting on disk
- Gunicorn worker killed due to timeout → handle_bounce() in SES webhook processing
- Lock monitor completely empty: zero lock contention, pure IO stall
- Resolved itself after ~64 seconds (IO freed up)
Impact: Workers time out and crash repeatedly. After crash, shared_buffers cache is cold → everything must be read from the degraded disk → crash loop.
What we ruled out: Not a query issue (single-row PK/indexed lookups), not missing indices (verified), not lock contention (blocking monitor empty), not CPU/RAM (both near-idle). Railway's DataUI also failing confirms it's below the application layer.
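For anyone who wants to reproduce the wait-event observations from the report, this is a rough Python sketch of how active backends could be sampled from pg_stat_activity and grouped by what they are waiting on. It is not the monitoring Claude Code actually ran. The function names, the 30-second threshold, and the sample rows in the usage note are assumptions for illustration.

```python
# Sketch only: group active backends by wait event and flag long IO waits.
# WAIT_SQL, sample_waits, summarize_waits and io_stalled are hypothetical names.
from collections import Counter

WAIT_SQL = """
SELECT pid, wait_event_type, wait_event,
       EXTRACT(EPOCH FROM (now() - query_start))::float8 AS seconds_running
FROM pg_stat_activity
WHERE state = 'active' AND pid <> pg_backend_pid()
"""


def sample_waits(cursor):
    """Return a list of (pid, wait_event_type, wait_event, seconds_running) rows."""
    cursor.execute(WAIT_SQL)
    return cursor.fetchall()


def summarize_waits(rows):
    """Count backends per (wait_event_type, wait_event); NULL means not waiting."""
    counts = Counter()
    for _pid, wait_type, wait_event, _seconds in rows:
        counts[(wait_type or "none", wait_event or "none")] += 1
    return counts


def io_stalled(rows, min_seconds=30.0):
    """PIDs whose query has run at least min_seconds and is currently waiting on IO."""
    return [
        pid
        for pid, wait_type, _event, seconds in rows
        if wait_type == "IO" and seconds is not None and seconds >= min_seconds
    ]
```

With made-up rows such as [(101, "IO", "DataFileRead", 63.0), (102, "Lock", "transactionid", 40.0)], summarize_waits counts one backend per wait event and io_stalled returns [101]. Many backends under ("IO", "DataFileRead") with an empty "Lock" bucket would match the pattern described in the report.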
Again:
I don't know much about infra (this is why I'm on Railway, btw).
What do I have to do to get this running smoothly again?