Postgres Volume IO severely degraded — recurring production outages
mowsen
PRO · OP

2 months ago

Hey Team,

My Postgres instance has recurring outages: 93cef097-29d1-4570-a5cf-3fbe32f016e0

Because of this, my Django app keeps crashing and restarting. It has been happening for the last two days, with a crash at random roughly every half hour.

I analyzed the problem with Claude Code and Codex because I don't understand anything about infrastructure.

This is the report Claude Code generated:

Setup: PostgreSQL 16, EU West (Amsterdam), 80 GB Volume, container running since Jan 7. 32 vCPU / 32 GB RAM limit, but actual usage is ~2–15 GB RAM / ~0% CPU — resources are not the bottleneck.

Problem: Recurring disk IO saturation on the attached Volume. Two incidents in the last hour alone:

Incident 1:

  • Single-row UPDATEs (by primary key) wait 30–60+ seconds on DataFileRead
  • WAL writes stall for 20–45 seconds
  • Checkpoint total time up to 297 seconds, single fsync: 54.5 seconds
  • Railway's own DataUI times out on SELECT table_name FROM information_schema.tables
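
For anyone who wants to pull the checkpoint numbers above out of the Postgres logs themselves, here is a minimal sketch. It assumes `log_checkpoints` is enabled and that the log lines follow the usual "checkpoint complete" layout. The names `parse_checkpoint` and `is_slow`, the thresholds, and the sample line are all made up for illustration (only the 297 s total and 54.5 s fsync come from the report).

```python
import re

# Timing fields of a PostgreSQL "checkpoint complete" log line
# (written when log_checkpoints is on).
CHECKPOINT_RE = re.compile(
    r"checkpoint complete:.*?"
    r"write=(?P<write>[\d.]+) s, sync=(?P<sync>[\d.]+) s, "
    r"total=(?P<total>[\d.]+) s; "
    r"sync files=(?P<files>\d+), longest=(?P<longest>[\d.]+) s"
)


def parse_checkpoint(line):
    """Return the checkpoint timings as a dict, or None if the line does not match."""
    m = CHECKPOINT_RE.search(line)
    if m is None:
        return None
    return {
        "write_s": float(m["write"]),
        "sync_s": float(m["sync"]),
        "total_s": float(m["total"]),
        "sync_files": int(m["files"]),
        "longest_fsync_s": float(m["longest"]),
    }


def is_slow(cp, max_total_s=60.0, max_fsync_s=5.0):
    """Flag a checkpoint as slow. Both thresholds are arbitrary assumptions."""
    return cp["total_s"] > max_total_s or cp["longest_fsync_s"] > max_fsync_s


# Hypothetical sample line: only total=297 s and longest=54.5 s match the
# report, every other number is invented.
SAMPLE = (
    "LOG:  checkpoint complete: wrote 4211 buffers (0.4%); "
    "0 WAL file(s) added, 0 removed, 3 recycled; "
    "write=241.300 s, sync=55.600 s, total=297.000 s; "
    "sync files=61, longest=54.500 s, average=0.911 s; "
    "distance=48213 kB, estimate=51002 kB"
)

if __name__ == "__main__":
    cp = parse_checkpoint(SAMPLE)
    print(cp, "slow" if is_slow(cp) else "ok")
```

A single fsync taking tens of seconds is the interesting field here, since it measures the disk directly rather than anything the queries do.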

Incident 2:

  • Simple SELECT ... WHERE email = '...' LIMIT 1 (indexed column) stuck for 63+ seconds on DataFileRead
  • Up to 9 processes blocked simultaneously waiting on disk
  • Gunicorn worker killed due to timeout → handle_bounce() in SES webhook processing
  • Lock monitor completely empty — zero lock contention, pure IO stall
  • Resolved itself after ~64 seconds (IO freed up)
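
The "pure IO stall, no lock contention" conclusion can be checked from `pg_stat_activity`, where `DataFileRead` appears as a `wait_event` under `wait_event_type = 'IO'` and lock waits appear under `'Lock'`. The sketch below is only an illustration of that check: the helper names, the 5-second cutoff, and the sample rows are assumptions, and fetching the rows (for example with a Django cursor) is left out.

```python
from collections import Counter

# Query to run against the affected database; the columns are standard
# pg_stat_activity columns.
ACTIVITY_SQL = """
SELECT pid, wait_event_type, wait_event,
       EXTRACT(EPOCH FROM now() - query_start) AS seconds
FROM pg_stat_activity
WHERE state = 'active' AND pid <> pg_backend_pid()
"""


def summarize_waits(rows, min_seconds=5.0):
    """Count backends active for at least min_seconds, grouped by wait_event_type.

    rows is an iterable of dicts with the columns selected by ACTIVITY_SQL.
    Backends that are not waiting on anything are counted as "running".
    """
    counts = Counter()
    for row in rows:
        if row["seconds"] is not None and row["seconds"] >= min_seconds:
            counts[row["wait_event_type"] or "running"] += 1
    return counts


def looks_like_io_stall(counts):
    """True if some backends wait on IO and none wait on locks."""
    return counts.get("IO", 0) > 0 and counts.get("Lock", 0) == 0


# Hypothetical snapshot shaped like Incident 2: nine backends stuck on
# DataFileRead for about a minute, plus one short-running query.
SAMPLE_ROWS = [
    {"pid": 1000 + i, "wait_event_type": "IO",
     "wait_event": "DataFileRead", "seconds": 63.0}
    for i in range(9)
] + [{"pid": 2000, "wait_event_type": None, "wait_event": None, "seconds": 0.2}]

if __name__ == "__main__":
    counts = summarize_waits(SAMPLE_ROWS)
    print(dict(counts), looks_like_io_stall(counts))
```

Running the query a few times during an incident and seeing only `IO` waits, with nothing under `Lock`, would match what the report describes.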

Impact: Workers time out and crash repeatedly. After a crash, the shared_buffers cache is cold → everything must be read from the degraded disk → crash loop.

What we ruled out: Not a query issue (single-row PK/indexed lookups), not missing indices (verified), not lock contention (blocking monitor empty), not CPU/RAM (both near-idle). Railway's DataUI also failing confirms it's below the application layer.

Again:

I don't know much about infra (this is why I'm on Railway, btw).

What do I have to do to get this running smoothly again?

0 Replies
