Postgres unreachable due to volume fsync latency spike on us-east4-eqdc4a (2026-05-12 UTC) — RCA?

Question

Posting publicly to ask Railway for an RCA — [status.railway.com](http://status.railway.com) shows nothing for this window.

A Postgres service on a persistent volume in us-east4-eqdc4a (image postgres-ssl:17) became unreachable on 2026-05-12.

App-side symptoms: statement timeout on simple session/messages updates and selects, then connections refused. No app or DB changes in the prior 30 hours, workload steady.

Looking at log\_checkpoints, the underlying cause looks like a clean degradation of single-file fsync latency:

checkpoint complete series (UTC)

┌─────────────────────┬──────────┬───────────────┬────────────────┐
│ Time                │ sync     │ longest fsync │ distance (WAL) │
├─────────────────────┼──────────┼───────────────┼────────────────┤
│ 05-11 18:01         │ 0.015 s  │ 0.012 s       │ 54 kB          │
│ 05-11 21:46         │ 0.077 s  │ 0.065 s       │ 65 kB          │
│ 05-11 21:51         │ 0.647 s  │ 0.557 s       │ 719 kB         │
│ 05-11 22:22         │ 4.231 s  │ 3.667 s       │ 403 kB         │
│ 05-12 00:01 → 00:07 │ 76.843 s │ 31.879 s      │ 8 kB           │
└─────────────────────┴──────────┴───────────────┴────────────────┘
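To put a number on the degradation, here is a minimal sketch that divides each checkpoint's longest fsync by the 18:01 value. The figures are copied from the table above; treating the 18:01 row as "baseline" is our assumption, and the names `CHECKPOINTS` and `slowdown_vs_baseline` are just illustrative.

```python
# (time UTC, sync seconds, longest fsync seconds, WAL distance in kB),
# copied from the checkpoint table above.
CHECKPOINTS = [
    ("05-11 18:01", 0.015, 0.012, 54),
    ("05-11 21:46", 0.077, 0.065, 65),
    ("05-11 21:51", 0.647, 0.557, 719),
    ("05-11 22:22", 4.231, 3.667, 403),
    ("05-12 00:01", 76.843, 31.879, 8),
]


def slowdown_vs_baseline(rows):
    """Return (time, ratio) pairs: longest fsync relative to the first row."""
    baseline = rows[0][2]  # assumption: the first row is the healthy baseline
    return [(time, longest / baseline) for time, _sync, longest, _dist in rows]


for time, ratio in slowdown_vs_baseline(CHECKPOINTS):
    print(f"{time}  longest fsync = {ratio:8.1f}x baseline")
```

Against the 12 ms row the last checkpoint comes out at roughly 2,650×; against the 65 ms row at 21:46 it is closer to 490×. Either way the slowdown is orders of magnitude while WAL distance dropped to 8 kB.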

Longest fsync went from 12–65 ms to 31.879 s in a checkpoint covering just 8 kB of WAL: \~500–2,650× over baseline with the workload essentially idle, so this can't be explained by app pressure. App-visible breakage started 7 minutes after the catastrophic checkpoint:

2026-05-12 00:14:16 UTC ERROR: canceling statement due to statement timeout

while locking tuple (41,32) in relation "session"

STATEMENT: update "session" set "expiresAt" = $1, …

PG logs in this window contain no FATAL / PANIC / ENOSPC / could not extend file. Volume usage at the time was \~1%. So it's not disk-full, not a PG fault, not a code change.

We resized the volume as a mitigation at \~01:32 UTC. After the remount, PG reached ready to accept connections in 1.2 s on the same data. The next hour of checkpoints, on a higher workload than during the failure:

┌─────────────┬─────────┬─────────┬───────────┐
│ Time        │ sync    │ longest │ distance  │
├─────────────┼─────────┼─────────┼───────────┤
│ 05-12 01:38 │ 0.043 s │ 0.015 s │ 1 693 kB  │
│ 05-12 01:46 │ 0.126 s │ 0.036 s │ 13 572 kB │
│ 05-12 02:14 │ 0.108 s │ 0.027 s │ 8 594 kB  │
│ 05-12 02:22 │ 0.190 s │ 0.065 s │ 17 272 kB │
└─────────────┴─────────┴─────────┴───────────┘

Back to 15–65 ms longest fsync (43–190 ms total sync). Everything else is unchanged (same image, same data, same volume handle); only the underlying mount changed.
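For anyone who wants to pull the same series out of their own logs, here is a rough sketch. It assumes the `checkpoint complete` message carries `sync=… s`, `longest=… s` and `distance=… kB` fields, as ours did with log\_checkpoints on. The `SAMPLE` line is an abbreviated, hypothetical line built from the 18:01 values, not a verbatim log entry; real lines carry more fields. `parse_checkpoint` is an illustrative helper name.

```python
import re

# One pattern per field. "sync=" does not match the separate "sync files=" field.
FIELD_PATTERNS = {
    "sync_s": re.compile(r"\bsync=([\d.]+) s"),
    "longest_s": re.compile(r"\blongest=([\d.]+) s"),
    "distance_kb": re.compile(r"\bdistance=(\d+) kB"),
}


def parse_checkpoint(line):
    """Return the three timing fields from one log line, or None if absent."""
    if "checkpoint complete" not in line:
        return None
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(line)
        if match is None:
            return None  # line does not have the assumed format
        fields[name] = float(match.group(1))
    return fields


# Hypothetical, abbreviated line for illustration only.
SAMPLE = "LOG:  checkpoint complete: sync=0.015 s, longest=0.012 s; distance=54 kB"
print(parse_checkpoint(SAMPLE))
```

Feeding every log line through `parse_checkpoint` and keeping the non-`None` results gives the sync / longest / distance columns shown in the tables.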

Asks

1\. Was there a storage-host event on us-east4-eqdc4a during 05-11 21:51 → 05-12 01:32 UTC?

2\. Did the resize-triggered remount move the volume to a different physical host? Post-recovery numbers strongly suggest yes.

3\. Anyone else on us-east4-eqdc4a see fsync drift in this window? Would help to correlate.
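If it helps with correlating, here is a minimal probe that times `os.fsync` on small writes. It is only a sketch: it measures latency for whatever directory you point it at, so it needs to run against a path on the affected volume to be comparable, and the 8 kB payload and sample count are arbitrary choices of ours. `fsync_latencies` is an illustrative name.

```python
import os
import tempfile
import time


def fsync_latencies(directory, samples=5, payload=b"\0" * 8192):
    """Write `payload` and fsync it `samples` times; return seconds per fsync."""
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory, prefix="fsync-probe-")
    try:
        for _ in range(samples):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # blocks until the data reaches stable storage
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)  # clean up the probe file
    return latencies


if __name__ == "__main__":
    # Assumption: replace the temp dir with a directory on the volume under test.
    for seconds in fsync_latencies(tempfile.gettempdir()):
        print(f"{seconds * 1000:.2f} ms")
```

Readings in the tens of milliseconds would match our healthy checkpoints; anything in the seconds range for an 8 kB write would look like what we saw during the incident.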

Would really appreciate an RCA even retroactively.