Postgres unreachable due to volume fsync latency spike on us-east4-eqdc4a (2026-05-12 UTC) — RCA?
czfly
PRO · OP

11 days ago

Posting publicly to ask Railway for an RCA — status.railway.com shows nothing for this window.

A Postgres service on a persistent volume in us-east4-eqdc4a (image postgres-ssl:17) became unreachable on 2026-05-12.

App-side symptoms: statement timeouts on simple session/messages updates and selects, then connections refused. No app or DB changes in the prior 30 hours; workload steady.

Looking at log_checkpoints, the underlying cause looks like a clean degradation of single-file fsync latency:

checkpoint complete series (UTC)

┌─────────────────────┬──────────┬───────────────┬────────────────┐
│ Time                │ sync     │ longest fsync │ distance (WAL) │
├─────────────────────┼──────────┼───────────────┼────────────────┤
│ 05-11 18:01         │ 0.015 s  │ 0.012 s       │ 54 kB          │
│ 05-11 21:46         │ 0.077 s  │ 0.065 s       │ 65 kB          │
│ 05-11 21:51         │ 0.647 s  │ 0.557 s       │ 719 kB         │
│ 05-11 22:22         │ 4.231 s  │ 3.667 s       │ 403 kB         │
│ 05-12 00:01 → 00:07 │ 76.843 s │ 31.879 s      │ 8 kB           │
└─────────────────────┴──────────┴───────────────┴────────────────┘
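For anyone who wants to build the same series from their own logs, here is a minimal Python sketch that pulls sync, longest, and distance out of a "checkpoint complete" line. It assumes the usual log_checkpoints line layout; the sample line is hypothetical (only the sync, longest, and distance values come from the first table row, every other field is a placeholder), and `parse_checkpoint` is an illustrative helper name, not something from the original post.

```python
import re

# Matches the fields of a "checkpoint complete" line that the table uses.
# Assumes the standard log_checkpoints layout; other fields are ignored.
CHECKPOINT_RE = re.compile(
    r"checkpoint complete:.*?"
    r"sync=(?P<sync>[\d.]+) s.*?"
    r"longest=(?P<longest>[\d.]+) s.*?"
    r"distance=(?P<distance>\d+) kB"
)


def parse_checkpoint(line):
    """Return sync/longest (seconds) and distance (kB), or None if no match."""
    m = CHECKPOINT_RE.search(line)
    if m is None:
        return None
    return {
        "sync_s": float(m["sync"]),
        "longest_s": float(m["longest"]),
        "distance_kb": int(m["distance"]),
    }


# Hypothetical sample line: sync, longest and distance are taken from the
# 05-11 18:01 row; timestamp, PID, buffer counts and write/total times are
# made-up placeholders.
sample = (
    "2026-05-11 18:01:00.000 UTC [27] LOG:  checkpoint complete: "
    "wrote 12 buffers (0.1%); 0 WAL file(s) added, 0 removed, 0 recycled; "
    "write=1.105 s, sync=0.015 s, total=1.130 s; "
    "sync files=9, longest=0.012 s, average=0.002 s; "
    "distance=54 kB, estimate=54 kB"
)
print(parse_checkpoint(sample))
```

Lines that are not checkpoint completions (for example "checkpoint starting: time") return None, so the function can be mapped over a whole log file and the None entries dropped.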

The longest single-file fsync went from 12–65 ms to 31.879 s while flushing 8 kB of WAL, roughly 500× to 2,700× over baseline with the workload essentially idle, so this can't be explained by app pressure. App-visible breakage started 7 minutes after the catastrophic checkpoint completed:

2026-05-12 00:14:16 UTC ERROR: canceling statement due to statement timeout
    while locking tuple (41,32) in relation "session"
    STATEMENT: update "session" set "expiresAt" = $1, …

PG logs in this window contain no FATAL / PANIC / ENOSPC / could not extend file. Volume usage at the time was ~1%. So it's not disk-full, not a PG fault, not a code change.

We resized the volume as a mitigation at ~01:32 UTC. After the remount, PG reached "ready to accept connections" in 1.2 s on the same data. The next hour of checkpoints, on a higher workload than during the failure:

┌─────────────┬─────────┬─────────┬───────────┐
│ Time        │ sync    │ longest │ distance  │
├─────────────┼─────────┼─────────┼───────────┤
│ 05-12 01:38 │ 0.043 s │ 0.015 s │ 1 693 kB  │
│ 05-12 01:46 │ 0.126 s │ 0.036 s │ 13 572 kB │
│ 05-12 02:14 │ 0.108 s │ 0.027 s │ 8 594 kB  │
│ 05-12 02:22 │ 0.190 s │ 0.065 s │ 17 272 kB │
└─────────────┴─────────┴─────────┴───────────┘

Back to 15–65 ms longest fsync, with total sync under 200 ms. Everything else unchanged: same image, same data, same volume handle. Only the underlying mount changed.
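The size of the regression can be checked directly from the two tables. A small sketch of the arithmetic, using only the longest-fsync values reported above (which row counts as "baseline" is a judgment call; the 18:01 checkpoint is assumed here):

```python
# Longest single-file fsync per checkpoint, in seconds, copied from the tables.
baseline = 0.012                          # 05-11 18:01 (assumed baseline row)
incident_peak = 31.879                    # 05-12 00:01 -> 00:07
recovery = [0.015, 0.036, 0.027, 0.065]   # 05-12 01:38 .. 02:22

# How much slower the worst checkpoint was than the chosen reference points.
slowdown_vs_baseline = incident_peak / baseline
slowdown_vs_recovery = incident_peak / max(recovery)

print(f"incident vs baseline: {slowdown_vs_baseline:.0f}x")
print(f"incident vs worst post-recovery checkpoint: {slowdown_vs_recovery:.0f}x")
```

Even against the slowest post-recovery checkpoint, which flushed far more WAL, the incident peak is several hundred times slower.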

Asks

1. Was there a storage-host event on us-east4-eqdc4a during 05-11 21:51 → 05-12 01:32 UTC?

2. Did the resize-triggered remount move the volume to a different physical host? Post-recovery numbers strongly suggest yes.

3. Anyone else on us-east4-eqdc4a see fsync drift in this window? Would help to correlate.
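If you want to check your own checkpoint series for the same kind of drift, a rough sketch along these lines may help. The 10× factor and the helper name `flag_drift` are arbitrary assumptions for illustration, not a Railway or Postgres recommendation; the series is the longest-fsync column from the first table.

```python
def flag_drift(checkpoints, baseline_s, factor=10.0):
    """Return (time, longest_s) pairs whose longest fsync exceeds factor * baseline."""
    limit = baseline_s * factor
    return [(t, s) for t, s in checkpoints if s > limit]


# (checkpoint time, longest fsync in seconds) from the incident table.
series = [
    ("05-11 18:01", 0.012),
    ("05-11 21:46", 0.065),
    ("05-11 21:51", 0.557),
    ("05-11 22:22", 3.667),
    ("05-12 00:01", 31.879),
]

# Baseline assumed to be the 18:01 checkpoint; the 10x factor is arbitrary.
for t, s in flag_drift(series, baseline_s=0.012):
    print(t, s)
```

With these assumptions the first flagged checkpoint is 05-11 21:51, which is where the window in ask 1 starts.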

Would really appreciate an RCA even retroactively.


Status changed to Awaiting Railway Response Railway 11 days ago


codydearkland
EMPLOYEE

7 days ago

Confirming this was a host-level event, not a Postgres issue.

A compute host in us-east4-eqdc4a had two flapping NIC links during this window. Volumes there are network-attached, so retransmits on the host's storage path translated directly into the fsync latency you observed. The path had been running on partial redundancy from a known-bad cable a few days earlier; the second link degrading is what pushed it over.

Answering your three asks:

  1. Storage-host event: Yes, on the host serving your volume. Network path to storage was degrading from late on 05-11 through 01:30 UTC on 05-12.

  2. Did the remount move you to a different host? No — volumes are host-pinned. What changed at ~01:32 UTC is that we took the bad NIC links out of rotation. Your resize/remount landed right after that, on the same host with a healthy network path. The 1.2s ready time and the recovered checkpoint numbers are from the path being restored, not from a physical move.

  3. Others on the same host: Yes, workloads sharing that compute host may have had some intermittent issues during that same window.

One more thing worth flagging: looking at your recent checkpoints, sync is healthy but write is still stretching out periodically (tens of seconds for small batches). Different shape than the original incident, and worth keeping an eye on.


Status changed to Awaiting User Response Railway 7 days ago

