Severe Postgres disk I/O latency — 38s COMMIT on a 40MB database

bondi-admin

PROOP

2 months ago

I'm experiencing severe and intermittent disk I/O latency on my Postgres instance. The most striking symptom is a COMMIT statement taking 38.5 seconds to complete on a database that is only ~40MB in total size. Since the entire dataset fits in shared_buffers many times over, this can only be explained by extremely slow fsync performance on the underlying storage.

This is causing user-facing request timeouts in production, even though the database itself is tiny and the queries are well-optimized.

Evidence

1. A 38.5-second COMMIT

The clearest signal of storage I/O problems is this log line:

2026-05-06 23:15:45.075 UTC [4173] LOG:  duration: 38562.177 ms  statement: COMMIT;

A COMMIT doing nothing but flushing WAL to disk should complete in microseconds to a few milliseconds on healthy storage. Taking 38 seconds on a 40MB database with no synchronous replication can only be caused by fsync blocking on slow storage.

While this COMMIT was blocked, it was holding all locks acquired by the transaction, causing other queries to queue behind it.

2. Long-tailed query latencies on trivial queries

pg_stat_statements shows extreme variance on simple primary-key/indexed lookups, despite EXPLAIN ANALYZE showing sub-millisecond execution times:

Query calls mean max stddev SELECT * FROM order_links WHERE public_id = $1 LIMIT 1 22,246 0.77 ms 14,579 ms 97 ms SELECT * FROM drivers WHERE user_id = $1 LIMIT 1 1,593 12.80 ms 9,949 ms 352 ms SELECT * FROM order_services WHERE id = $1 LIMIT 1 1,280 3.78 ms 4,693 ms 131 ms

For example, EXPLAIN (ANALYZE, BUFFERS) on the drivers query:

Limit  (cost=0.28..2.50 rows=1 width=122) (actual time=0.076..0.077 rows=1 loops=1)
  ->  Index Scan using drivers_user_id_idx on drivers
        Index Cond: (user_id = 873)
Planning Time: 2.188 ms
Execution Time: 0.127 ms

The query executes in 0.127 ms, but occasionally takes 9.9 seconds in production. The factor of ~80,000x cannot be explained by query plan, indexing, or data size. This is consistent with queries waiting on locks held by other transactions that are themselves blocked on slow fsync.

3. Checkpoint timing

Checkpoints themselves show acceptable write and sync times individually, but sync times occasionally spike:

2026-05-06 23:15:47.473 UTC checkpoint complete: wrote 22 buffers (0.0%);
  write=2.052 s, sync=0.063 s, total=49.839 s
  sync files=22, longest=0.047 s, average=0.003 s
  distance=144 kB

2026-05-06 23:10:47.535 UTC checkpoint complete: wrote 23 buffers (0.0%);
  write=2.059 s, sync=0.859 s, total=50.325 s
  sync files=23, longest=0.779 s, average=0.038 s
  distance=143 kB

2026-05-07 00:15:55.493 UTC checkpoint complete: wrote 8 buffers (0.0%);
  write=0.834 s, sync=0.768 s, total=55.583 s
  sync files=8, longest=0.679 s, average=0.096 s
  distance=30 kB

Note the tiny amount of data being checkpointed (143–287 kB, single-digit-to-dozens of buffers). For checkpoints this small to have individual fsync operations taking up to 779 ms ("longest" field) suggests the storage is not consistently responsive.

What I've ruled out

Database size / cache misses: The database is 40MB; everything fits in RAM many times over.
Query plans / missing indexes:EXPLAIN ANALYZE confirms sub-millisecond execution for the slow queries.
Application-side lock contention:lock_timeout is set to 5s, so heavyweight lock waits would abort, but queries are completing successfully after 9–14 seconds — meaning they were blocked on something other than a heavyweight lock (consistent with waiting on a transaction stuck in COMMIT fsync).
Synchronous replication:synchronous_standby_names is empty.
Application bugs holding transactions open: Even when this happens, it cannot make a COMMIT itself take 38 seconds — that time is spent inside the storage layer.

What I'd like to understand

Are there known I/O performance issues or limits on the storage tier my Postgres instance is using?
Is there visibility into disk-level metrics (IOPS, latency, queue depth) for my Postgres service that I can access?
Is there a "noisy neighbor" or shared I/O contention pattern on the underlying host that could explain occasional multi-second fsync stalls?
Are there higher-tier plans or storage options with provisioned/dedicated IOPS that would give more consistent latency?

I'm planning to mitigate on my side by setting synchronous_commit = off and adding application-level retries with idempotency keys, but those are workarounds for what appears to be an underlying storage issue — and I'd like to understand the root cause before deciding whether to stay on the current plan or migrate.

Happy to provide additional logs, query samples, or run any diagnostics you'd suggest.

Thanks!

Solved

3 Replies

bondi-admin

PROOP

2 months ago

Environment

Service: Postgres (Railway-managed)
Database size: ~40MB (pg_database_size = 39,959,575 bytes)
Postgres configuration (relevant):
- shared_buffers = 8GB
- effective_cache_size = 24GB
- synchronous_commit = on
- synchronous_standby_names = (empty, no replication)
- max_wal_size = 4GB
- min_wal_size = 1GB
- checkpoint_completion_target = 0.9

Given the 40MB database size and 8GB shared_buffers, the entire working set is cached in RAM. Disk reads are essentially never on the hot path — only writes (WAL fsync, checkpoints) hit storage.

Status changed to Awaiting Railway Response Railway • 2 months ago

ray-chen

EMPLOYEE

2 months ago

We're aware of this and actively investigating. We'll update you as soon as we have more to share.

In the meantime, if you're experiencing impact, redeploying your service to a different region can help as an immediate workaround. Apologies for the trouble!

Status changed to Awaiting User Response Railway • about 2 months ago

chandrika

EMPLOYEE

2 months ago

Hey, sorry for the long wait on this. We've identified the root cause: your Postgres instance is on a storage host that's experiencing degraded I/O performance, which is what's causing the slow COMMITs and fsync stalls you documented.

We've staged a migration to a newer, healthier host on your service. To apply it, you'll need to trigger a redeploy — the volume will migrate along with the service. Since this is a stateful service, there will be a brief period of downtime during the migration (typically a few minutes).

You can redeploy whenever works best for you. Once you're on the new host, the storage latency should return to normal and you shouldn't see those multi-second COMMITs anymore.

Apologies again for the delay — and thanks for the thorough write-up, it helped us track down the issue across multiple affected hosts.

Railway

BOT

a month ago

This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!

Status changed to Solved Railway • about 1 month ago

Welcome!