Severe latency on Railway since earlier today affecting inserts/updates
cwomil
PROOP

18 days ago

We’re seeing severe latency on Railway since earlier today, with no app deploy or traffic spike. (In fact, my whole environment is pre-launch, so it's effectively one user at a time for the most part, three at most.) I added a lot of granular debug logging to try to understand what was going on.

  • Symptoms:
    • Backend response times spike to 8-10s.
    • Postgres CPU/memory/network metrics look flat.
    • Reads are fast, but simple writes/commits are slow.
    • Example from app-level probes:
      • SELECT 1: 14ms
      • no-op write: 652ms
      • transaction write body: 15ms
      • transaction commit/return: 143ms-2.2s
      • one-row inserts/updates: 1-4s
    • This affects simple game restart/startup flows with one active user.

This looks like platform/DB write or commit latency rather than app load. Can you check for regional Postgres/internal network/storage latency incidents?
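For anyone who wants to collect the same kind of per-step numbers, here is a minimal sketch of an app-level probe. It is not the code used for the figures above; the helper names `time_steps` and `slowest` are made up for illustration, and the stand-in steps just sleep. In a real probe each step would be a callable that runs one statement (for example `SELECT 1`, the write body, and the commit) on your own database connection.

```python
import time
from typing import Callable, Dict


def time_steps(steps: Dict[str, Callable[[], object]]) -> Dict[str, float]:
    """Run each named step once, in order, and return elapsed time in milliseconds."""
    timings: Dict[str, float] = {}
    for name, step in steps.items():
        start = time.perf_counter()
        step()
        timings[name] = (time.perf_counter() - start) * 1000.0
    return timings


def slowest(timings: Dict[str, float]) -> str:
    """Return the name of the step that took the longest."""
    return max(timings, key=timings.get)


if __name__ == "__main__":
    # Stand-in steps; replace the lambdas with calls on a real connection.
    demo = time_steps({
        "select_1": lambda: time.sleep(0.01),
        "write_body": lambda: time.sleep(0.01),
        "commit": lambda: time.sleep(0.05),
    })
    for name, ms in demo.items():
        print(f"{name}: {ms:.0f}ms")
    print("slowest step:", slowest(demo))
```

Timing the write body and the commit as separate steps is what makes it possible to tell a slow statement from a slow commit.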
$20 Bounty

5 Replies

Railway
BOT

18 days ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open Railway 18 days ago


laurentcompania
PRO

18 days ago

Same exact pattern here today on a MySQL service in europe-west4 — different DB engine, identical symptom profile:

  • SELECTs fast (~50-150ms warm), buffer pool reads fine
  • INSERT/UPDATE on indexed PK: 1-6 seconds, sometimes more
  • Auth login (read + bcrypt only, no write): now 13-28 seconds
  • Postgres equivalent of your finding: in MySQL I see UPDATEs sitting in waiting for handler commit (= waiting on fsync) for several seconds at a time
  • Same flat CPU/memory/network metrics
  • No deploy, no traffic spike — went from 100% normal to broken at 05:18 UTC this morning

Earlier today the same issue caused a 45-min outage on my side: writes piled up → connections wedged → restart got stuck on Mounting volume on: /var/lib/containers/railwayapp/bind-mounts/... for 30+ minutes before the host released it. InnoDB crash recovery then took another 12 minutes.

There's at least one other thread today reporting the same on Pro plan MySQL with 0.66 fsyncs/sec in INNODB STATUS — clear sign of a host-level storage degradation.

This is definitely not isolated — Postgres + multiple MySQL services across the day, all with the same write/commit slowness signature, all on Railway. Looks like an underlying storage/host incident affecting multiple DBs.


iacopoinved
PRO

18 days ago

+1 on Postgres too. This is a severe performance degradation that should be taken seriously.


tude-diniz
PRO

18 days ago

We experienced the exact same symptoms last week: fast reads, slow writes, and commit latency spiking to seconds with no traffic increase or deploy. Here's what we found and fixed:

1. shared_buffers too small (biggest impact)

The default PostgreSQL shared_buffers on Railway is 128MB. When your working dataset exceeds this, backends start writing dirty buffers directly to disk (buffers_backend), which shows up as WALSync and WALWrite waits during commits, exactly what you're seeing.

Fix: Run ALTER SYSTEM SET shared_buffers = '6GB'; (or more, depending on your plan), then restart the Postgres service. This setting only takes effect after a restart.

2. synchronous_commit = off (immediate relief, no restart needed)

ALTER SYSTEM SET synchronous_commit = off;

SELECT pg_reload_conf();

This prevents commits from waiting for WAL fsync. The tradeoff is that a hard crash can lose the most recent commits (the window is up to about three times wal_writer_delay, which defaults to 200ms), with no corruption risk. This gave us an immediate improvement in commit latency.

3. checkpoint_timeout increase

ALTER SYSTEM SET checkpoint_timeout = '15min';

SELECT pg_reload_conf();

The default 5-minute checkpoint interval was causing periodic WALSync stalls every few minutes.

4. Regarding the Railway restart

We had an issue where restarting the Postgres container to apply shared_buffers caused the container to get stuck at Mounting volume for 20+ minutes. It only resolved after triggering a redeploy (not just a restart). Worth being aware of if you go that route; do it during low traffic.

Hope this helps narrow it down for you.



cwomil
PROOP

18 days ago

Thanks, this was very useful.

We checked our DB settings and they are indeed the Railway defaults:

  • shared_buffers = 128MB
  • checkpoint_timeout = 5min
  • synchronous_commit = on
  • max_wal_size = 1GB
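A minimal sketch of how these values can be read back, assuming any Python DB-API cursor connected to the database (the functions `read_wal_settings` and `needs_restart` are hypothetical helpers, not part of any driver). It queries the `pg_settings` view, whose `pending_restart` column shows whether a changed value is still waiting for a restart, which matters for shared_buffers.

```python
SETTINGS_SQL = """
SELECT name, setting, unit, pending_restart
FROM pg_settings
WHERE name IN ('shared_buffers', 'checkpoint_timeout',
               'synchronous_commit', 'max_wal_size')
ORDER BY name
"""


def read_wal_settings(cursor):
    """Return {name: (setting, unit, pending_restart)} for the settings of interest."""
    cursor.execute(SETTINGS_SQL)
    return {
        name: (setting, unit, pending_restart)
        for name, setting, unit, pending_restart in cursor.fetchall()
    }


def needs_restart(settings):
    """List settings whose new value has been written but not yet applied."""
    return sorted(name for name, (_, _, pending) in settings.items() if pending)
```

Note that `pg_settings` reports raw values with a separate unit column, so the 128MB default for shared_buffers appears as 16384 with unit 8kB, and checkpoint_timeout as 300 with unit s.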

We also checked pg_stat_bgwriter. buffers_backend_fsync is 0, so we don’t yet have evidence that backends are directly fsyncing dirty buffers. But during the actual slow commits, pg_stat_activity did catch the connection waiting on:

  • wait_event_type = IO
  • wait_event = WALSync

So your diagnosis looks directionally right when running it past my agents too: the stall is at WAL sync/commit, not app logic.
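Catching a wait event like this by hand is hit or miss, so here is a rough sketch of sampling `pg_stat_activity` in a loop and tallying what active backends are waiting on. It assumes a Python DB-API cursor on a separate monitoring connection; `sample_wait_events` and its parameters are illustrative names, and the sample count and interval are arbitrary.

```python
import time
from collections import Counter

WAIT_SQL = """
SELECT wait_event_type, wait_event
FROM pg_stat_activity
WHERE state = 'active'
  AND wait_event IS NOT NULL
  AND pid <> pg_backend_pid()
"""


def sample_wait_events(cursor, samples=20, interval_s=0.1, sleep=time.sleep):
    """Poll pg_stat_activity and count (wait_event_type, wait_event) pairs seen."""
    counts = Counter()
    for _ in range(samples):
        cursor.execute(WAIT_SQL)
        for wait_type, wait_event in cursor.fetchall():
            counts[(wait_type, wait_event)] += 1
        sleep(interval_s)
    return counts
```

If the stall is at commit, running this while a slow write is in flight should show ("IO", "WALSync") dominating the tally, with `counts.most_common(5)` giving a quick summary.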

Haven’t applied the config changes yet because:

  • synchronous_commit = off is a durability tradeoff, and this DB also backs auth/entitlements, not just disposable game state.
  • Increasing shared_buffers requires sizing against the actual Railway plan memory and a Postgres restart/redeploy.
  • This started today before we made any code or DB changes that would explain it, with the same traffic pattern and effectively one user.

That’s why we’re hesitant to treat this purely as an app/database tuning issue. The defaults may make the DB more vulnerable to WAL sync stalls, but the sudden onset still makes this look like a Railway/platform/storage regression or incident.
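On the durability concern: PostgreSQL also allows synchronous_commit to be relaxed for a single transaction with SET LOCAL, so one option would be to keep the server default at on and turn it off only for disposable writes. The sketch below assumes a DB-API connection with autocommit disabled, so the first statement opens the transaction; `run_transaction` and the `durable` flag are hypothetical, not something from this thread.

```python
def run_transaction(conn, statements, durable=True):
    """Run (sql, params) pairs in one transaction.

    With durable=False, the commit of this transaction alone does not wait
    for the WAL flush; every other transaction keeps the server default.
    """
    cur = conn.cursor()
    try:
        if not durable:
            # SET LOCAL lasts only until the end of the current transaction.
            cur.execute("SET LOCAL synchronous_commit TO OFF")
        for sql, params in statements:
            cur.execute(sql, params)
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

Auth and entitlement writes would go through the default durable=True path, and only game-state writes would pass durable=False. This does not fix the underlying fsync latency; it only limits which commits have to wait for it.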

If Railway does not respond in a reasonable time frame, we may first try a less risky tuning change like increasing checkpoint_timeout, then evaluate shared_buffers with the correct memory budget.
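For the memory budget, a rough starting point is the PostgreSQL documentation's guideline of about 25% of system memory for shared_buffers, with more than roughly 40% unlikely to help. The helper below is a sketch of that arithmetic only; `suggest_shared_buffers` is a hypothetical name, and the plan memory has to come from whatever the Railway service is actually allotted.

```python
def suggest_shared_buffers(plan_memory_mb: int, fraction: float = 0.25) -> str:
    """Return a shared_buffers value (e.g. '2048MB') as a fraction of plan memory."""
    if plan_memory_mb <= 0:
        raise ValueError("plan_memory_mb must be positive")
    if not 0 < fraction <= 0.4:
        raise ValueError("fraction should be in (0, 0.4]")
    return f"{int(plan_memory_mb * fraction)}MB"


if __name__ == "__main__":
    for mem_mb in (2048, 8192, 24576):
        print(f"{mem_mb}MB plan -> shared_buffers = {suggest_shared_buffers(mem_mb)}")
```

By this rule the 6GB value suggested earlier in the thread corresponds to a service with about 24GB of memory, so it should not be copied as-is onto a smaller plan.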


cwomil
PROOP

18 days ago

Not sure what the normal procedure is, but in case someone reading this has not seen the message from Railway acknowledging that it's on their side:

Attachments

