Postgres WALSync/WALWrite stalls causing 15-60s auth requests

onlywhisky

HOBBY OP

22 days ago

Hi Railway team,

I’m on the Hobby plan and seeing intermittent very slow requests against Railway Postgres. The app requests are tiny auth writes, but they sometimes take 15-60 seconds.

For example, this happened around:

2026-06-02T10:09:00Z to 2026-06-02T10:13:30Z

Prisma query event logging, with params disabled/redacted, showed small single-row writes taking tens of seconds:

INSERT INTO "public"."oauth_auth_codes" (...) VALUES (...) RETURNING ...
durationMs=43451

INSERT INTO "public"."oauth_auth_codes" (...) VALUES (...) RETURNING ...
durationMs=27054

ROLLBACK
durationMs=23735

While these requests were still open, app-side DB diagnostics from pg_stat_activity showed active Postgres backends waiting on WAL/storage events:

pid=11948
state=active
wait_event_type=IO
wait_event=WalSync
query_age=00:00:19.750559
query=INSERT INTO "public"."oauth_auth_codes" (...) VALUES (...) RETURNING ...

pid=11949
state=active
wait_event_type=LWLock
wait_event=WALWrite
query=COMMIT

pid=11953
state=active
wait_event_type=LWLock
wait_event=WALWrite
query=INSERT INTO "public"."oauth_auth_codes" (...) VALUES (...) RETURNING ...

There was also one snapshot showing:

wait_event_type=IO
wait_event=DataFileRead
query_age=00:00:19.877743
query=INSERT INTO "public"."oauth_auth_codes" (...) VALUES (...) RETURNING ...

This does not look like a large query or app-side CPU issue. It looks like Postgres/storage/WAL fsync or write latency.
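For anyone reproducing these diagnostics, here is a minimal sketch of how the snapshots above could be captured and grouped. The query text, the `SNAPSHOT_SQL` and `classifyWait` names, and the category labels are assumptions made for illustration, not what the app actually runs. Only the `pg_stat_activity` column names and the wait event names come from Postgres itself. Roughly, `IO / WalSync` means the backend is itself waiting for WAL to reach durable storage, while `LWLock / WALWrite` means it is queued behind another backend that is writing or flushing WAL.

```typescript
// Hypothetical snapshot query. Column names are real pg_stat_activity columns;
// query_age is cast to text so any client receives it as a plain string.
const SNAPSHOT_SQL = `
  SELECT pid, state, wait_event_type, wait_event,
         (now() - query_start)::text AS query_age,
         left(query, 80) AS query
  FROM pg_stat_activity
  WHERE state = 'active' AND pid <> pg_backend_pid()
  ORDER BY query_start`;

// Illustrative labels for the wait events seen in this thread.
type WaitClass = "wal-flush" | "wal-queue" | "data-read" | "other";

function classifyWait(
  waitEventType: string | null,
  waitEvent: string | null,
): WaitClass {
  // Backend is waiting for its own WAL flush to reach durable storage.
  if (waitEventType === "IO" && waitEvent === "WalSync") return "wal-flush";
  // Backend is queued on the WAL write lock held by another backend.
  if (waitEventType === "LWLock" && waitEvent === "WALWrite") return "wal-queue";
  // Backend is waiting for a data page to be read from storage.
  if (waitEventType === "IO" && waitEvent === "DataFileRead") return "data-read";
  return "other";
}

console.log(classifyWait("IO", "WalSync"));
```

With Prisma, the query string could be passed to `prisma.$queryRawUnsafe(SNAPSHOT_SQL)` from a separate diagnostic connection, so the snapshot is not stuck behind the same stalled writes.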

Could you please check the managed Postgres instance/storage layer around that UTC window, especially WAL fsync/write latency, storage throttling, checkpoint/backup activity, or host-level issues?

I can provide more sanitized request IDs or timestamps if useful. I've been facing this issue for more than a week now.

Solved

7 Replies

Status changed to Awaiting Railway Response Railway • 22 days ago

jake

EMPLOYEE

21 days ago

That ties back to application-level inefficiencies. We aren't seeing anything on our end that would cause this.

Status changed to Awaiting User Response Railway • 21 days ago

onlywhisky

HOBBY OP

21 days ago

Could you clarify what specific application-level inefficiency would cause PostgreSQL itself to report prolonged WalSync / WALWrite waits for 20+ seconds on tiny single-row INSERTs and COMMITs?

From the app side, the backend was already inside Postgres, waiting on WAL/storage-related events:

IO / WalSync on INSERT

LWLock / WALWrite on COMMIT / INSERT

That does not look like Prisma CPU time, query planning, locks, or connection waiting.

If Railway is not seeing a platform incident, could you please tell me what additional evidence would distinguish app inefficiency from storage/WAL flush latency on a Hobby Postgres instance?

I can capture pg_stat_activity, pg_stat_bgwriter, transaction ages, connection counts, and exact UTC timestamps during the next occurrence.
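One way such captures could be condensed into evidence is to count, per wait event, how many backends are stuck and how long the oldest has been waiting. The sketch below is only an illustration: the `ActivityRow` shape, the `summarizeWaits` helper, and the sample ages are hypothetical, not output from this database. Several tiny writes parked on the same WAL wait event for seconds at a time would point toward flush latency rather than one slow query.

```typescript
// Hypothetical row shape: one entry per active backend in a snapshot.
interface ActivityRow {
  wait_event_type: string | null;
  wait_event: string | null;
  age_ms: number; // query age at capture time, in milliseconds
}

interface WaitSummary {
  backends: number; // how many backends were on this wait event
  maxAgeMs: number; // oldest query age seen for this wait event
}

// Group snapshot rows by "type/event" and keep a count and the maximum age.
function summarizeWaits(rows: ActivityRow[]): Map<string, WaitSummary> {
  const summary = new Map<string, WaitSummary>();
  for (const row of rows) {
    const key = `${row.wait_event_type ?? "none"}/${row.wait_event ?? "none"}`;
    const entry = summary.get(key) ?? { backends: 0, maxAgeMs: 0 };
    entry.backends += 1;
    entry.maxAgeMs = Math.max(entry.maxAgeMs, row.age_ms);
    summary.set(key, entry);
  }
  return summary;
}

// Made-up sample values, shaped like the snapshots in this thread.
const sampleRows: ActivityRow[] = [
  { wait_event_type: "IO", wait_event: "WalSync", age_ms: 19751 },
  { wait_event_type: "LWLock", wait_event: "WALWrite", age_ms: 12000 },
  { wait_event_type: "LWLock", wait_event: "WALWrite", age_ms: 8000 },
];

for (const [event, stats] of summarizeWaits(sampleRows)) {
  console.log(event, stats.backends, stats.maxAgeMs);
}
```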

Status changed to Awaiting Railway Response Railway • 21 days ago

sam-a

EMPLOYEE

21 days ago

Apologies for the earlier response - your analysis was correct. This is not an application-level issue.

We've confirmed that the host serving your Postgres volume has degraded storage I/O, which directly explains the 12-80s checkpoint stalls and the WALSync/WALWrite waits you captured. The host has already been taken out of rotation for new workloads.

To resolve this, redeploy your Postgres service. This will migrate your volume to a healthy host. Your volume is small (~119 MB), so the migration should be quick. You can trigger a redeploy from the service's three-dot menu on the project canvas. There will be brief downtime during the migration since the volume needs to detach from the old host and reattach on the new one.

Status changed to Awaiting User Response Railway • 21 days ago

Status changed to Solved sam-a • 21 days ago

onlywhisky

HOBBY OP

20 days ago

Thanks a lot for your response.

I have redeployed Postgres, but unfortunately I still see WAL waits:

During a slow OAuth login on 2026-06-03 around 17:49 UTC, /api/auth/google/callback took 24.3s. While it was still in flight, pg_stat_activity showed active Postgres backends waiting on WAL events.

The pg_stat_activity entries below are snapshots, so query_age is the age at capture time; the final Postgres storage write actually took longer.


event_time=2026-06-03T17:49:22.503Z
request_path=/api/auth/google/callback
request_completed_duration_ms=24313

query=INSERT INTO "public"."oauth_auth_codes" (...) VALUES (...) RETURNING ...

pid=269 state=active wait_event_type=IO wait_event=WalSync query_age=00:00:09.393606 query=COMMIT
pid=270 state=active wait_event_type=LWLock wait_event=WALWrite query_age=00:00:04.873646 query=INSERT INTO "public"."oauth_auth_codes" (...) VALUES (...) RETURNING ...
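To make the "age at capture time" point concrete, here is a small sketch that converts the `query_age` strings from the snapshot into milliseconds, so they can be compared with the request duration. The `parseQueryAgeMs` helper is hypothetical and assumes the plain `HH:MM:SS.ffffff` interval text shown above; it does not handle intervals that include days.

```typescript
// Convert a Postgres interval rendered as HH:MM:SS[.ffffff] to whole milliseconds.
function parseQueryAgeMs(age: string): number {
  const match = /^(\d+):(\d{2}):(\d{2}(?:\.\d+)?)$/.exec(age.trim());
  if (!match) throw new Error(`unsupported interval format: ${age}`);
  const [, hours, minutes, seconds] = match;
  const totalSeconds =
    Number(hours) * 3600 + Number(minutes) * 60 + Number(seconds);
  return Math.round(totalSeconds * 1000);
}

// Values copied from the snapshot in this post.
const requestMs = 24313;
const walSyncAgeMs = parseQueryAgeMs("00:00:09.393606");
const walWriteAgeMs = parseQueryAgeMs("00:00:04.873646");

// Each age is a lower bound: the statement was still running when sampled.
console.log({ walSyncAgeMs, walWriteAgeMs, requestMs });
```

Because each statement was still running when sampled, these ages only bound the real wait from below. The COMMIT had already been waiting for more than a third of the 24.3s request at that point.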

Status changed to Awaiting Railway Response Railway • 20 days ago

jake

EMPLOYEE

19 days ago

Could you please link the service?

Status changed to Awaiting User Response Railway • 19 days ago

onlywhisky

HOBBY OP

18 days ago

Here is the problematic database:

https://railway.com/project/d3083818-6a08-4dd3-8a25-cd037ae2c502/service/43355339-9e97-4344-8239-adedd5e08d57/database?environmentId=d4817046-75a2-49e0-bd78-9a4d49325b45

FYI, I created a new database alongside the old one and migrated all the data, and the new database shows no sign of the problem. So clearly the old host is not doing well. Since I have migrated, I'm ready to delete the old database, but I kept it in case Railway wants to investigate the root cause and prevent recurrences.

Status changed to Awaiting Railway Response Railway • 18 days ago

dizzydes90

EMPLOYEE

18 days ago

Glad the new database is working well. That confirms the issue was the underlying host, which we've already identified and cordoned. We have all the data we need from the old instance, so feel free to delete it whenever you're ready.

Status changed to Awaiting User Response Railway • 18 days ago

Status changed to Solved dizzydes90 • 18 days ago
