Two Postgres services crash-looping with catatonit: failed to exec pid1 (suspected May 19 outage fallout)

docrocket-mad

PRO · OP

a month ago

Two of my production Postgres services have been crash-looping continuously since the edge network outage on May 19, 2026, at around 22:22 UTC (the incident on your status page). Both fail with identical symptoms but on different bind mounts, so this looks like a systemic problem with volume reattachment rather than a single corrupted volume.

Affected services:

| Service | Bind mount UUID |
| --- | --- |
| Site-OS Postgres | 56c4d747-a5d4-48c9-8fa6-6160346e79aa |
| Sandhopper-OS Postgres | 7316d24d-d101-44b8-a9af-eeac6785bac6 |

Symptom (identical on both, repeating every few seconds):

Mounting volume on: /var/lib/containers/railwayapp/bind-mounts//vol_

ERROR (catatonit:2): failed to exec pid1: No such file or directory

The container can't even reach the Postgres binary to start it. Healthchecks never get a chance to run.
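
For anyone triaging the same failure, here is a minimal Python sketch of one way to confirm the crash-loop signature in exported deploy logs. The function name `count_pid1_exec_failures` and the sample log text are hypothetical; only the error string comes from the output above, and the assumption that the number after `catatonit:` may vary is mine.

```python
import re

# Error signature copied from the log output above.
# Assumption: the number after "catatonit:" may differ between runs, so it is matched loosely.
PID1_EXEC_FAILURE = re.compile(
    r"ERROR \(catatonit:\d+\): failed to exec pid1: No such file or directory"
)


def count_pid1_exec_failures(log_text: str) -> int:
    """Return how many log lines carry the catatonit pid1 exec failure."""
    return sum(
        1 for line in log_text.splitlines() if PID1_EXEC_FAILURE.search(line)
    )


# Hypothetical sample: two restart cycles, pasted from the symptom above.
sample = """Mounting volume on: /var/lib/containers/railwayapp/bind-mounts//vol_
ERROR (catatonit:2): failed to exec pid1: No such file or directory
Mounting volume on: /var/lib/containers/railwayapp/bind-mounts//vol_
ERROR (catatonit:2): failed to exec pid1: No such file or directory
"""

print(count_pid1_exec_failures(sample))
```

A count that keeps growing across restarts, with no Postgres startup lines in between, would be consistent with the init process failing before the database binary is ever executed.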

The same UUID, 56c4d747-..., was also affecting my LAB-OS MCP application service. I was able to resolve that one by adding a service-scoped railway.json (a config issue on my end). These two Postgres services, however, have nothing on the application side that could fix this; the binary isn't accessible to the container init.

What I'm asking for:

Reattach or restore the bind mounts for both services. The data is presumably intact on the underlying storage; the issue is the volume mount handoff.

If the data on either volume is unrecoverable, please let me know ASAP — both DBs are pre-production but I'd want to confirm before I plan a reseed.

If this is a wider regional issue I should know about, please flag it so I can monitor my other services.

Solved

2 Replies

Status changed to Awaiting Railway Response Railway • about 1 month ago

sam-a

EMPLOYEE

a month ago

Apologies for this canned message but in an effort to help all our customers get back up and running, we are sending this bulk message. As you may know, we had a major interruption to our services yesterday. We've published a post-mortem if you'd like more information on the incident. It describes what happened and what we are doing to prevent it in the future. We are deeply sorry for the impact that it has had on you.

It is taking some time to bring everything back up, but we are working on it as fast as we can. In general, a redeployment should fix most service issues. Due to the volume of customers redeploying right now, builds and deploys may take longer than normal to process.

You can track recovery status here: https://status.railway.com/incident/KVZ1Z8GY

If you are still having other issues that might be related to the incident you can read more here: https://station.railway.com/community/road-to-recovery-post-gcp-outage-builds-d362e48c

Feel free to respond if your question has not been addressed.

Status changed to Awaiting User Response Railway • about 1 month ago

Railway

BOT

a month ago

This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!

Status changed to Solved Railway • 28 days ago
