24 days ago
Production Postgres service atehra-db has been in a restart loop since ~22:19 UTC on 2026-05-19. Every automatic restart and every manual Restart from the dashboard fails with the same container init error before Postgres can start. The dependent API service is also down via cascade (P1001).
— Project info
Project: atehra
Project ID: 7c2dbf4e-6dee-47de-81f4-5d4833748baa
Environment: production
Environment ID: fe8f7634-7167-4cf8-8337-080162c76a26
Service: atehra-db (Postgres) — postgres-production-3c590…
Volume: postgres-volume (vol_hihlyoerch3phuhw)
Region: US East
Plan: Hobby
— Deploy log (atehra-db), repeating indefinitely:
Mounting volume on: /var/lib/containers/railwayapp/bind-mounts/a88dfa0e-cd02-4a92-b368-d3fdbef9aaf5/vol_hihlyoerch3phuhw
ERROR (catatonit:2): failed to exec pid1: No such file or directory
Postgres never reaches its own startup output. No Postgres logs after the crash — only the catatonit error in a loop. The dashboard occasionally flashes "Deployment successful" for a few seconds before crashing again.
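For anyone triaging the same symptom from an exported deploy log, here is a minimal sketch that separates "container init failed" from "Postgres actually started". The regex comes from the error line quoted above; the function name and return labels are hypothetical, and the readiness string assumes the standard Postgres startup message appears in the log.

```python
import re

# Pattern taken from the catatonit error line quoted in this thread.
INIT_FAILURE = re.compile(r"ERROR \(catatonit:\d+\): failed to exec pid1")

# Assumption: a healthy Postgres start logs this standard readiness message.
POSTGRES_READY = "database system is ready to accept connections"


def classify_deploy_log(lines):
    """Return 'postgres-started', 'init-failure', or 'unknown' for a log excerpt."""
    saw_init_failure = False
    for line in lines:
        if POSTGRES_READY in line:
            # Postgres got far enough to accept connections.
            return "postgres-started"
        if INIT_FAILURE.search(line):
            saw_init_failure = True
    return "init-failure" if saw_init_failure else "unknown"
```

An excerpt that only ever yields "init-failure" suggests the container dies before the database process runs, which is what the log above shows.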
— Cascade impact
- atehra-api (api.atehra.com): crash-loops with
Error: P1001: Can't reach database server at monorail.proxy.rlwy.net:17295
until restartPolicyMaxRetries is exhausted.
- atehra-web, atehra-worker, atehra-mcp: technically online but degraded (any DB-backed request fails).
- Redis: unaffected.
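A side note on the cascade: one way to keep a dependent service from burning through its restart budget is to wait for the database port before starting. This is a rough sketch, not something Railway provides; the function name, retry counts, and delays are all assumptions, and it only checks TCP reachability, not whether Postgres accepts logins.

```python
import socket
import time


def wait_for_tcp(host, port, attempts=5, base_delay=1.0, timeout=3.0, sleep=time.sleep):
    """Try to open a TCP connection, backing off exponentially between attempts.

    Returns True as soon as one connection succeeds, False if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            # Covers refused connections, timeouts, and DNS failures.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    return False
```

Calling `wait_for_tcp("monorail.proxy.rlwy.net", 17295)` before the API boots would delay startup instead of crashing on the first P1001, although it cannot help while the database itself never comes up.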
— Timeline (UTC)
- 2026-05-19 22:17–22:19: last healthy Postgres activity (normal SSL EOFs from clients disconnecting, then a successful checkpoint at 22:19:19).
- Immediately after: catatonit pid1 errors begin. No further Postgres output of any kind.
- 2026-05-20 ~11:30: manual Restart from dashboard. Deployment marked "successful" briefly, then crashes back into the same loop. Same for atehra-api restart.
- Ongoing.
This appears related to the active "Builds are slow to progress" incident banner shown in the dashboard.
— Constraints
Production data is in postgres-volume. Please DO NOT recreate or wipe the volume; data preservation is the priority. We have not yet confirmed the latest backup state; happy to help verify that on your side.
— What I need
Help getting the Postgres container to start again without losing the volume. If the issue is with the underlying host/image, please migrate the service to a healthy host so it can mount the existing volume and resume. Happy to provide any further IDs, logs, or temporary access if useful.
This is a full production outage. Any priority you can give it is greatly appreciated.
4 Replies
Status changed to Awaiting Railway Response Railway • 24 days ago
24 days ago
Same thing here. It also seems that this constant restart loop burned through all my credits. At least that's what the dashboard says (sometimes), but I can't load my usage page to check. The whole website is still very laggy and slow.
24 days ago
sam-a
Apologies for this canned message but in an effort to help all our customers get back up and running, we are sending this bulk message. As you may know, we had a major interruption to our services yesterday. [We've published a post-mortem if you'd like more information on the incident](https://blog.railway.com/p/incident-report-may-19-2026-gcp-account-outage). It describes what happened and what we are doing to prevent it in the future. We are deeply sorry for the impact that it has had on you.
It is taking some time to bring everything back up, but we are working on it as fast as we can. In general, a redeployment should fix most service issues. Due to the volume of customers redeploying right now, builds and deploys may take longer than normal to process.
You can track recovery status here: https://status.railway.com/incident/KVZ1Z8GY
If you are still having other issues that might be related to the incident you can read more here: https://station.railway.com/community/road-to-recovery-post-gcp-outage-builds-d362e48c
Feel free to respond if your question has not been addressed.
Status changed to Awaiting User Response Railway • 24 days ago
24 days ago
Thanks — I read the post-mortem and the road-to-recovery thread.
In our case redeploying does NOT recover the service. The Postgres container has been crash-looping for over 10 hours with the same low-level container init error on every restart and every manual redeploy:
Mounting volume on: /var/lib/containers/railwayapp/bind-mounts/a88dfa0e-cd02-4a92-b368-d3fdbef9aaf5/vol_hihlyoerch3phuhw
ERROR (catatonit:2): failed to exec pid1: No such file or directory
Postgres itself never starts; only the catatonit init error repeats. The dashboard occasionally flashes "Deployment successful" for a few seconds and then drops back into the same loop. This does not match the build/deploy slowness described in the recovery thread; it looks like either the volume mount or the underlying container image on the host this service is currently allocated to is broken at the platform layer.
Details:
- Project: atehra (7c2dbf4e-6dee-47de-81f4-5d4833748baa)
- Environment: production (fe8f7634-7167-4cf8-8337-080162c76a26)
- Service: atehra-db (Postgres)
- Volume: postgres-volume (vol_hihlyoerch3phuhw)
- Region: US East
- Down since: ~22:19 UTC 2026-05-19
To protect data we have NOT recreated/detached the volume and have NOT pointed atehra-api at an alternative database.
Could someone migrate this service to a healthy host so it can remount the existing postgres-volume? We strongly prefer recovering the volume over restoring from a (potentially stale) backup.
Happy to provide any further logs, IDs, or temporary access. Thank you.
Status changed to Awaiting Railway Response Railway • 24 days ago
23 days ago
Hey, sorry about the long downtime and the canned responses earlier. Your Postgres is still crash-looping with the catatonit pid1 error, which means the container image that was pulled during the outage is broken.
The fix that's worked for other users hitting this same error: go to your atehra-db service Settings, find the Postgres version, and update it (e.g. if you're on 16, change to 16.8 or 17). This forces a fresh image pull while keeping your existing volume and data intact. Once Postgres is back, redeploy atehra-api and it should reconnect.
Your data is safe on the volume. This only changes the container image, not the volume.
Let us know if that works or if you run into anything else.
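One thing worth checking before picking a version: a Postgres data directory is tied to the major version that created it, so a minor bump (16 to 16.8) can reuse the volume as-is, while a major change (16 to 17) would normally need pg_upgrade or a dump and restore first. A minimal sketch of that check, assuming Postgres 10 or later (where the PG_VERSION file in the data directory holds just the major number); the function name is hypothetical and the data directory path on the volume is not something this thread confirms.

```python
from pathlib import Path


def same_major_version(data_dir, target_version):
    """Compare the major version recorded in PG_VERSION with a target version string.

    Assumes Postgres 10+, where PG_VERSION contains only the major number, e.g. "16".
    """
    on_disk = Path(data_dir, "PG_VERSION").read_text().strip()
    target_major = target_version.split(".")[0]
    return on_disk == target_major
```

If this returns False for the version being selected, the new image would most likely refuse to start on the existing data directory, so staying within the same major version looks like the safer first attempt.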
Status changed to Awaiting User Response Railway • 23 days ago
16 days ago
This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!
Status changed to Solved Railway • 16 days ago