Managed Postgres atehra-db in restart loop: catatonit failed to exec pid1 — production down
p0ssato
HOBBY · OP

24 days ago

Production Postgres service atehra-db has been in a restart loop since ~22:19 UTC on 2026-05-19. Every automatic restart and every manual Restart from the dashboard fails with the same container init error before Postgres can start. The dependent API service is also down via cascade (P1001).

— Project info

Project: atehra

Project ID: 7c2dbf4e-6dee-47de-81f4-5d4833748baa

Environment: production

Environment ID: fe8f7634-7167-4cf8-8337-080162c76a26

Service: atehra-db (Postgres) — postgres-production-3c590…

Volume: postgres-volume (vol_hihlyoerch3phuhw)

Region: US East

Plan: Hobby

— Deploy log (atehra-db), repeating indefinitely:

Mounting volume on: /var/lib/containers/railwayapp/bind-mounts/a88dfa0e-cd02-4a92-b368-d3fdbef9aaf5/vol_hihlyoerch3phuhw

ERROR (catatonit:2): failed to exec pid1: No such file or directory

Postgres never reaches its own startup output. No Postgres logs after the crash — only the catatonit error in a loop. The dashboard occasionally flashes "Deployment successful" for a few seconds before crashing again.
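For anyone triaging the same loop, here is a minimal sketch of how the log pattern above could be told apart from a normal Postgres start. The function name and the regex are my own illustration, and it assumes a healthy start would print the usual Postgres "ready to accept connections" line in the deploy log; it is not a Railway tool.

```python
import re

# Pattern based on the error text in this thread; the regex itself is an assumption.
CATATONIT_EXEC_FAILURE = re.compile(r"ERROR \(catatonit:\d+\): failed to exec pid1")

# Standard Postgres startup message; assumed to show up in the deploy log on a healthy start.
POSTGRES_READY = "database system is ready to accept connections"


def classify_deploy_log(lines):
    """Return 'postgres_started', 'init_failure', or 'unknown' for a list of log lines."""
    saw_init_failure = False
    for line in lines:
        if POSTGRES_READY in line:
            # Postgres got far enough to accept connections at some point.
            return "postgres_started"
        if CATATONIT_EXEC_FAILURE.search(line):
            saw_init_failure = True
    # No ready message anywhere: report the init failure if one was seen.
    return "init_failure" if saw_init_failure else "unknown"
```

Fed the two lines from the deploy log above, this reports `init_failure`, which matches what we see: the container init step fails before Postgres runs at all.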

— Cascade impact

  • atehra-api (api.atehra.com): crash-loops with Error: P1001: Can't reach database server at monorail.proxy.rlwy.net:17295 until restartPolicyMaxRetries exhausts.
  • atehra-web, atehra-worker, atehra-mcp: technically online but degraded (any DB-backed request fails).
  • Redis: unaffected.
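To confirm that the P1001 on the API side is plain unreachability and not a credentials or TLS problem, a raw TCP check against the proxy endpoint is enough. This is a minimal sketch using only the Python standard library; the helper name `can_reach` and the 3-second timeout are my own choices.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Calling `can_reach("monorail.proxy.rlwy.net", 17295)` from a machine with outbound access would show whether anything is accepting connections on the database's public endpoint. It says nothing about whether Postgres itself is healthy behind the proxy.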

— Timeline (UTC)

  • 2026-05-19 22:17–22:19: last healthy Postgres activity (normal SSL EOFs from clients disconnecting, then a successful checkpoint at 22:19:19).
  • Immediately after: catatonit pid1 errors begin. No further Postgres output of any kind.
  • 2026-05-20 ~11:30: manual Restart from dashboard. Deployment marked "successful" briefly, then crashes back into the same loop. Same for atehra-api restart.
  • Ongoing.

This appears related to the active "Builds are slow to progress" incident banner shown in the dashboard.

— Constraints

Production data is in postgres-volume. Please DO NOT recreate the volume or wipe it; data preservation is the priority. We have not yet confirmed the latest backup state; happy to help verify that on your side.

— What I need

Help getting the Postgres container to start again without losing the volume. If the issue is with the underlying host/image, please migrate the service to a healthy host so it can mount the existing volume and resume. Happy to provide any further IDs, logs, or temporary access if useful.

This is a full production outage. Any priority you can give it is greatly appreciated.

Solved

4 Replies

Status changed to Awaiting Railway Response Railway 24 days ago


benkoivor96
HOBBY

24 days ago

Same thing here. It also seems that this constant restart loop burned all my credits; at least that's what it says (sometimes), but I can't load my usage page to check. The whole website is still laggy and slow.


sam-a
EMPLOYEE

24 days ago

Apologies for this canned message but in an effort to help all our customers get back up and running, we are sending this bulk message. As you may know, we had a major interruption to our services yesterday. We've published a post-mortem if you'd like more information on the incident: https://blog.railway.com/p/incident-report-may-19-2026-gcp-account-outage. It describes what happened and what we are doing to prevent it in the future. We are deeply sorry for the impact that it has had on you.

It is taking some time to bring everything back up, but we are working on it as fast as we can. In general, a redeployment should fix most service issues. Due to the volume of customers redeploying right now, builds and deploys may take longer than normal to process.

You can track recovery status here: https://status.railway.com/incident/KVZ1Z8GY

If you are still having other issues that might be related to the incident you can read more here: https://station.railway.com/community/road-to-recovery-post-gcp-outage-builds-d362e48c

Feel free to respond if your question has not been addressed.


Status changed to Awaiting User Response Railway 24 days ago


p0ssato
HOBBY · OP

24 days ago

Thanks — I read the post-mortem and the road-to-recovery thread.

In our case redeploying does NOT recover the service. The Postgres container has been crash-looping for over 10 hours with the same low-level container init error on every restart and every manual redeploy:

Mounting volume on: /var/lib/containers/railwayapp/bind-mounts/a88dfa0e-cd02-4a92-b368-d3fdbef9aaf5/vol_hihlyoerch3phuhw

ERROR (catatonit:2): failed to exec pid1: No such file or directory

Postgres itself never starts; only the catatonit init error repeats. The dashboard occasionally flashes "Deployment successful" for a few seconds and then drops back into the same loop. This does not match the build/deploy slowness described in the recovery thread. It looks like either the volume mount or the underlying container image on the host this service is currently allocated to is broken at the platform layer.

Details:

  • Project: atehra (7c2dbf4e-6dee-47de-81f4-5d4833748baa)
  • Environment: production (fe8f7634-7167-4cf8-8337-080162c76a26)
  • Service: atehra-db (Postgres)
  • Volume: postgres-volume (vol_hihlyoerch3phuhw)
  • Region: US East
  • Down since: ~22:19 UTC 2026-05-19

To protect data we have NOT recreated/detached the volume and have NOT pointed atehra-api at an alternative database.

Could someone migrate this service to a healthy host so it can remount the existing postgres-volume? We strongly prefer recovering the volume over restoring from a (potentially stale) backup.

Happy to provide any further logs, IDs, or temporary access. Thank you.


Status changed to Awaiting Railway Response Railway 24 days ago


chandrika
EMPLOYEE

23 days ago

Hey, sorry about the long downtime and the canned responses earlier. Your Postgres is still crash-looping with the catatonit pid1 error, which means the container image that was pulled during the outage is broken.

The fix that's worked for other users hitting this same error: go to your atehra-db service Settings, find the Postgres version, and update it (e.g. if you're on 16, change to 16.8 or 17). This forces a fresh image pull while keeping your existing volume and data intact. Once Postgres is back, redeploy atehra-api and it should reconnect.

Your data is safe on the volume; this change only swaps the container image and does not touch the volume.

Let us know if that works or if you run into anything else.


Status changed to Awaiting User Response Railway 23 days ago


Railway
BOT

16 days ago

This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!

Status changed to Solved Railway 16 days ago

