Postgres service crash-looping with catatonit pid1 error
jmaie31882
HOBBY
OP

25 days ago

Hi team,

My Postgres service has been crash-looping since approximately 08:56 GMT+1 on May 20 2026. Symptoms appear identical to the wave of threads posted in the last 15-20 minutes following the pinned "GCP Suspension Outage: May 19th 2026" post (e.g. "Postgres database fails to start - catatonit pid1 error - data recovery needed", "Postgres service crashes immediately: failed to exec pid1", "Production Postgres WAL corrupted after May 19 GCP suspension"). I believe my service is caught up in the same incident.

Description of the issue:

  • Postgres deployment ran fine for 2 weeks, then began crash-looping today around 08:56 GMT+1
  • Each container start fails within seconds at the init layer
  • Downstream Next.js app (assetflow) cannot connect and returns Prisma P1001 "Can't reach database server at postgres.railway.internal:5432"
  • Free plan, so no Pro-tier automated backups exist on the volume
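
For anyone triaging a similar P1001, a quick TCP reachability check can help separate "the database process is down" from "the app is misconfigured". The following is only a minimal sketch: the function name `can_reach` and the 3-second timeout are illustrative choices, not part of Prisma or Railway tooling, and the `postgres.railway.internal` hostname is expected to resolve only from inside the project's private network.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    This only proves that something is listening on the port; it does not
    authenticate or speak the Postgres protocol.
    """
    try:
        # create_connection resolves the hostname and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts alike.
        return False
```

Calling `can_reach("postgres.railway.internal", 5432)` from the app's environment would be expected to return `False` while the Postgres container is crash-looping, since no server process is listening.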

Error messages (Postgres deploy logs, tight loop):

ERROR (catatonit:2): failed to exec pid1: No such file or directory

Mounting volume on: /var/lib/containers/railwayapp/bind-mounts/af5f7e1e-9f76-4d40-ac7c-12052770ba45/vol_mlb1fd7em4gay133

ERROR (catatonit:2): failed to exec pid1: No such file or directory

[pattern repeats every ~1 second indefinitely]

This indicates catatonit cannot exec the container entrypoint, so Postgres itself never starts and never reads the volume. Given the broader incident, my read is that the underlying issue is volume / WAL state inherited from yesterday's GCP suspension, not a fault in my project's configuration.
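
To make that reading of the logs concrete, here is a rough sketch that summarizes a pasted deploy log: it counts the catatonit pid1 failures, collects the mount paths, and checks whether any line suggests Postgres itself ever started. The regular expressions are modeled only on the log lines quoted in this post (not on any official Railway log schema), and treating "database system" as a sign of Postgres startup is a heuristic assumption.

```python
import re

# Patterns modeled on the lines quoted in this thread (assumed format).
PID1_FAILURE = re.compile(
    r"ERROR \(catatonit:\d+\): failed to exec pid1: (?P<reason>.+)"
)
MOUNT_LINE = re.compile(r"Mounting volume on: (?P<path>\S+)")


def summarize_deploy_log(log_text: str) -> dict:
    """Summarize init-layer failures found in a deploy log."""
    failures = 0
    reasons = set()
    mount_paths = set()
    postgres_started = False

    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        failure = PID1_FAILURE.search(line)
        if failure:
            failures += 1
            reasons.add(failure.group("reason"))
            continue
        mount = MOUNT_LINE.search(line)
        if mount:
            mount_paths.add(mount.group("path"))
            continue
        # Heuristic (assumption): Postgres startup messages typically
        # mention "database system", e.g. "... is ready to accept connections".
        if "database system" in line:
            postgres_started = True

    return {
        "pid1_failures": failures,
        "reasons": sorted(reasons),
        "mount_paths": sorted(mount_paths),
        "postgres_started": postgres_started,
    }
```

A summary with many `pid1_failures` and `postgres_started` set to `False` is consistent with the failure happening before the database process runs, which is the situation described above.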

What I've already tried:

  1. Restart deployment — crashed within 7 seconds with the same catatonit error
  2. Redeploy (fresh image pull) — same result, same error

I have NOT touched the image tag or the volume because I have no backups and don't want to risk further data loss.

Service details:

  • Project: Asset Flow
  • Project ID: 7c852cfe-af90-4023-87b0-b62c08b47a9c
  • Environment: production (92dc88b2-017c-4e9f-bd3d-9d7153546d90)
  • Postgres service ID: 5380526f-654a-408a-9f35-3773fd0aed3c
  • Latest crashed Postgres deployment: a677ca4a
  • Image: ghcr.io/railwayapp-templates/postgres-ssl:18
  • Volume: postgres-volume (contains my only copy of production data)
  • Public TCP proxy: switchyard.proxy.rlwy.net:45069 → 5432
  • Downstream affected service: assetflow (assetflow.scape.com), also Crashed

Ask:

  • Could the team confirm this is part of the May 19 GCP outage cascade?
  • Is there a recovery path that preserves the data on postgres-volume (e.g. pg_resetwal, manual mount, or a temporary recovery container)?
  • I'm happy to follow any instructions or grant access as needed — please advise on next steps.

Thanks very much.

Solved

2 Replies

Status changed to Awaiting Railway Response Railway 25 days ago


josuetapianefrologo-cmd
PRO

25 days ago

Same issue here - my Postgres service has been in a crash loop since the outage. Logs show "catatonit failed to exec pid1: No such file or directory" repeatedly. Region: europe-west4-drams3a. Hobby plan. The container won't restart even after the GCP issue was resolved. Has anyone been able to recover their volume?


sam-a
EMPLOYEE

25 days ago

Apologies for this canned message, but in an effort to help all our customers get back up and running, we are sending this bulk message. As you may know, we had a major interruption to our services yesterday. We've published a post-mortem if you'd like more information on the incident. It describes what happened and what we are doing to prevent it in the future. We are deeply sorry for the impact that it has had on you.

It is taking some time to bring everything back up, but we are working on it as fast as we can. In general, a redeployment should fix most service issues. Due to the volume of customers redeploying right now, builds and deploys may take longer than normal to process.

You can track recovery status here: https://status.railway.com/incident/KVZ1Z8GY

If you are still having other issues that might be related to the incident, you can read more here: https://station.railway.com/community/road-to-recovery-post-gcp-outage-builds-d362e48c

Feel free to respond if your question has not been addressed.


Status changed to Awaiting User Response Railway 25 days ago


Railway
BOT

18 days ago

This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!

Status changed to Solved Railway 18 days ago

