Postgres container failing to exec pid1 since 05/19 GCP outage — catatonit error, volume mount succeeds

tj-nomoai

PRO OP

a month ago

Hi Railway team,

Following last night's GCP-account-block incident, my Postgres service has been in a crashed auto-restart loop. Volume mount succeeds on each attempt, but catatonit fails to exec pid1 with "No such file or directory." Manual Restart clicks reproduce the identical error. The data volume appears intact (mount succeeds cleanly).

Details:

Project: 3d40829a-ee7a-4a10-b831-c3c0e9b77e97
Environment: 5d1c2af2-494a-4c02-9b21-14d0903eab33 (production)
Service (Postgres): 0bee1202-eb7e-4114-b2b0-95b16de0a0d1
Deployment: 2a488f64-50c3-4d26-9241-8042f5173678
Replica: 3c35a3c7-55b1-4518-893e-a63a849d0e41
Region: europe-west4
Image: ghcr.io/railwayapp-templates/postgres-ssl:17

Recurring log pattern:

Mounting volume on: /var/lib/containers/railwayapp/bind-mounts/3c35a3c7-55b1-4518-893e-a63a849d0e41/vol_hnsfzv3e1ilg6hqx
ERROR (catatonit:2): failed to exec pid1: No such file or directory
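
For anyone triaging a similar loop from an exported log dump, the pattern above can be counted mechanically. The following is a minimal Python sketch, not Railway tooling: the function name `summarize_exec_failures` and the idea of feeding it a plain-text log export are assumptions made for illustration. Only the error line format comes from the log above.

```python
import re
from collections import Counter

# Matches the catatonit failure line quoted in the post above.
CATATONIT_ERROR = re.compile(
    r"ERROR \(catatonit:\d+\): failed to exec pid1: (?P<reason>.+)"
)


def summarize_exec_failures(log_text: str) -> Counter:
    """Count catatonit exec failures in a log dump, grouped by reason.

    Hypothetical helper: assumes the logs were exported as plain text,
    one log entry per line.
    """
    reasons = Counter()
    for line in log_text.splitlines():
        match = CATATONIT_ERROR.search(line)
        if match:
            reasons[match.group("reason").strip()] += 1
    return reasons
```

A result with a single reason and a count equal to the number of restart attempts would be consistent with the "identical error on every restart" observation.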

Timeline: Last successful Postgres activity in the logs was a healthy checkpoint at 22:11:05 UTC on May 19, roughly 9 minutes before the GCP incident started. Everything since has been the auto-restart loop above.

Hypothesis: Either (a) cached image corruption / incomplete re-pull tied to last night's outage, or (b) container runtime in europe-west4 still in a degraded state. Both would be below the GCP networking layer that the public status page references.
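
As background for the hypothesis above: "No such file or directory" at exec time is the ENOENT errno, which the kernel returns when the target executable (or the interpreter or loader it names) cannot be found. The sketch below only reproduces that class of error locally with a deliberately missing path. It does not show which of (a) or (b) happened on this replica. The helper `try_exec` and the path `/nonexistent/hypothetical-entrypoint` are made up for illustration.

```python
import errno
import subprocess


def try_exec(argv: list[str]) -> str:
    """Run argv and report whether the exec step itself failed."""
    try:
        subprocess.run(argv, check=False, capture_output=True)
    except FileNotFoundError as exc:
        # Raised when the executable cannot be found at exec time.
        # errno.errorcode maps the numeric errno to its symbolic name.
        return f"exec failed: {errno.errorcode[exc.errno]}"
    return "exec succeeded"


# Hypothetical path that is assumed not to exist on the local machine.
print(try_exec(["/nonexistent/hypothetical-entrypoint"]))
```

An image whose entrypoint binary is missing or truncated would produce this kind of failure, which is why an incomplete re-pull is a plausible candidate, though not the only one.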

Ask: Could engineering investigate the container runtime for this replica? Happy to provide more logs / context if helpful.

Side note for context: My application service (nomo-ai, https://nomo-ai-production.up.railway.app) is downstream of this Postgres and currently can't boot for the same reason. I'm not requesting action there, since it should self-resolve once Postgres is back.

Thanks,

TJ Ruff

Nomo AI

Solved

3 Replies

Status changed to Awaiting Railway Response Railway • about 1 month ago

mykal

EMPLOYEE

a month ago

Thanks for reaching out. We sincerely apologize for the service disruption.

We're seeing recovery in our API, builds, and deployments. I just redeployed the linked services and they all seem to have recovered. If other services are having issues, a redeploy should solve them as well. We'll publish a public postmortem covering what happened when we're fully recovered.

For all customers, we’ll publish a detailed postmortem outlining what happened and the steps we’re taking to prevent similar incidents in the future. For Enterprise customers, service credits are covered under our SLA and will be reviewed as part of our post-incident process.

Again, we're deeply sorry you were affected by this.

Please reach out if you're still having issues.

Status changed to Awaiting User Response Railway • about 1 month ago

tj-nomoai

PRO OP

a month ago

Confirmed back online: /health is returning 200, both Postgres and nomo-ai show Online in the dashboard, and the application is functioning. Thank you for the direct intervention, since you were apparently the one who got things back.
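
A check like the /health confirmation above could be scripted roughly as follows. This is a sketch under assumptions: the helper names `fetch_status` and `wait_until_healthy`, the retry count, and the delay are illustrative, and the `/health` path is taken from this post rather than from any Railway convention.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (for example 503).
        return exc.code
    except OSError:
        # Covers URLError, timeouts, and refused connections.
        return 0


def wait_until_healthy(
    url: str,
    attempts: int = 5,
    delay: float = 2.0,
    fetch: Callable[[str], int] = fetch_status,
) -> bool:
    """Poll url until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling `wait_until_healthy("https://nomo-ai-production.up.railway.app/health")` would be one way to use it, assuming the endpoint lives at that path on the app URL given earlier in the thread.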

For context: I'm on the Pro plan, and this was a 7h 43min outage on production. Would Railway consider any goodwill service credit for the downtime, even though Pro isn't covered by the Enterprise SLA? Happy to defer to whatever your post-incident review concludes is appropriate. And all the same, thank YOU for helping us get back.

Status changed to Awaiting Railway Response Railway • about 1 month ago

brody

EMPLOYEE

a month ago

Glad everything is back and healthy. Regarding credits, the Pro plan doesn't include an SLA, so formal service credits aren't available.

Status changed to Awaiting User Response Railway • about 1 month ago

Status changed to Solved mykal • about 1 month ago
