21 days ago
Hi Railway team,
Following last night's GCP-account-block incident, my Postgres service has been in a crashed auto-restart loop. Volume
mount succeeds on each attempt, but catatonit fails to exec pid1 with "No such file or directory." Manual Restart
clicks reproduce the identical error. The data volume appears intact (mount succeeds cleanly).
Details:
- Project: 3d40829a-ee7a-4a10-b831-c3c0e9b77e97
- Environment: 5d1c2af2-494a-4c02-9b21-14d0903eab33 (production)
- Service (Postgres): 0bee1202-eb7e-4114-b2b0-95b16de0a0d1
- Deployment: 2a488f64-50c3-4d26-9241-8042f5173678
- Replica: 3c35a3c7-55b1-4518-893e-a63a849d0e41
- Region: europe-west4
- Image: ghcr.io/railwayapp-templates/postgres-ssl:17
Recurring log pattern:
Mounting volume on: /var/lib/containers/railwayapp/bind-mounts/3c35a3c7-55b1-4518-893e-a63a849d0e41/vol_hnsfzv3e1ilg6hqx
ERROR (catatonit:2): failed to exec pid1: No such file or directory
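For anyone triaging a similar loop from exported logs, the pattern above (a volume mount followed by a pid1 exec failure) can be counted with a short script. This is only a rough sketch: the function name `count_failed_starts` and the idea of feeding it plain log lines are assumptions for illustration, not part of any Railway tooling, and the regexes are based solely on the two log lines quoted in this post.

```python
import re

# Patterns taken from the two log lines quoted in this thread (assumed stable).
MOUNT_RE = re.compile(r"^Mounting volume on:")
EXEC_FAIL_RE = re.compile(r"ERROR \(catatonit:\d+\): failed to exec pid1: ")


def count_failed_starts(lines):
    """Count restart attempts where a volume mount is followed by a pid1 exec failure.

    Hypothetical helper: `lines` is any iterable of raw log lines.
    """
    failures = 0
    mounted = False  # True once a mount line has been seen for the current attempt
    for line in lines:
        text = line.strip()
        if MOUNT_RE.search(text):
            mounted = True
        elif EXEC_FAIL_RE.search(text):
            if mounted:
                failures += 1
            mounted = False  # the attempt is over either way
    return failures


if __name__ == "__main__":
    sample = [
        "Mounting volume on:",
        "/var/lib/containers/railwayapp/bind-mounts/3c35a3c7-55b1-4518-893e-a63a849d0e41/vol_hnsfzv3e1ilg6hqx",
        "ERROR (catatonit:2): failed to exec pid1: No such file or directory",
    ] * 3
    print(count_failed_starts(sample))
```

A nonzero count with no Postgres startup lines in between would suggest the container never reaches the database entrypoint, which points at the image or runtime rather than the data volume.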
Timeline: Last successful Postgres activity in the logs was a healthy checkpoint at 22:11:05 UTC on May 19 — roughly 9
minutes before the GCP incident started. Everything since has been the auto-restart loop above.
Hypothesis: Either (a) cached image corruption / incomplete re-pull tied to last night's outage, or (b) container
runtime in europe-west4 still in a degraded state. Both would be below the GCP networking layer that the public status
page references.
Ask: Could engineering investigate the container runtime for this replica? Happy to provide more logs / context if
helpful.
Side note for context: My application service (nomo-ai, https://nomo-ai-production.up.railway.app) is downstream of
this Postgres and currently can't boot for the same reason — but I'm not requesting action there, it should
self-resolve once Postgres is back.
Thanks,
TJ Ruff
Nomo AI
3 Replies
Status changed to Awaiting Railway Response Railway • 21 days ago
21 days ago
Thanks for reaching out. We sincerely apologize for the service disruption.
We're seeing recovery in our API, builds, and deployments. I just redeployed the linked services and they all seem to have recovered. If other services are having issues, a redeploy should solve them as well.
Once we're fully recovered, we’ll publish a detailed public postmortem for all customers outlining what happened and the steps we’re taking to prevent similar incidents in the future. For Enterprise customers, service credits are covered under our SLA and will be reviewed as part of our post-incident process.
Again, we're deeply sorry you were affected by this.
Please reach out if you're still having issues.
Status changed to Awaiting User Response Railway • 21 days ago
21 days ago
Confirmed back online — /health returning 200, both Postgres and nomo-ai showing Online in the dashboard, application functioning. Thank you for the direct intervention, as you apparently were the one to get things back.
For context: I'm on the Pro plan, and this was a 7h 43min outage on production. Would Railway consider any goodwill service credit for the downtime, even though Pro isn't covered by the Enterprise SLA? Happy to defer to whatever your post-incident review concludes is appropriate. And all the same, thank YOU for helping us get back.
Status changed to Awaiting Railway Response Railway • 21 days ago
21 days ago
Glad everything is back and healthy. Regarding credits, the Pro plan doesn't include an SLA, so formal service credits aren't available.
Status changed to Awaiting User Response Railway • 21 days ago
Status changed to Solved mykal • 21 days ago
