
Postgres service stuck CRASHED after GCP outage — volume attached, pg_dump fails via proxy

slaouiss

HOBBY OP

a month ago

Following the GCP suspension outage of May 19th, my Postgres service has been in CRASHED state for ~6 hours and won't recover after restart attempts.

Project: discerning-flow
Environment: production
Service: Postgres
Region: EU West
Image: ghcr.io/railwayapp-templates/postgres-ssl:18
Volume: postgres-volume (attached, still present in UI)

What I've tried:

- Restart deployment from UI: container goes back to CRASHED immediately
- pg_dump via public proxy (monorail.proxy.rlwy.net:37808): connection accepted then closed immediately by server ("server closed the connection unexpectedly")

I have NOT attempted a redeploy with a different image tag yet, because I have no fresh backup and I'm worried about volume integrity given the reports of corrupted volumes in the incident thread.

My production chatbot (chatbot-pme service, depends on this DB) is also down as a consequence.

Could you check the volume state and advise on the safest recovery path? Happy to pin to a known-good minor tag (e.g. 17.9 per the incident thread) if you confirm the volume is intact.

Thanks.
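
For anyone reproducing the pg_dump symptom above ("connection accepted then closed"), a small probe may help tell apart "the proxy accepts the TCP connection but no Postgres backend answers" from "Postgres is answering". This is only a sketch: `probe_postgres` and its return labels are hypothetical names, and it assumes nothing beyond the standard PostgreSQL SSLRequest handshake (an 8-byte message answered with a single `S` or `N` byte).

```python
import socket
import struct

# PostgreSQL SSLRequest: Int32 length (8) followed by Int32 code 80877103.
SSL_REQUEST = struct.pack("!II", 8, 80877103)


def probe_postgres(host: str, port: int, timeout: float = 5.0) -> str:
    """Classify how a Postgres endpoint (or a proxy in front of it) responds.

    The labels returned here are illustrative, not part of any official tool.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "unreachable"  # refused, timed out, or DNS failure
    with sock:
        try:
            sock.sendall(SSL_REQUEST)
            reply = sock.recv(1)
        except socket.timeout:
            return "no-reply"  # connected, but nothing came back in time
        except OSError:
            return "closed"  # connection reset or broken pipe
    if reply == b"":
        return "closed"  # accepted, then closed without an answer
    if reply in (b"S", b"N"):
        return "postgres"  # a Postgres server answered the SSLRequest
    return "unexpected"
```

Calling `probe_postgres("monorail.proxy.rlwy.net", 37808)` against the endpoint from this post would be expected to report "closed" while the container is crash-looping, since that matches the pg_dump error, but that is an inference from the symptom rather than something verified here.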

Solved

2 Replies

Railway

BOT

a month ago


Status changed to Awaiting User Response Railway • about 1 month ago

Railway

Your Postgres service was running normally before the [May 19th outage](https://status.railway.com/incident/I23M92U0), completing checkpoints up through 22:09 UTC. After the outage, starting at around 04:59 UTC on May 20th, the container is crash-looping because the image entrypoint binary can't be found ("failed to exec pid1"), which is consistent with the image registry disruption during the incident. The volume is mounting successfully on every restart attempt, which is a positive sign for your data. Since the incident is now resolved, triggering a new deployment (not just a restart) should re-pull the image and resolve the crash loop. You can do this from the service settings by clicking "Redeploy" on the latest deployment's three-dot menu, or by redeploying with your current image tag (postgres-ssl:18).
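
The distinction drawn in this reply (the container never reaches Postgres, versus Postgres starting and reporting problems with its data files) can be sketched as a rough log triage. Only "failed to exec pid1" comes from this thread; the function name `triage`, the labels, and the data-level markers are assumptions chosen for illustration and are not an exhaustive or official list.

```python
# Marker lists are illustrative assumptions, not an official classification.
IMAGE_MARKERS = ("failed to exec pid1",)  # quoted in the reply above
DATA_MARKERS = (
    "could not locate a valid checkpoint record",  # Postgres startup PANIC
    "invalid page in block",                       # Postgres page-level error
)


def triage(log_lines):
    """Return a coarse guess at where a crash loop originates."""
    text = "\n".join(log_lines).lower()
    if any(marker in text for marker in DATA_MARKERS):
        # Postgres itself started and complained about its files:
        # stop and take a copy of the volume before changing anything.
        return "data"
    if any(marker in text for marker in IMAGE_MARKERS):
        # The entrypoint never ran, so Postgres never touched the volume:
        # a fresh deployment that re-pulls the image is the first thing to try.
        return "image"
    return "unknown"
```

Data-level markers are checked first so that a log containing both kinds of message is treated with the more cautious label.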

slaouiss

HOBBY OP

a month ago

Confirmed: Redeploy (not Restart) on postgres-ssl:18 brought the service back online with data intact. Took ~30s. Volume was fine throughout. Thank you for the clear guidance.

Status changed to Awaiting Railway Response Railway • about 1 month ago

Status changed to Solved slaouiss • about 1 month ago
