Postgres service stuck CRASHED after GCP outage — volume attached, pg_dump fails via proxy
slaouiss
HOBBY OP

25 days ago

Following the GCP suspension outage of May 19th, my Postgres service has been in a CRASHED state for ~6 hours and won't recover after restart attempts.

Project: discerning-flow
Environment: production
Service: Postgres
Region: EU West
Image: ghcr.io/railwayapp-templates/postgres-ssl:18
Volume: postgres-volume (attached, still present in UI)

What I've tried:

  • Restart deployment from UI: container goes back to CRASHED immediately
  • pg_dump via public proxy (monorail.proxy.rlwy.net:37808): connection accepted, then closed immediately by the server ("server closed the connection unexpectedly")
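The second symptom (connection accepted, then dropped) can be told apart from a refused connection with a small probe. Below is a minimal sketch, assuming Python 3 and the standard library only. The function name `probe_postgres` and its return labels are hypothetical. It sends the Postgres protocol's SSLRequest message, reads the one-byte answer, and never authenticates or touches data. A proxy in front of a crashed backend may well accept the TCP connection and then drop it, which would show up here as "closed".

```python
import socket
import struct

# SSLRequest message: Int32 length (8) followed by Int32 request code 80877103.
SSL_REQUEST = struct.pack("!II", 8, 80877103)


def probe_postgres(host: str, port: int, timeout: float = 5.0) -> str:
    """Classify how an endpoint reacts to a bare Postgres SSLRequest.

    Returns "refused", "timeout", "unresolved", "closed", "error",
    "responding", or "unexpected". The labels are illustrative only.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(SSL_REQUEST)
            reply = sock.recv(1)
    except ConnectionRefusedError:
        return "refused"      # nothing is listening on that port
    except socket.timeout:
        return "timeout"      # no answer within the timeout
    except socket.gaierror:
        return "unresolved"   # host name did not resolve
    except ConnectionError:
        return "closed"       # accepted, then reset or aborted by the peer
    except OSError:
        return "error"        # any other socket-level failure
    if reply == b"":
        return "closed"       # peer closed cleanly without answering
    if reply in (b"S", b"N"):
        return "responding"   # a Postgres server answered the SSLRequest
    return "unexpected"
```

Calling `probe_postgres("monorail.proxy.rlwy.net", 37808)` against the endpoint from this post would be expected to return "closed" for the symptom described, though that is an inference from the error text, not something verified here.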

I have NOT attempted a redeploy with a different image tag yet, because I have no fresh backup and I'm worried about volume integrity given the reports of corrupted volumes in the incident thread.

My production chatbot (chatbot-pme service, depends on this DB) is also down as a consequence.

Could you check the volume state and advise on the safest recovery path? Happy to pin to a known-good minor tag (e.g. 17.9 per the incident thread) if you confirm the volume is intact.

Thanks.

Solved

2 Replies

Railway
BOT

25 days ago

Your Postgres service was running normally before the [May 19th outage](https://status.railway.com/incident/I23M92U0), completing checkpoints up through 22:09 UTC. After the outage, starting at around 04:59 UTC on May 20th, the container is crash-looping because the image entrypoint binary can't be found ("failed to exec pid1"), which is consistent with the image registry disruption during the incident. The volume is mounting successfully on every restart attempt, which is a positive sign for your data. Since the incident is now resolved, triggering a new deployment (not just a restart) should re-pull the image and resolve the crash loop. You can do this from the service settings by clicking "Redeploy" on the latest deployment's three-dot menu, or by redeploying with your current image tag (postgres-ssl:18).
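For anyone checking their own deploy logs for the same failure, here is a minimal sketch in Python that scans log lines for the marker quoted above. Only the string "failed to exec pid1" comes from this thread. The function name and the sample lines are hypothetical, and the real log format may differ. A match suggests the container never started its entrypoint, which points at the image, not at the database files.

```python
from typing import Iterable

# Marker quoted in the reply above; matched as a plain, case-insensitive substring.
EXEC_FAILURE_MARKER = "failed to exec pid1"


def saw_exec_failure(log_lines: Iterable[str]) -> bool:
    """Return True if any log line contains the exec-failure marker."""
    return any(EXEC_FAILURE_MARKER in line.lower() for line in log_lines)


# Hypothetical sample lines, for illustration only (not real Railway output).
sample_logs = [
    "Mounting volume on /var/lib/postgresql/data",
    "ERROR: failed to exec pid1: no such file or directory",
]
```

With the sample above, `saw_exec_failure(sample_logs)` evaluates to `True`, while a log containing only the volume-mount line would give `False`.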


Status changed to Awaiting User Response Railway 25 days ago



slaouiss
HOBBY OP

25 days ago

Confirmed: Redeploy (not Restart) on postgres-ssl:18 brought the service back online with data intact. Took ~30s. Volume was fine throughout. Thank you for the clear guidance.
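Since the original worry was having no fresh backup, a dump taken right after recovery may be worth scripting. This is a minimal sketch, assuming `pg_dump` is installed locally with a major version at least as new as the server's (18 here) and that the connection string is supplied by the caller. The function names and the output file name are hypothetical. Note that passing the URL on the command line can expose the password in the local process list, so treat this as illustrative only.

```python
import subprocess
from typing import List


def build_pg_dump_command(database_url: str, output_path: str) -> List[str]:
    """Build a pg_dump invocation that writes a custom-format archive."""
    return [
        "pg_dump",
        "--format=custom",           # compressed archive, restorable with pg_restore
        f"--file={output_path}",     # where the archive is written
        f"--dbname={database_url}",  # pg_dump accepts a connection URI here
    ]


def run_backup(database_url: str, output_path: str) -> None:
    """Run pg_dump and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_pg_dump_command(database_url, output_path), check=True)
```

A call such as `run_backup(url, "backup.dump")` would go through the public proxy like the earlier pg_dump attempt, so it only works once the service is back online.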


Status changed to Awaiting Railway Response Railway 25 days ago


Status changed to Solved slaouiss 25 days ago

