PostgreSQL service down — database unreachable, app returning 500 errors

tingmu99

PRO • OP

a month ago

My PostgreSQL service is in an unexpected state and unreachable. When I try to access it, the Railway dashboard shows "Attempting to connect to the database via SSH: Your application is not running or in an unexpected state".

My app service is running but cannot connect to the database, resulting in 500 errors for all users. The Railway logs show repeated ETIMEDOUT errors on port 5432:

[startup] Defensive table creation failed:

Failed to prune sessions: AggregateError [ETIMEDOUT]

Error: connect ETIMEDOUT 10.172.205.1:5432

The app starts successfully but cannot complete startup without the database connection.

Project: virtual-legal-receptionist

Project ID: 8a9afc93-1a60-4be7-ae48-4405774eb778

Service: PostgreSQL

Started: approximately 2026-05-27 around 22:00 UTC

This is a production application serving live users. Please advise on status and ETA for recovery.

Thank you.

$20 Bounty

7 Replies

Status changed to Awaiting Railway Response Railway • 28 days ago

Railway

BOT

a month ago

Your Postgres service was redeployed at approximately 04:57 UTC and is now showing a successful deployment status. If your app is still seeing ETIMEDOUT errors, you may need to redeploy the app service as well so it re-establishes the database connection. You can trigger a redeploy from the three-dot menu on the app service in your project dashboard. If the database itself is still unreachable after the redeploy, open the Postgres service, press Cmd+K (or Ctrl+K), and select "Redeploy source image" to pull a fresh image.

Status changed to Awaiting User Response Railway • 28 days ago

tingmu99

PRO • OP

a month ago

The issue is not resolved. Two separate problems persist:

Problem 1 — Volume mount crash loop (unaddressed):

ERROR (catatonit:2): failed to exec pid1: No such file or directory

Mounting volume on: .../vol_ol0riownzakybtxn

This volume has been failing to mount since yesterday evening and is causing my app container to crash repeatedly. This was never addressed in your previous response.

Problem 2 — Postgres IP keeps changing: My app is seeing ETIMEDOUT on three different internal IPs in the last hour:

10.172.205.1

10.205.100.53

10.207.98.247

The database appears to be migrating between hosts. My DATABASE_URL environment variable may need to be updated to reflect the current Postgres location, or the internal networking between my app service and Postgres service needs to be restored.
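One way to check whether DATABASE_URL pins a raw IP (which would go stale every time the database moves hosts) or a DNS hostname is sketched below. It assumes a Node.js app; `describeDbTarget` and the `DbTarget` shape are hypothetical names made up for illustration, and the fallback port of 5432 is simply the PostgreSQL default.

```typescript
import { isIP } from "node:net";

// Hypothetical result shape, for illustration only.
interface DbTarget {
  host: string;
  port: number;
  pinnedToIp: boolean; // true when the URL contains a literal IP instead of a hostname
}

// Parse a postgres connection URL and report where it points.
function describeDbTarget(databaseUrl: string): DbTarget {
  const url = new URL(databaseUrl);
  const host = url.hostname;
  // An empty port means the URL did not specify one; assume the PostgreSQL default.
  const port = url.port === "" ? 5432 : Number(url.port);
  // isIP returns 0 for anything that is not a valid IPv4/IPv6 literal.
  return { host, port, pinnedToIp: isIP(host) !== 0 };
}
```

If `pinnedToIp` comes back true for `process.env.DATABASE_URL`, the app is tied to one specific address. If it is false, the changing IPs in the logs are just what the hostname resolved to at each attempt, and updating the variable would not help.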

Project ID: 8a9afc93-1a60-4be7-ae48-4405774eb778

Both issues require action on your end. Please advise urgently.

Status changed to Awaiting Railway Response Railway • 28 days ago

Railway

BOT

a month ago

Your Postgres service is still crash-looping with a stale container image (the "failed to exec pid1" errors in our logs confirm this), and that is also the cause of the changing IPs and your app's ETIMEDOUT errors - the database keeps restarting across hosts because it cannot start. Your volume data is intact (224 MB used). To fix this, open the Postgres service, press Cmd+K (or Ctrl+K) to open the command palette, and select "Redeploy source image" - this re-pulls a fresh image, which a normal redeploy does not do. Once Postgres is healthy, redeploy your app service so it re-establishes the connection.
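As an alternative to redeploying the app by hand once the database is back, an app can retry its startup connection with a capped exponential backoff. The following is a minimal sketch under assumptions, not a Railway API: `backoffDelays`, `connectWithRetry`, and all parameter defaults are hypothetical, and `connect` stands in for whatever function the app uses to open its database connection.

```typescript
// Hypothetical helper: delay in ms before each retry, doubling up to a cap.
function backoffDelays(retries: number, baseMs: number, maxMs: number): number[] {
  const delays: number[] = [];
  for (let i = 0; i < retries; i++) {
    delays.push(Math.min(baseMs * Math.pow(2, i), maxMs));
  }
  return delays;
}

// Hypothetical wrapper: try once immediately, then retry after each backoff delay.
async function connectWithRetry<T>(
  connect: () => Promise<T>,
  retries: number = 5,
  baseMs: number = 500,
  maxMs: number = 8000,
): Promise<T> {
  let lastError: unknown;
  // A leading 0 makes the first attempt run without waiting.
  for (const delay of [0, ...backoffDelays(retries, baseMs, maxMs)]) {
    if (delay > 0) {
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
    try {
      return await connect();
    } catch (err) {
      lastError = err; // remember the failure (e.g. ETIMEDOUT) and try again
    }
  }
  // Every attempt failed; surface the most recent error to the caller.
  throw lastError;
}
```

This only helps when the database becomes reachable within the retry window. It does nothing for a Postgres container that is itself crash-looping.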

Status changed to Awaiting User Response Railway • 28 days ago


tingmu99

PRO • OP

a month ago

I have already tried "Redeploy source image" on the Postgres service and it did not resolve the issue. The crash loop continues.

This has now been down for over 18 hours. I need escalated support, not the same generic troubleshooting steps repeated.

Specifically I need someone to:

Look at the actual state of volume vol_ol0riownzakybtxn on your infrastructure

Explain why the Postgres container cannot exec pid1 after a source image redeploy

Either restore the volume to a working state or migrate my data to a new Postgres instance

Project ID: 8a9afc93-1a60-4be7-ae48-4405774eb778

This is a production application that has been down since approximately 22:00 UTC on May 27, 2026.

Status changed to Awaiting Railway Response Railway • 28 days ago

tingmu99

PRO • OP

a month ago

Subject: Container crash loop persists after 24+ hours — volume mount issue blocking production

I am writing to provide full context as this issue has been ongoing since approximately 22:00 UTC on May 27, 2026.

What happened:

My Postgres service crashed yesterday evening due to what appears to have been a Railway infrastructure incident. The app service simultaneously began crash-looping with ERROR (catatonit:2): failed to exec pid1: No such file or directory on volume vol_ol0riownzakybtxn. Both issues appeared to be related to Railway's infrastructure, not my code.

What I've tried per your instructions:

Redeployed the app service — did not resolve

Redeployed Postgres source image — did not resolve

You did not instruct me to do this, but I also created a brand-new Postgres service with a fresh volume, and that database is now healthy and connecting successfully. However, this did not resolve the outage.

Current state:

The database issue is resolved. My app now connects to the new Postgres successfully and bootstraps its schema correctly. I can see this in the startup logs:

✅ [db] Connected to PostgreSQL

[startup] Admin account seeded: admin@houji.ai

[startup] Defensive table check complete — base tables (firms, admins, firm_config...)

The remaining problem:

Despite the successful startup, Railway's container runtime kills the app approximately 7 seconds after it starts with SIGTERM. The catatonit volume mount crash loop on vol_ol0riownzakybtxn is still occurring on the app service. This volume appears to be incorrectly attached to my app service and is causing the container to crash before it can serve HTTP traffic.

The app starts correctly, connects to the database, completes initialization — and then Railway kills it. My code is correct. My database is correct. The problem is entirely in Railway's container runtime.

What I need:

Please inspect and remove or repair volume vol_ol0riownzakybtxn from my app service. I do not use volumes in my application, and this attachment appears to be erroneous.

Confirm why my app container is receiving SIGTERM 7 seconds after a successful startup

Project ID: 8a9afc93-1a60-4be7-ae48-4405774eb778

App service: virtual-legal-receptionist

Duration of outage: 24+ hours

This is a production legal services application. Please escalate to infrastructure engineering.

sam-a

EMPLOYEE

a month ago

There is no volume attached to your app service. Virtual-Legal-Receptionist runs on ephemeral storage, so there is nothing on it to remove or repair. The volume you referenced is attached to your original Postgres service, not the app.

Your app's logs show it connecting to the new Postgres successfully and then exiting again and again because a table it queries on startup, the session table, does not exist in the newly created database. The roughly 7-second restarts are the app crashing on that error, not our runtime stopping a healthy container. The new database you created starts empty, so the schema your app expects is not present in it yet.

This is on the application and database side rather than our infrastructure. I can point you to the community where folks can help you get the session table and the rest of your schema created in the new database.

Status changed to Awaiting User Response Railway • 27 days ago

Status changed to Open sam-a • 27 days ago

tingmu99

PRO • OP

a month ago

Thank you for the response, though it arrived after I had already spent 24+ hours troubleshooting, migrated my entire application to GCP Cloud Run + Cloud SQL, and cancelled my Railway subscription.

To clarify the timeline: the original Postgres service with volume vol_ol0riownzakybtxn never recovered and was never addressed by your team. The session table issue was a secondary problem that emerged when I created a new Postgres service as a workaround — and yes, that was on the application side, which I fixed myself.

The core issue was that a paid Pro customer with a production application down for 24+ hours received no infrastructure-level support. The bounty response directing me to the community was not appropriate for an infrastructure incident of this nature.

I've migrated to GCP and cancelled my subscription. I hope Railway improves its support response times for production incidents in the future.
