Production Postgres crashed and volume may need recovery

flamerevenge

HOBBYOP

a month ago

Our production PostgreSQL service on Railway has unexpectedly stopped starting up.

Service details:

Project: RIP-Tear-api

Environment: production

Service: Postgres

Source image: ghcr.io/railwayapp-templates/postgres-ssl:17

Current behavior:

The Postgres service is stuck in a crash loop.

Deploy logs show:

Mounting volume on: /var/lib/containers/railwayapp/bind-mounts/.../vol_...

ERROR (catatonit:2): failed to exec pid1: No such file or directory

The volume still exists and is attached to the service.

There are no backups available in the Backups tab.

Important context:

We previously saw some application-level SQL errors caused by invalid UUID input from our API, but those were normal query errors and should not prevent PostgreSQL from starting.

The current issue appears to happen before PostgreSQL itself starts, during container startup.

We do not have a custom start command configured for this Postgres service.

The service still points to the default Railway Postgres image and variables look default.

What we need help with:

Please help recover data from the existing volume.

If possible, please reattach or migrate the existing volume to a healthy PostgreSQL instance.

If this is a broken deployment/image issue, please advise the safest recovery path without losing the current volume data.

This is a production database, so we want to avoid any action that could destroy the existing volume or make recovery harder.

Solved

2 Replies

Railway

BOT

a month ago

The catatonit: failed to exec pid1 error is a known issue caused by a stale container image, typically after host-level disruptions, and does not indicate volume data loss. To resolve it, open your Postgres service, use the command palette (Cmd/Ctrl+K), and select "Redeploy source image" to re-pull a fresh image. A normal redeploy from the three-dot menu will not work because it reuses the cached image. If the service does not recover after that, let us know and we can look into the volume directly.
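To tell this container-level failure apart from ordinary PostgreSQL or query errors when scanning deploy logs, a check along these lines could help. This is a minimal sketch, not a Railway tool: the function name and the regular expression are assumptions for illustration, and the sample lines are copied from the logs quoted in this thread.

```python
import re

# Assumed pattern: matches the init failure quoted in this thread,
# e.g. "ERROR (catatonit:2): failed to exec pid1: No such file or directory".
PID1_FAILURE = re.compile(r"catatonit:\d+\).*failed to exec pid1")


def is_container_startup_failure(log_lines):
    """Return True if any line shows the container init failing to exec pid1."""
    return any(PID1_FAILURE.search(line) for line in log_lines)


# Sample lines taken from the deploy logs in the original post.
sample = [
    "Mounting volume on: /var/lib/containers/railwayapp/bind-mounts/.../vol_...",
    "ERROR (catatonit:2): failed to exec pid1: No such file or directory",
]

print(is_container_startup_failure(sample))
```

A match means the container failed before PostgreSQL was launched, which is consistent with the volume data being untouched; it does not by itself prove the volume is healthy.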

Status changed to Awaiting User Response Railway • about 1 month ago

flamerevenge

HOBBYOP

a month ago

Thanks, it worked!

Status changed to Awaiting Railway Response Railway • about 1 month ago

Status changed to Solved flamerevenge • about 1 month ago
