URGENT: Postgres recovery fails with ENOSPC; postgres-ssl:17.9 also fails with catatonit/pid1
kajabednar12-cpu (PRO, OP)

a month ago

Hello,

Our managed Postgres instance (postgres-u68c, service ID postgres-u68c-production) will not start, we cannot recover it ourselves, and we need urgent provider-side intervention.

== Timeline ==

1. A large DELETE operation filled the Postgres volume, causing ENOSPC on pg_wal/xlogtemp.*

2. We performed a Live Resize to 250 GB via the Railway UI.

3. After the resize, the volume got a new bind-mount UUID (3d697925-b8e1-4422-9b9e-d2ad757761dc).

4. Container now fails before Postgres even starts:

ERROR (catatonit:2): failed to exec pid1: No such file or directory
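Because the failure started with ENOSPC and continued after a Live Resize, it can help to confirm from inside the container that the filesystem really reports the new size, and to see how much of it pg_wal occupies. The following is a minimal Python sketch. It assumes it runs inside the Postgres container, that a Python 3 interpreter is available there (the Postgres image may not ship one), and that the data directory is /var/lib/postgresql/data/pgdata, the path used in the pinned solution. dir_size and space_report are hypothetical helper names.

```python
import os
import shutil

# Assumption: data directory taken from the pinned solution; adjust for other images.
PGDATA = "/var/lib/postgresql/data/pgdata"


def dir_size(path: str) -> int:
    """Sum the sizes of files under path in bytes (0 if path does not exist)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except FileNotFoundError:
                pass  # file vanished while walking, e.g. a recycled WAL segment
    return total


def space_report(pgdata: str = PGDATA) -> dict:
    """Filesystem size/free space for pgdata, pg_wal usage and leftover xlogtemp.* files."""
    usage = shutil.disk_usage(pgdata)
    wal_dir = os.path.join(pgdata, "pg_wal")
    temps = []
    if os.path.isdir(wal_dir):
        temps = sorted(n for n in os.listdir(wal_dir) if n.startswith("xlogtemp."))
    return {
        # Decimal gigabytes (1 GB = 10**9 bytes).
        "total_gb": round(usage.total / 1e9, 1),
        "free_gb": round(usage.free / 1e9, 1),
        "used_pct": round(100 * usage.used / usage.total, 1) if usage.total else 0.0,
        "pg_wal_gb": round(dir_size(wal_dir) / 1e9, 2),
        "xlogtemp_files": temps,
    }
```

Calling space_report() and printing the result shows whether total_gb reflects the resized volume and whether any xlogtemp.* files are left over from the failed WAL writes.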

== Key details ==

- Volume ID: vol_0cbdiv92rxepmc0l

- Old bind-mount UUID: d35196cc-a044-48fd-87a2-70200c1e16c5

- New bind-mount UUID: 3d697925-b8e1-4422-9b9e-d2ad757761dc

- Region: europe-west4-drams3a

- PostgreSQL version: 17.9

- We have NO backups (we were on the Hobby plan)

- All app services are offline (no new writes)

== What we need ==

1. Verify data integrity on volume vol_0cbdiv92rxepmc0l

2. Fix the container/entrypoint provisioning issue caused by the Live Resize so Postgres can start

3. If the container cannot be recovered: extract/snapshot the data from the volume and restore it to a new Postgres service

4. This is production data — please treat as URGENT

Time of incident: approximately 19:00 UTC on April 10, 2026

Update 1: The bind-mount UUID is not stable; it changes between restart attempts.

Observed values:

- d35196cc-a044-48fd-87a2-70200c1e16c5

- 3d697925-b8e1-4422-9b9e-d2ad757761dc

- 4dc31b8c-ed74-4d4a-ab2d-8e692c057701

Volume ID remains constant: vol_0cbdiv92rxepmc0l.

The startup failure is still:

ERROR (catatonit:2): failed to exec pid1: No such file or directory

Update 2:

Testing both image versions points to two separate platform-level issues:

  1. image/runtime startup failure on 17.9 (catatonit cannot exec pid1)
  2. storage ENOSPC during crash recovery on 17

Please proceed with provider-side recovery on volume vol_0cbdiv92rxepmc0l and avoid requiring customer-side redeploy loops.

Solved · $10 Bounty

Pinned Solution

Try this:

  1. Use version 17 instead of version 17.9; there may be conflicts with how 17.9 reads the data.
  2. This step is optional but highly recommended: back up the current volume in case something goes fatally wrong (Postgres service -> Backups -> New backup).
  3. Set your Postgres start command to "sleep infinity" and redeploy, so the container stays up without starting Postgres.
  4. SSH into the container: right-click the Postgres service, click "Copy SSH Command", and run the copied command in your terminal.
  5. Run "su postgres", then "pg_resetwal -f /var/lib/postgresql/data/pgdata".
    1. The output should include the line "Write-ahead log reset".
  6. Remove the custom start command and redeploy.
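Step 5 can be wrapped with a few safety checks. The following is a minimal Python sketch, not part of the original answer. It assumes it runs inside the container while the start command is "sleep infinity", that Python 3 is available there, and that the data directory is the one from step 5. It runs pg_resetwal's real -n (--dry-run) mode by default, so the current control values can be reviewed before the forced reset. It also refuses to proceed as root or while postmaster.pid exists, since pg_resetwal must not run against a live server. resetwal_cmd, preflight_problems, reset_succeeded and run_resetwal are hypothetical helper names.

```python
import os
import subprocess

# Assumption: data directory from step 5 of the pinned solution.
PGDATA = "/var/lib/postgresql/data/pgdata"
# pg_resetwal prints this line after a successful forced reset.
SUCCESS_MARKER = "Write-ahead log reset"


def resetwal_cmd(pgdata: str, force: bool) -> list[str]:
    # -n (--dry-run) only prints the control values that would be written;
    # -f (--force) performs the reset, as in step 5.
    return ["pg_resetwal", "-f" if force else "-n", pgdata]


def preflight_problems(pgdata: str) -> list[str]:
    """Reasons not to run pg_resetwal yet; an empty list means the checks passed."""
    problems = []
    if os.geteuid() == 0:
        problems.append("running as root; switch to the postgres user first (su postgres)")
    if os.path.exists(os.path.join(pgdata, "postmaster.pid")):
        problems.append("postmaster.pid exists; is the server still running?")
    return problems


def reset_succeeded(stdout: str) -> bool:
    return SUCCESS_MARKER in stdout


def run_resetwal(pgdata: str = PGDATA, force: bool = False) -> bool:
    problems = preflight_problems(pgdata)
    for problem in problems:
        print("refusing:", problem)
    if problems:
        return False
    proc = subprocess.run(resetwal_cmd(pgdata, force), capture_output=True, text=True)
    print(proc.stdout + proc.stderr)
    if force:
        return proc.returncode == 0 and reset_succeeded(proc.stdout)
    return proc.returncode == 0
```

Usage: call run_resetwal() first to inspect the dry-run output, then run_resetwal(force=True). The forced reset sits behind an explicit flag on purpose, because it cannot be undone; taking the backup from step 2 first still matters.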


Status changed to Awaiting Railway Response by Railway about 1 month ago


Railway (BOT)

a month ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open by Railway about 1 month ago




0x5b62656e5d

Have you tried redeploying the service?

kajabednar12-cpu (PRO, OP)

a month ago

Yes. After redeploying, it immediately crashes with this log:

2026-04-12T14:32:13.114376306Z [err] ERROR (catatonit:2): failed to exec pid1: No such file or directory
2026-04-12T14:32:13.332399473Z [inf] Mounting volume on: /var/lib/containers/railwayapp/bind-mounts/a3e22271-deed-429c-a158-56585e3f2c95/vol_0cbdiv92rxepmc0l
2026-04-12T14:32:13.332497330Z [inf] Starting Container
2026-04-12T14:32:13.332581836Z [inf] Mounting volume on: /var/lib/containers/railwayapp/bind-mounts/a3e22271-deed-429c-a158-56585e3f2c95/vol_0cbdiv92rxepmc0l
2026-04-12T14:32:13.426272471Z [err] ERROR (catatonit:2): failed to exec pid1: No such file or directory
2026-04-12T14:32:14.397055200Z [inf] Mounting volume on: /var/lib/containers/railwayapp/bind-mounts/a3e22271-deed-429c-a158-56585e3f2c95/vol_0cbdiv92rxepmc0l
2026-04-12T14:32:14.797874420Z [err] ERROR (catatonit:2): failed to exec pid1: No such file or directory
2026-04-12T14:32:15.415967488Z [inf] Mounting volume on: /var/lib/containers/railwayapp/bind-mounts/a3e22271-deed-429c-a158-56585e3f2c95/vol_0cbdiv92rxepmc0l
2026-04-12T14:32:15.743216265Z [err] ERROR (catatonit:2): failed to exec pid1: No such file or directory
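Update 1 tracks bind-mount UUIDs by hand, and this log shows yet another one. A small parser over an exported deploy log can collect them automatically. The following is a minimal Python sketch. summarize, MOUNT_RE and PID1_ERROR are hypothetical names, and the path pattern is inferred from the log lines in this thread, not from any documented Railway log format.

```python
import re

# Assumption: mount lines look like the ones in this thread:
# ".../bind-mounts/<uuid>/<volume id>"
MOUNT_RE = re.compile(
    r"bind-mounts/([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})/(vol_[0-9a-z]+)"
)
PID1_ERROR = "failed to exec pid1"


def summarize(log_text: str) -> dict:
    """Distinct bind-mount UUIDs and volume IDs, plus the count of pid1 exec failures."""
    mounts = MOUNT_RE.findall(log_text)  # list of (uuid, volume_id) tuples
    return {
        "bind_mount_uuids": sorted({uuid for uuid, _vol in mounts}),
        "volumes": sorted({vol for _uuid, vol in mounts}),
        "pid1_failures": log_text.count(PID1_ERROR),
    }
```

Applied to the log above, it should return the single UUID a3e22271-deed-429c-a158-56585e3f2c95, the volume vol_0cbdiv92rxepmc0l, and four pid1 failures. Running it over logs from several restart attempts lists every UUID seen.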





kajabednar12-cpu (PRO, OP)

a month ago

Hello, quick update and confirmation:

Your proposed recovery procedure worked.

What we did:

  1. Switched to image version 17
  2. Created a backup/snapshot first
  3. Set start command to sleep infinity and redeployed
  4. Connected via SSH and ran "su postgres", then "pg_resetwal -f /var/lib/postgresql/data/pgdata"
  5. Restored normal start command and redeployed

Current status:

  • PostgreSQL starts successfully and reaches "ready to accept connections"
  • The ENOSPC error on pg_wal/xlogtemp has not recurred
  • Checkpoints are completing successfully
  • Backend and frontend are running
  • Critical user flows were tested and are working
  • We also created a logical dump as an extra safety backup

Thank you! This resolved the incident on our side.

If you recommend any post-recovery checks after pg_resetwal, please share. Otherwise we can consider this ticket resolved.

Best regards


IIRC, I'd just verify that the database itself isn't corrupted, either through your application or with a DB client such as DataGrip. Otherwise, great!
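For a check that goes beyond spot tests in the application: the PostgreSQL documentation for pg_resetwal warns that after a reset the database may contain inconsistent data from partially committed transactions, and recommends dumping, running initdb and restoring, then checking for inconsistencies. The following is a minimal Python sketch of two automated checks. pg_amcheck verifies tables and B-tree indexes, and pg_dump to /dev/null proves every row can still be read. It assumes both client tools are on PATH and that connection details come from the standard libpq environment variables (PGHOST, PGPORT, PGUSER, PGPASSWORD). check_commands and run_checks are hypothetical names, and passing these checks does not prove the data is logically consistent.

```python
import subprocess

# Assumption: PGHOST, PGPORT, PGUSER and PGPASSWORD are set in the environment;
# both pg_amcheck and pg_dump read them through libpq.


def check_commands(dbname: str) -> list[list[str]]:
    return [
        # Checks tables and B-tree indexes in every database for corruption;
        # --install-missing creates the amcheck extension where it is absent.
        ["pg_amcheck", "--all", "--install-missing"],
        # A full logical dump has to read every row; the output is discarded.
        ["pg_dump", "--dbname", dbname, "--file", "/dev/null"],
    ]


def run_checks(dbname: str) -> dict[str, int]:
    """Run each check and return {tool name: exit code}; 0 means it completed cleanly."""
    results = {}
    for cmd in check_commands(dbname):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[cmd[0]] = proc.returncode
        if proc.returncode != 0:
            # pg_amcheck reports corruption on stdout, errors on stderr.
            print(f"{cmd[0]} exited with {proc.returncode}:\n{proc.stdout}{proc.stderr}")
    return results
```

Usage: run_checks("your_db"), where "your_db" is a placeholder for the application database name. Any non-zero exit code deserves a closer look before relying on the restored data.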


Status changed to Solved by 0x5b62656e5d about 1 month ago

