URGENT: Postgres recovery fails with ENOSPC; postgres-ssl:17.9 also fails with catatonit/pid1

kajabednar12-cpu

PRO · OP

3 months ago

Hello,

our managed Postgres instance (postgres-u68c, service ID postgres-u68c-production) is completely unrecoverable and we need urgent provider-side intervention.

== Timeline ==

1. Large DELETE operation filled the Postgres volume → ENOSPC on pg_wal/xlogtemp.*

2. We performed Live Resize to 250 GB via Railway UI

3. After resize, volume got new bind-mount UUID (3d697925-b8e1-4422-9b9e-d2ad757761dc)

4. Container now fails before Postgres even starts:

ERROR (catatonit:2): failed to exec pid1: No such file or directory

== Key details ==

- Volume ID: vol_0cbdiv92rxepmc0l

- Old bind-mount UUID: d35196cc-a044-48fd-87a2-70200c1e16c5

- New bind-mount UUID: 3d697925-b8e1-4422-9b9e-d2ad757761dc

- Region: europe-west4-drams3a

- PostgreSQL version: 17.9

- We have NO backups (were on Hobby plan)

- All app services are offline (no new writes)

== What we need ==

1. Verify data integrity on volume vol_0cbdiv92rxepmc0l

2. Fix the container/entrypoint provisioning issue caused by Live Resize so Postgres can start

3. If container cannot be recovered: extract/snapshot data from volume and restore to a new Postgres service

4. This is production data — please treat as URGENT

Time of incident: approximately April 10, 2026 ~19:00 UTC

Update1: bind-mount UUID is not stable and changes between restart attempts.

Observed values:

- d35196cc-a044-48fd-87a2-70200c1e16c5

- 3d697925-b8e1-4422-9b9e-d2ad757761dc

- 4dc31b8c-ed74-4d4a-ab2d-8e692c057701

Volume ID remains constant: vol_0cbdiv92rxepmc0l.

The startup failure is still:

ERROR (catatonit:2): failed to exec pid1: No such file or directory

Update2:

- On image ghcr.io/railwayapp-templates/postgres-ssl:17, PostgreSQL starts recovery and consistently fails with: FATAL: could not write to file "pg_wal/xlogtemp.30": No space left on device

- On image ghcr.io/railwayapp-templates/postgres-ssl:17.9, container fails before PostgreSQL starts: ERROR (catatonit:2): failed to exec pid1: No such file or directory

This indicates two separate platform-level issues:

- image/runtime startup issue on 17.9

- storage ENOSPC during recovery on 17

Please proceed with provider-side recovery on volume vol_0cbdiv92rxepmc0l and avoid requiring customer-side redeploy loops.

Solved · $10 Bounty

Pinned Solution

0x5b62656e5d

MODERATOR

3 months ago

Try this:

1. Use version 17 instead of version 17.9; there may be conflicts with how it reads data.
2. This step is optional, but highly recommended: back up the current volume in case something fatal happens. (Postgres service -> Backups -> New backup)
3. Set your Postgres start command to sleep infinity and redeploy.
4. SSH into the container by right-clicking Postgres and clicking "Copy SSH Command".
5. Run su postgres, followed by pg_resetwal -f /var/lib/postgresql/data/pgdata
   - There should be a log line that says "Write-ahead log reset"
6. Remove the custom start command and redeploy.
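
Before resetting the WAL it can help to confirm how much room the resized volume actually has and how large pg_wal has grown. The following is only a sketch, assuming a Python interpreter is available wherever the volume is mounted (the stock Postgres image may not ship one, in which case df -h and du -sh give the same information). The path comes from the pg_resetwal step above; the function names are made up for illustration.

```python
import os
import shutil


def dir_size_bytes(path):
    """Sum the sizes of regular files under path (0 if path is missing)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def volume_report(pgdata):
    """Report filesystem capacity for pgdata and the size of its pg_wal directory."""
    usage = shutil.disk_usage(pgdata)
    return {
        "total_gib": usage.total / 2**30,
        "free_gib": usage.free / 2**30,
        "pg_wal_gib": dir_size_bytes(os.path.join(pgdata, "pg_wal")) / 2**30,
    }


# Data directory as used in the pg_resetwal step; adjust if your layout differs.
PGDATA = "/var/lib/postgresql/data/pgdata"
if os.path.isdir(PGDATA):
    for key, value in volume_report(PGDATA).items():
        print(f"{key}: {value:.2f}")
```

If free_gib is still close to zero after a resize, the filesystem probably has not picked up the new size, which would be consistent with recovery continuing to fail with ENOSPC.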

6 Replies

Status changed to Awaiting Railway Response Railway • 3 months ago

Railway

BOT

3 months ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open Railway • 3 months ago

0x5b62656e5d

MODERATOR

3 months ago

Have you tried redeploying the service?

kajabednar12-cpu

PRO · OP

3 months ago

Yes, after redeploying it immediately crashes with this log:

2026-04-12T14:32:13.114376306Z [err] ERROR (catatonit:2): failed to exec pid1: No such file or directory
2026-04-12T14:32:13.332399473Z [inf] Mounting volume on: /var/lib/containers/railwayapp/bind-mounts/a3e22271-deed-429c-a158-56585e3f2c95/vol_0cbdiv92rxepmc0l
2026-04-12T14:32:13.332497330Z [inf] Starting Container
2026-04-12T14:32:13.332581836Z [inf] Mounting volume on: /var/lib/containers/railwayapp/bind-mounts/a3e22271-deed-429c-a158-56585e3f2c95/vol_0cbdiv92rxepmc0l
2026-04-12T14:32:13.426272471Z [err] ERROR (catatonit:2): failed to exec pid1: No such file or directory
2026-04-12T14:32:14.397055200Z [inf] Mounting volume on: /var/lib/containers/railwayapp/bind-mounts/a3e22271-deed-429c-a158-56585e3f2c95/vol_0cbdiv92rxepmc0l
2026-04-12T14:32:14.797874420Z [err] ERROR (catatonit:2): failed to exec pid1: No such file or directory
2026-04-12T14:32:15.415967488Z [inf] Mounting volume on: /var/lib/containers/railwayapp/bind-mounts/a3e22271-deed-429c-a158-56585e3f2c95/vol_0cbdiv92rxepmc0l
2026-04-12T14:32:15.743216265Z [err] ERROR (catatonit:2): failed to exec pid1: No such file or directory
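
To check the earlier observation that the bind-mount UUID varies while the volume ID stays constant, the mount lines of a deploy log can be tallied. This is a minimal sketch, assuming the log format matches the lines quoted in this thread; the regular expression and the function name are illustrative and not part of any Railway tooling.

```python
import re
from collections import Counter

# Matches the path tail seen in this thread's logs:
# .../bind-mounts/<bind-mount UUID>/<volume ID>
MOUNT_RE = re.compile(
    r"bind-mounts/"
    r"(?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})/"
    r"(?P<volume>vol_[0-9a-z]+)"
)


def summarize_mounts(log_text):
    """Count how often each (bind-mount UUID, volume ID) pair appears."""
    return Counter((m["uuid"], m["volume"]) for m in MOUNT_RE.finditer(log_text))


# Two lines copied from the log above.
sample = (
    "2026-04-12T14:32:13.332399473Z [inf] Mounting volume on: "
    "/var/lib/containers/railwayapp/bind-mounts/"
    "a3e22271-deed-429c-a158-56585e3f2c95/vol_0cbdiv92rxepmc0l\n"
    "2026-04-12T14:32:13.426272471Z [err] ERROR (catatonit:2): "
    "failed to exec pid1: No such file or directory\n"
)

summary = summarize_mounts(sample)
for (uuid, volume), count in summary.items():
    print(uuid, volume, count)
```

Feeding it logs from several deploy attempts would show one volume ID paired with a different UUID per attempt, if the behavior reported in the original post holds.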


kajabednar12-cpu

PRO · OP

3 months ago

Hello, quick update and confirmation:

Your proposed recovery procedure worked.

What we did:

1. Switched to image version 17
2. Created a backup/snapshot first
3. Set start command to sleep infinity and redeployed
4. Connected via SSH and ran su postgres, followed by pg_resetwal -f /var/lib/postgresql/data/pgdata
5. Restored normal start command and redeployed

Current status:

- PostgreSQL starts successfully and reaches "ready to accept connections"
- No recurring ENOSPC error on pg_wal/xlogtemp
- Checkpoints are completing successfully
- Backend and frontend are running
- Critical user flows were tested and are working
- We also created a logical dump as an extra safety backup

Thank you, this resolved the incident on our side.

If you recommend any post-recovery checks after pg_resetwal, please share. Otherwise we can consider this ticket resolved.

Best regards

0x5b62656e5d

MODERATOR

3 months ago

IIRC, I'd just verify, through your application or a DB client such as DataGrip, that the database itself isn't corrupted. Otherwise, great!
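
One further check that fits here: the PostgreSQL documentation for pg_resetwal warns that the database may contain inconsistent data afterwards and recommends dumping it right away. A full dump also reads every table, so unreadable rows tend to surface as errors. The sketch below only builds and runs a pg_dump command; the connection string and output file name are placeholders, and the helper names are invented for illustration.

```python
import subprocess


def build_dump_command(database_url, out_path):
    """Return the pg_dump argv for a custom-format dump of one database."""
    return [
        "pg_dump",
        "--dbname", database_url,
        "--format", "custom",
        "--file", out_path,
    ]


def run_dump(database_url, out_path):
    """Run pg_dump; a non-zero exit status raises CalledProcessError."""
    subprocess.run(build_dump_command(database_url, out_path), check=True)


# Hypothetical connection string and output file, for illustration only.
cmd = build_dump_command("postgresql://user:pass@host:5432/dbname", "check.dump")
print(" ".join(cmd))
```

A dump that completes without errors is not proof of consistency, so spot-checking recent rows through the application remains worthwhile.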

Status changed to Solved 0x5b62656e5d • 3 months ago
