a month ago
Hello,
our managed Postgres instance (postgres-u68c, service ID postgres-u68c-production) is completely unrecoverable and we need urgent provider-side intervention.
== Timeline ==
1. Large DELETE operation filled the Postgres volume → ENOSPC on pg_wal/xlogtemp.*
2. We performed Live Resize to 250 GB via Railway UI
3. After resize, volume got new bind-mount UUID (3d697925-b8e1-4422-9b9e-d2ad757761dc)
4. Container now fails before Postgres even starts:
ERROR (catatonit:2): failed to exec pid1: No such file or directory
== Key details ==
- Volume ID: vol_0cbdiv92rxepmc0l
- Old bind-mount UUID: d35196cc-a044-48fd-87a2-70200c1e16c5
- New bind-mount UUID: 3d697925-b8e1-4422-9b9e-d2ad757761dc
- Region: europe-west4-drams3a
- PostgreSQL version: 17.9
- We have NO backups (were on Hobby plan)
- All app services are offline (no new writes)
== What we need ==
1. Verify data integrity on volume vol_0cbdiv92rxepmc0l
2. Fix the container/entrypoint provisioning issue caused by Live Resize
so Postgres can start
3. If container cannot be recovered: extract/snapshot data from volume
and restore to a new Postgres service
4. This is production data — please treat as URGENT
Time of incident: approximately April 10, 2026 ~19:00 UTC
Update1: bind-mount UUID is not stable and changes between restart attempts.
Observed values:
- d35196cc-a044-48fd-87a2-70200c1e16c5
- 3d697925-b8e1-4422-9b9e-d2ad757761dc
- 4dc31b8c-ed74-4d4a-ab2d-8e692c057701
Volume ID remains constant: vol_0cbdiv92rxepmc0l.
The startup failure is still:
ERROR (catatonit:2): failed to exec pid1: No such file or directory
Update2:
- On image ghcr.io/railwayapp-templates/postgres-ssl:17, PostgreSQL starts recovery and consistently fails with: FATAL: could not write to file "pg_wal/xlogtemp.30": No space left on device
- On image ghcr.io/railwayapp-templates/postgres-ssl:17.9, container fails before PostgreSQL starts: ERROR (catatonit:2): failed to exec pid1: No such file or directory
This indicates two separate platform-level issues:
- image/runtime startup issue on 17.9
- storage ENOSPC during recovery on 17
Please proceed with provider-side recovery on volume vol_0cbdiv92rxepmc0l and avoid requiring customer-side redeploy loops.
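Since the ENOSPC error persisted after the resize, one thing worth checking from inside the container is whether the filesystem holding the data directory actually reflects the new size. The following is a minimal sketch, not a Railway-provided tool: the `usage_pct` helper and the `PGDATA_DIR` variable are hypothetical names, and the default path is the data directory mentioned later in this thread.

```shell
#!/bin/sh
# Assumption: the data directory path used elsewhere in this thread.
PGDATA_DIR="${PGDATA_DIR:-/var/lib/postgresql/data/pgdata}"

# Print how full the filesystem holding a path is, as an integer percent.
usage_pct() {
    # df -P prints one line per filesystem; column 5 is the capacity (e.g. 97%).
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Fall back to / so the sketch still runs outside the Postgres container.
[ -d "$PGDATA_DIR" ] || PGDATA_DIR=/

echo "filesystem for $PGDATA_DIR is $(usage_pct "$PGDATA_DIR")% full"

# Show how much of that space is write-ahead log, if the directory exists.
if [ -d "$PGDATA_DIR/pg_wal" ]; then
    du -sh "$PGDATA_DIR/pg_wal"
fi
```

If the reported size still matches the old volume, the resize has probably not reached the filesystem; if it shows the new size but is nearly full, pg_wal itself is the likely consumer.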
Pinned Solution
a month ago
Try this:
- Use version 17 instead of version 17.9, there may be conflicts with how it reads data.
- This step is optional, but highly recommended: Backup the current volume in case something fatal happens. (Postgres service -> Backups -> New backup)
- Set your Postgres start command to be sleep infinity and redeploy.
- SSH into the container by clicking "Copy SSH Command" when right clicking Postgres
- Run su postgres, followed by pg_resetwal -f /var/lib/postgresql/data/pgdata
- There should be a log that says "Write-ahead log reset"
- Remove the custom start command and redeploy
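The commands run inside the SSH session can be sketched roughly as follows. This is an illustration under assumptions, not an official procedure: the `run` and `recover` helpers and the `APPLY` switch are hypothetical, and the `pg_resetwal -n` dry run is an extra step that is not part of the instructions above. By default the script only prints what it would do.

```shell
#!/bin/sh
# Data directory from the steps above; override with PGDATA if yours differs.
PGDATA="${PGDATA:-/var/lib/postgresql/data/pgdata}"

# Print a command, and execute it only when APPLY=1 is set.
run() {
    echo "+ $*"
    [ "${APPLY:-0}" = "1" ] || return 0
    "$@"
}

recover() {
    # Confirm how much space is left before touching anything.
    run df -h "$PGDATA" &&
    # Dry run: -n prints the values pg_resetwal would use and changes nothing.
    run su postgres -c "pg_resetwal -n $PGDATA" &&
    # Actual reset; -f forces it even if pg_control looks inconsistent.
    run su postgres -c "pg_resetwal -f $PGDATA"
}

recover
```

Running it with `APPLY=1` executes the commands for real. pg_resetwal discards write-ahead log that has not been replayed, so recent transactions may be lost or left partially applied, which is why the backup step beforehand matters.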
6 Replies
Status changed to Awaiting Railway Response Railway • about 1 month ago
Status changed to Open Railway • about 1 month ago
0x5b62656e5d
Have you tried redeploying the service?
a month ago
Yes. After redeploying, it immediately crashes with this log:
2026-04-12T14:32:13.114376306Z [err] ERROR (catatonit:2): failed to exec pid1: No such file or directory
2026-04-12T14:32:13.332399473Z [inf] Mounting volume on: /var/lib/containers/railwayapp/bind-mounts/a3e22271-deed-429c-a158-56585e3f2c95/vol_0cbdiv92rxepmc0l
2026-04-12T14:32:13.332497330Z [inf] Starting Container
2026-04-12T14:32:13.332581836Z [inf] Mounting volume on: /var/lib/containers/railwayapp/bind-mounts/a3e22271-deed-429c-a158-56585e3f2c95/vol_0cbdiv92rxepmc0l
2026-04-12T14:32:13.426272471Z [err] ERROR (catatonit:2): failed to exec pid1: No such file or directory
2026-04-12T14:32:14.397055200Z [inf] Mounting volume on: /var/lib/containers/railwayapp/bind-mounts/a3e22271-deed-429c-a158-56585e3f2c95/vol_0cbdiv92rxepmc0l
2026-04-12T14:32:14.797874420Z [err] ERROR (catatonit:2): failed to exec pid1: No such file or directory
2026-04-12T14:32:15.415967488Z [inf] Mounting volume on: /var/lib/containers/railwayapp/bind-mounts/a3e22271-deed-429c-a158-56585e3f2c95/vol_0cbdiv92rxepmc0l
2026-04-12T14:32:15.743216265Z [err] ERROR (catatonit:2): failed to exec pid1: No such file or directory
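To see whether the bind-mount UUID really changes between restart attempts, the distinct UUIDs can be pulled out of saved deploy logs. A small sketch, assuming the log lines keep the "bind-mounts/<uuid>/<volume id>" shape seen in this thread; `mount_uuids` is a hypothetical helper name, and the sample input is two lines copied from the log in this reply.

```shell
#!/bin/sh
# Print the distinct bind-mount UUIDs found in deploy logs read from stdin.
mount_uuids() {
    grep -oE 'bind-mounts/[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' |
        cut -d/ -f2 |
        sort -u
}

# Sample input: two "Mounting volume" lines from the log in this thread.
mount_uuids <<'EOF'
2026-04-12T14:32:13.332399473Z [inf] Mounting volume on: /var/lib/containers/railwayapp/bind-mounts/a3e22271-deed-429c-a158-56585e3f2c95/vol_0cbdiv92rxepmc0l
2026-04-12T14:32:14.397055200Z [inf] Mounting volume on: /var/lib/containers/railwayapp/bind-mounts/a3e22271-deed-429c-a158-56585e3f2c95/vol_0cbdiv92rxepmc0l
EOF
# prints: a3e22271-deed-429c-a158-56585e3f2c95
```

More than one line of output for logs spanning several deploys would match the earlier observation that the UUID is not stable; within this single log excerpt only one UUID appears.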
a month ago
Hello, quick update and confirmation:
Your proposed recovery procedure worked.
What we did:
- Switched to image version 17
- Created a backup/snapshot first
- Set start command to sleep infinity and redeployed
- Connected via SSH and ran su postgres, followed by pg_resetwal -f /var/lib/postgresql/data/pgdata
- Restored normal start command and redeployed
Current status:
- PostgreSQL starts successfully and reaches ready to accept connections
- No recurring ENOSPC error on pg_wal/xlogtemp
- Checkpoints are completing successfully
- Backend and frontend are running
- Critical user flows were tested and are working
- We also created a logical dump as an extra safety backup
Thank you, this resolved the incident on our side.
If you recommend any post-recovery checks after pg_resetwal, please share. Otherwise we can consider this ticket resolved.
Best regards
a month ago
IIRC, I'd just verify through your application or a DB client such as DataGrip that the database itself isn't corrupted. Otherwise, great!
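For a more systematic check than clicking through the application, a rough sketch follows. It assumes the standard PostgreSQL client tools are available and that connection settings come from the usual PG* environment variables (PGHOST, PGUSER, PGPASSWORD and so on); the `run` and `check` helpers and the `APPLY` switch are hypothetical. The PostgreSQL documentation for pg_resetwal also advises dumping the data, running initdb, and restoring afterwards, because the reset can leave partially committed transactions behind.

```shell
#!/bin/sh
# Print a command, and execute it only when APPLY=1 is set.
run() {
    echo "+ $*"
    [ "${APPLY:-0}" = "1" ] || return 0
    "$@"
}

check() {
    # Reading every table via a full dump surfaces unreadable pages or rows.
    run pg_dumpall --file=/dev/null &&
    # pg_amcheck (PostgreSQL 14+) checks heap and B-tree index structure;
    # --install-missing creates the amcheck extension where it is absent.
    run pg_amcheck --all --install-missing
}

check
```

Run with `APPLY=1` to execute the checks. A clean pass does not prove logical consistency (for example, rows from a half-applied transaction), so application-level spot checks like the ones described above remain worthwhile.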
Status changed to Solved 0x5b62656e5d • about 1 month ago
