Postgres volume resized to 5 GB but recovery still fails with pg_wal/xlogtemp "No space left on device"

xxgatsu2

HOBBY · OP

4 months ago

Project: ImpiantiPro Web Staging

Project ID: a2e5e520-9c05-464b-9a42-c87e703ac05b

Environment: production

Postgres service ID: c288e722-f0da-4452-889b-58c7ae96f972

Volume ID: 048a2936-ecdc-4480-a62d-14632d4974a2

Volume name: postgres-volume

Workspace plan: Hobby

Timeline:

- On March 7, 2026 around 18:12 UTC Postgres started crashing during recovery with:

FATAL: could not write to file "pg_wal/xlogtemp.30": No space left on device

- We upgraded the workspace to Hobby and resized the Postgres volume from 500 MB to 1 GB, then from 1 GB to 5 GB.

- Railway UI and CLI both now show postgres-volume sizeMB=5000 and currentSizeMB about 203.6 MB.

- Despite this, every Postgres restart still loops in recovery and fails with the same error.

Current repeated logs after the 5 GB resize:

- database system was not properly shut down; automatic recovery in progress

- redo starts at 0/761BF38

- redo done at 0/17FFFFA8

- FATAL: could not write to file "pg_wal/xlogtemp.30": No space left on device

- shutting down due to startup process failure
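For context on the redo range in these logs: a PostgreSQL LSN is a 64-bit byte position in the WAL stream, printed as two hexadecimal halves separated by a slash. A minimal sketch of the arithmetic, using hypothetical helper names (`lsn_to_int`, `wal_bytes_between`) that are not part of any Postgres tooling, shows roughly how much WAL was replayed before the failure:

```python
def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '0/761BF38' to an absolute byte position."""
    high, low = lsn.split("/")
    # The part before the slash is the upper 32 bits, the part after is the lower 32 bits.
    return (int(high, 16) << 32) | int(low, 16)


def wal_bytes_between(start: str, end: str) -> int:
    """Number of WAL bytes between two LSNs (end minus start)."""
    return lsn_to_int(end) - lsn_to_int(start)


# Values copied from the "redo starts at" / "redo done at" log lines above.
replayed = wal_bytes_between("0/761BF38", "0/17FFFFA8")
print(f"{replayed} bytes (~{replayed / 2**20:.1f} MiB) of WAL replayed")
# prints: 278806640 bytes (~265.9 MiB) of WAL replayed
```

This only measures the span that redo covered. It does not say how much free space recovery needs afterwards; the write that fails here happens after redo is done, so even a small shortfall on the filesystem would be enough to trigger it.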

The backend depends on this DB and cannot start because Postgres never reaches a consistent recovery state.

This looks like the volume resize was not actually applied to the filesystem used by pg_wal, or the volume/filesystem is stuck in a bad recovery state. Can the Railway team please inspect and manually repair this volume / filesystem resize state? If recovery is not possible, please advise the safest platform-level recovery path.

This is the production database and there are no available backups on the current plan.

Solved

2 Replies

Status changed to Awaiting Railway Response Railway • 4 months ago

xxgatsu2

HOBBY · OP

4 months ago

Update as of March 8, 2026:

I retried the recovery flow today.

- I redeployed the Postgres service (`ghcr.io/railwayapp-templates/postgres-ssl:17`). It briefly came back online, then crashed again.

- The Postgres logs still show the same recovery failure after redo completes:

`FATAL: could not write to file "pg_wal/xlogtemp.30": No space left on device`

- The current logs also include:

`database system was interrupted while in recovery`

`HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery.`

After Postgres briefly came back, I redeployed the backend.

- The backend image builds successfully.

- The deployment then fails its healthcheck because the app cannot reach the internal Postgres host.

- Current backend deploy logs show:

`psycopg2.OperationalError: could not translate host name "postgres.railway.internal" to address: Temporary failure in name resolution`
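The error above is a name-resolution failure, which means the lookup failed before any TCP connection to Postgres was attempted. A small sketch that separates "host name does not resolve" from other connection problems can help rule out application code. The function name `classify_db_host` and the injectable `resolver` parameter are illustrative choices, not part of psycopg2 or Railway, and the default port 5432 is an assumption:

```python
import socket


def classify_db_host(host, port=5432, resolver=socket.getaddrinfo):
    """Return 'dns_failure' if the host name cannot be resolved, else 'resolved'.

    `resolver` defaults to socket.getaddrinfo and can be swapped out in tests.
    """
    try:
        resolver(host, port)
    except socket.gaierror:
        # Raised for lookup errors such as "Temporary failure in name resolution".
        return "dns_failure"
    return "resolved"


if __name__ == "__main__":
    # Host name taken from the backend deploy log above.
    print(classify_db_host("postgres.railway.internal"))
```

If this reports `dns_failure` from inside the backend container while Postgres is crash-looping, that is consistent with the database service being unavailable rather than with a bug in the backend, although that reading is an assumption about how the private network behaves.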

This still looks like a Railway-side Postgres / volume recovery problem rather than an application code issue.

Can you please check the volume / filesystem state on your side and confirm whether this service can be repaired, or whether the only safe path is a platform-level restore / replacement?

sam-a

EMPLOYEE

4 months ago

Hey! We took a look at your volume and confirmed the issue. The volume was resized to 5 GB at the ZFS level, but the ext4 filesystem inside it was never expanded to match. So Postgres was still limited to the original ~500 MB of space, which is why WAL recovery kept failing with "No space left on device."

We've now expanded the filesystem to the full 5 GB and redeployed the Postgres service. It should be able to complete WAL recovery and come back online. Let us know if you need more help.
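For anyone hitting the same symptom, the mismatch described above can be spotted from inside the container by comparing the capacity the filesystem reports with the provisioned volume size. The sketch below is only an illustration: the helper names, the 15% tolerance for filesystem overhead, the reading of `sizeMB` as decimal megabytes, and the mount path in the usage example are all assumptions.

```python
import shutil


def filesystem_lags_volume(fs_total_bytes, provisioned_mb, tolerance=0.15):
    """True if the filesystem is clearly smaller than the provisioned volume.

    Assumes provisioned_mb is in decimal megabytes. The tolerance leaves room
    for filesystem metadata and reserved blocks, which reduce the reported total.
    """
    provisioned_bytes = provisioned_mb * 1000 * 1000
    return fs_total_bytes < provisioned_bytes * (1 - tolerance)


def check_mount(path, provisioned_mb):
    """Report what the filesystem mounted at `path` actually offers."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return {
        "total_mb": usage.total / 1_000_000,
        "free_mb": usage.free / 1_000_000,
        "lags_volume": filesystem_lags_volume(usage.total, provisioned_mb),
    }


if __name__ == "__main__":
    # Hypothetical mount path; adjust to wherever the volume is attached.
    print(check_mount("/var/lib/postgresql/data", provisioned_mb=5000))
```

A filesystem total near 500 MB against a provisioned 5000 MB would match the state described in this reply, where the volume had grown but the ext4 filesystem inside it had not.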

Status changed to Awaiting User Response Railway • 4 months ago

Railway

BOT

4 months ago

This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!

Status changed to Solved Railway • 4 months ago
