Postgres stuck in WAL recovery loop after volume resize to 40GB - filesystem not expanded
benikfar
PRO OP

2 months ago

Hi Railway team,

My Postgres service is stuck in a crash-recovery loop after the volume became full.

- I resized the postgres-volume to 40 GB (Pro plan) to recover from the full disk.

- But Postgres still fails with:

FATAL: could not write to file "pg_wal/xlogtemp.26": No space left on device

- WAL recovery starts but cannot complete because the ext4 filesystem was not expanded (it still shows a very small size while the block device is now ~40 GB).

I increased it to 40 GB only temporarily; I believe 10 GB would have been enough for my needs.

I want to keep costs as low as possible after recovery.

Data is very important (n8n workflows inside - not exported yet).

Please help expand the filesystem from your end (resize2fs) so Postgres can finish recovery.

Logs from Postgres attached below.

Thank you!

$10 Bounty

11 Replies

Status changed to Awaiting Railway Response Railway about 2 months ago


2 months ago

Hey, thanks for reaching out.

Your volume resize to 40 GB didn't fully apply on our end, which is why Postgres couldn't complete WAL recovery. We've corrected it and redeployed your Postgres service. It should now have approximately 37 GB of available disk space and be able to finish recovery.

We've also shipped a fix to prevent this from happening on future volume resizes.


Status changed to Awaiting User Response Railway about 2 months ago


benikfar
PRO OP

2 months ago

Hi

You previously fixed the volume resize and redeployed the Postgres service. Thank you.

However, the service is now failing to start with this repeated error:

ERROR (catatonit:2): failed to exec pid1: No space left on device

Mounting volume on: /var/lib/containers/railwayapp/bind-mounts/.../vol_mffmx6f086vq85ax

The container cannot even start properly.

This happened after the disk-full crash and WAL recovery attempt.

n8n Primary and Worker are also down because they can't connect to the DB.

Data (n8n workflows) is very important and not exported.

Please help recover the Postgres instance or let me know what the current status of the volume/data is.

Thank you!


Status changed to Awaiting Railway Response Railway about 2 months ago


2 months ago

Please provide a direct link to that service.


Status changed to Awaiting User Response Railway about 2 months ago



Status changed to Awaiting Railway Response Railway about 2 months ago


2 months ago

That link is to your project, not the specific service. That said, the volume for your Postgres service is correctly sized at 40GB. The startup failure you're seeing is a configuration issue on your end, not a platform problem. We're going to open this thread up to the community so they can help you get Postgres running again.


Status changed to Awaiting User Response Railway about 2 months ago


Railway
BOT

2 months ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open Railway about 2 months ago


domehane
FREE

2 months ago

Hello benikfar, can you share the full startup logs of the Postgres service, not just the catatonit line? Also, did you make any changes to the Postgres service config on Railway (env variables, start command, Dockerfile, anything) before or after the crash?


brody

That link is to your project, not the specific service. That said, the volume for your Postgres service is correctly sized at 40GB. The startup failure you're seeing is a configuration issue on your end, not a platform problem. We're going to open this thread up to the community so they can help you get Postgres running again.

catn568
FREE

2 months ago

Thank you. Please help me do this.



benikfar
PRO OP

2 months ago

Hi,

Here are the full startup logs from the Postgres service:

PostgreSQL Database directory appears to contain a database; Skipping initialization
2026-03-26 11:58:26.717 UTC [3] LOG: starting PostgreSQL 16.13 (Debian 16.13-1.pgdg13+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 14.2.0-19) 14.2.0, 64-bit
2026-03-26 11:58:26.717 UTC [3] LOG: listening on IPv4 address "0.0.0.0", port 5432
2026-03-26 11:58:26.717 UTC [3] LOG: listening on IPv6 address "::", port 5432
2026-03-26 11:58:26.724 UTC [3] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2026-03-26 11:58:26.732 UTC [26] LOG: database system was interrupted while in recovery at 2026-03-26 11:58:20 UTC
2026-03-26 11:58:26.732 UTC [26] HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery.
2026-03-26 11:58:27.212 UTC [26] LOG: database system was not properly shut down; automatic recovery in progress
2026-03-26 11:58:27.216 UTC [26] LOG: redo starts at 2/18277728
2026-03-26 11:58:31.725 UTC [26] LOG: redo done at 2/1EFFE2B8 system usage: CPU: user: 0.03 s, system: 0.13 s, elapsed: 4.50 s
2026-03-26 11:58:31.736 UTC [26] FATAL: could not write to file "pg_wal/xlogtemp.26": No space left on device
2026-03-26 11:58:31.745 UTC [3] LOG: startup process (PID 26) exited with exit code 1
2026-03-26 11:58:31.745 UTC [3] LOG: terminating any other active server processes
2026-03-26 11:58:31.745 UTC [3] LOG: shutting down due to startup process failure
2026-03-26 11:58:31.761 UTC [3] LOG: database system is shut down

Regarding your question:  
I did **not** make any changes to the Postgres service config. No environment variables were added or modified, no start command changed, no custom Dockerfile, nothing. The only thing I did was resize the volume from the default size to 40GB after it filled up.

The crash happened because the volume ran out of space, and even after resizing to 40GB, the filesystem inside wasn't expanded, so Postgres still can't write to pg_wal during recovery.

Let me know if you need any other logs (from Primary or Worker) or more details.

Thanks!
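For anyone reading the logs above: the two redo positions are PostgreSQL LSNs, which are byte offsets into the WAL written as two hexadecimal halves. A minimal sketch of the arithmetic, assuming the default 16 MB WAL segment size (the thread does not say whether it was changed), shows that roughly 110 MiB of WAL was replayed before the failure. The helper name `parse_lsn` is only illustrative.

```python
# Sketch: convert the LSNs from the log into byte positions and measure
# how much WAL was replayed between "redo starts" and "redo done".

WAL_SEGMENT_BYTES = 16 * 1024 ** 2  # PostgreSQL default; assumed unchanged here


def parse_lsn(lsn: str) -> int:
    """Convert an LSN such as '2/18277728' into an absolute byte position."""
    high, low = lsn.split("/")
    # The part before the slash is the upper 32 bits, the part after is the lower 32.
    return (int(high, 16) << 32) | int(low, 16)


redo_start = parse_lsn("2/18277728")  # from "redo starts at ..."
redo_done = parse_lsn("2/1EFFE2B8")   # from "redo done at ..."
replayed = redo_done - redo_start

print(f"{replayed} bytes replayed (~{replayed / 1024 ** 2:.1f} MiB)")
print(f"about {replayed / WAL_SEGMENT_BYTES:.1f} WAL segments of 16 MB")
```

The replay itself finished; the FATAL line comes afterwards, when Postgres tries to create a new file under pg_wal.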

domehane
FREE

2 months ago

Good news and bad news from your logs. The good news is that redo actually completed successfully: you can see 'redo done at 2/1EFFE2B8' in the logs, which means WAL replay finished and your data looks intact. The bad news is that right after recovery Postgres tries to create a new WAL file (the xlogtemp.26 in the error) and fails there with the same no-space error.

This matches what you suspected: the block device is 40 GB, but the filesystem inside was apparently never expanded (resize2fs was never run). Railway probably verified the volume size at the block device level and saw 40 GB, but the ext4 filesystem inside can still be the old small size.

I think you should ask Railway specifically to check the actual filesystem size with df -h inside the volume (not just the block device size) and, if it's still the old size, to run resize2fs on it. Only they can do this, since the container runs on their infrastructure. Your data should be safe; Postgres just needs a small amount of free filesystem space (a new WAL segment is 16 MB by default) to finish startup, and then it should come up fully.

Hope this helps.
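To make the block-device versus filesystem distinction concrete, here is a rough sketch of the kind of check being requested. It reads the filesystem's own view of its size (the same numbers df reports) and compares it with the size the volume is supposed to have. The function name `filesystem_report`, the 80% threshold, and the use of the `PGDATA` environment variable for the mount path are assumptions for illustration, not anything Railway exposes, and it would have to run somewhere the volume is actually mounted.

```python
import os

GIB = 1024 ** 3
WAL_SEGMENT_BYTES = 16 * 1024 ** 2  # PostgreSQL default WAL segment size


def filesystem_report(path: str, expected_bytes: int) -> dict:
    """Report the filesystem size seen at `path` against the expected volume size."""
    st = os.statvfs(path)
    total = st.f_frsize * st.f_blocks  # size of the filesystem, not the block device
    free = st.f_frsize * st.f_bavail   # space available to unprivileged processes
    return {
        "total_gib": round(total / GIB, 2),
        "free_gib": round(free / GIB, 2),
        # ext4 keeps some space for metadata, so use a loose (assumed) 80% threshold.
        "looks_unexpanded": total < 0.8 * expected_bytes,
        # Postgres needs room for at least one new WAL segment to finish startup.
        "room_for_wal_segment": free >= WAL_SEGMENT_BYTES,
    }


if __name__ == "__main__":
    # Falls back to the current directory when PGDATA is not set.
    print(filesystem_report(os.environ.get("PGDATA", "."), 40 * GIB))
```

If `looks_unexpanded` comes back true while the block device reports 40 GB, that would be the case where growing the filesystem (for ext4, with resize2fs) is still needed.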



benikfar
PRO OP

2 months ago

Hi,

Thank you so much for checking my logs and explaining everything so clearly! I really appreciate it.

Good news that the redo completed successfully and my data is safe 👍

Bad news is exactly as you said — the filesystem was never expanded after I resized the volume to 40GB.

I'll now contact Railway support and ask them to check the actual filesystem size with df -h and run resize2fs if needed.

Thanks again for your detailed help. It helped me understand the issue much better.

I'll update you once I hear back from them.



benikfar
PRO OP

2 months ago

My n8n service has been completely down for **3 full days** now and my website is still inaccessible.

As explained by a community member after reviewing my Postgres logs:

- The WAL recovery actually completed successfully ("redo done at 2/1EFFE2B8"), so my data is intact.

- However, Postgres fails right after when trying to write the new WAL checkpoint because the filesystem was never expanded.

I resized the postgres-volume to 40GB, but the ext4 filesystem inside is still the old small size. This is why I'm getting the "No space left on device" error even after resize.

Please check the **actual filesystem size** inside the volume using df -h (not just the block device size), and if it's still small, **run resize2fs** on it so it uses the full 40GB.

This is clearly a Railway infrastructure issue. I chose Railway so I wouldn't have to deal with low-level DevOps problems like this, but my service has been offline for 3 days now, which is causing real problems for me.

My workflows contain important data that I haven't exported. Please fix this as soon as possible by expanding the filesystem.

I need a quick resolution. Please update me on when this will be done.

Thank you.

