Volume resize silently failed — filesystem never expanded, caused crash loop and data loss

heyalexchoi

HOBBYOP

3 months ago

Service: meticulous-empathy (Postgres)

ProjectID: e27476ab-3a0f-4a26-8a40-108b5e4683ce

Environment: staging (8436b732-7464-4596-a695-a4bdbced9826)

ServiceID: 548900f7-2865-4005-8bef-bb959e3fb4cf

VolumeID: 521269e9-33ba-4957-8e91-a388d6d1aa43 (volume instance 24eeaa53-6053-4496-8081-9c217cf5115a)

---

What happened:

I resized the Postgres volume from ~500MB to 5GB via the Railway dashboard. The UI showed the resize as successful. However, although the block device was expanded, resize2fs was never run, so the ext4 filesystem inside the volume remained at 434MB. The database had no idea it had more space available.

Postgres had already crashed due to a full disk (ENOSPC while writing a WAL xlogtemp file). After the "successful" resize, Postgres continued crash-looping because the filesystem was still full.

Verified on the currently running container right now:

$ df -h /var/lib/postgresql/data

Filesystem Size Used Avail Use% Mounted on

/dev/zd7136 433M 310M 123M 72% /var/lib/postgresql/data

$ grep zd7136 /proc/partitions

230 7136 4886528 zd7136 ← block device is correctly 4.7GB

The block device (/dev/zd7136, major=230 minor=7136) is 4.7GB as expected. The ext4 filesystem on top of it is still 434MB. resize2fs was never run.
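The same comparison can be scripted. The following is a minimal sketch, not anything Railway provides: the function names and the 10% tolerance are assumptions chosen for illustration. It relies on /proc/partitions reporting sizes in 1 KiB blocks and on os.statvfs for the mounted filesystem size.

```python
import os


def partition_size_bytes(proc_partitions_text, device_name):
    """Return the size in bytes of device_name, given the text of /proc/partitions.

    /proc/partitions lists sizes in 1 KiB blocks in its third column.
    """
    for line in proc_partitions_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[3] == device_name:
            return int(fields[2]) * 1024
    raise LookupError(f"device {device_name!r} not found in /proc/partitions")


def filesystem_size_bytes(mount_point):
    """Return the total size in bytes of the filesystem mounted at mount_point."""
    st = os.statvfs(mount_point)
    return st.f_frsize * st.f_blocks


def filesystem_lags_device(fs_bytes, dev_bytes, tolerance=0.10):
    """True if the filesystem is smaller than the device by more than tolerance.

    The tolerance (an assumed value) absorbs normal filesystem metadata overhead.
    """
    return fs_bytes < dev_bytes * (1 - tolerance)


if __name__ == "__main__":
    # Device name and mount point are taken from the report above.
    with open("/proc/partitions") as f:
        dev_bytes = partition_size_bytes(f.read(), "zd7136")
    fs_bytes = filesystem_size_bytes("/var/lib/postgresql/data")
    if filesystem_lags_device(fs_bytes, dev_bytes):
        print(f"filesystem ({fs_bytes} B) was not grown to device size ({dev_bytes} B)")
```

With the numbers from this report, 4886528 blocks of 1 KiB is about 4.7GB for the device against roughly 433MB for the filesystem, so the check would flag the volume.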

Impact:

Because the filesystem wasn't expanded, I had to manually delete WAL files to free enough space to run pg_resetwal -f, which caused data loss: approximately one checkpoint's worth of transactions (data created the previous night) is unrecoverable.

What I need:

Please run resize2fs /dev/zd7136 on the host for this volume. The filesystem should expand from 434MB to the full ~4.7GB. The volume is currently mounted and Postgres is running, so an online resize is preferable, but I can tolerate a brief restart if needed.

Your January 2026 changelog explicitly states live volume resizing "automatically extends the filesystem to utilize the additional space" — that did not happen here.

Solved

10 Replies

Status changed to Awaiting Railway Response Railway • 3 months ago

brody

EMPLOYEE

3 months ago

Hey, sorry about this. You're right that the volume resize didn't fully apply, so your database was still working with the original ~480 MB of space despite the dashboard showing 5 GB.

We've corrected this on our end and redeployed your service. Your Postgres volume now has the full ~4.7 GB of available disk space.

Status changed to Awaiting User Response Railway • 3 months ago

heyalexchoi

HOBBYOP

3 months ago

Hey, thanks for jumping on this and getting the filesystem resized.

Unfortunately after your redeploy, Postgres still isn't starting. Here are the logs from the Postgres service:

2026-04-07 00:13:17.472 UTC [6] LOG: could not open file "postmaster.pid": Input/output error; continuing anyway

2026-04-07 00:13:17.473 UTC [140] FATAL: could not open file "global/pg_filenode.map": Input/output error

To my knowledge, the database was actually running after my manual recovery steps. The issues seem to have started after your resize/redeploy. A few follow-up questions:

1. What exactly happened during the redeploy — did you remount the volume or make any changes to the data directory itself?

2. Can you verify the volume is healthy at the block device level and that the filesystem is readable?

3. Do you have a snapshot of the volume from before your intervention that could be restored?

I'd really like to recover the data if possible. Can you help get this sorted?

Status changed to Awaiting Railway Response Railway • 3 months ago

brody

EMPLOYEE

3 months ago

Fixed and confirmed back online. Sorry about that!

Status changed to Awaiting User Response Railway • 3 months ago

heyalexchoi

HOBBYOP

3 months ago

Thank you, appreciate the help. JFYI, several (possibly all) of my DB indexes were corrupted. I had to reindex.

```

sqlalchemy.exc.InternalError: (sqlalchemy.dialects.postgresql.asyncpg.InternalServerError) <class 'asyncpg.exceptions.IndexCorruptedError'>: index "pg_toast_16436_index" contains unexpected zero page at block 121

HINT: Please REINDEX it.

```

Status changed to Awaiting Railway Response Railway • 3 months ago

heyalexchoi

HOBBYOP

3 months ago

Actually, I'm seeing more widespread data corruption.

sqlalchemy.exc.InternalError: (sqlalchemy.dialects.postgresql.asyncpg.InternalServerError) <class 'asyncpg.exceptions.DataCorruptedError'>: missing chunk number 0 for toast value 24626 in pg_toast_16436

heyalexchoi

HOBBYOP

3 months ago

I deleted my data to recover my environment and restore service. You should probably revisit your volume resizing; it resulted in total data loss.

brody

EMPLOYEE

3 months ago

The volume resize bug that prevented the filesystem from extending has been fixed, and we apologize for that. Going forward, volume resizes will correctly expand the filesystem automatically. That said, the data loss was a result of the manual recovery steps taken (deleting WAL files and running pg_resetwal), not the resize bug itself. The resize created the disk pressure, but the unrecoverable corruption came from those manual interventions.

Status changed to Awaiting User Response Railway • 3 months ago

heyalexchoi

HOBBYOP

3 months ago

Possibly, but you should know the system was working without corruption issues after my manual interventions. Corruption showed up after your resize and redeploy intervention.

Status changed to Awaiting Railway Response Railway • 3 months ago

brody

EMPLOYEE

3 months ago

We understand the timing, but the data corruption is consistent with the expected risks of your manual recovery steps. The resize bug was ours and has been fixed. The subsequent data loss, however, was a result of your manual interventions.

Status changed to Awaiting User Response Railway • 3 months ago

sam-a

EMPLOYEE

3 months ago

The underlying issue has been resolved. If you are still having a problem with your volume, it may require individual intervention. Please reply here to let us know, and we can try to address it. Thanks!

Railway

BOT

2 months ago

This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!

Status changed to Solved Railway • 3 months ago
