2 months ago
Service: meticulous-empathy (Postgres)
ProjectID: e27476ab-3a0f-4a26-8a40-108b5e4683ce
Environment: staging (8436b732-7464-4596-a695-a4bdbced9826)
ServiceID: 548900f7-2865-4005-8bef-bb959e3fb4cf
VolumeID: 521269e9-33ba-4957-8e91-a388d6d1aa43 (volume instance 24eeaa53-6053-4496-8081-9c217cf5115a)
---
What happened:
I resized the Postgres volume from ~500MB to 5GB via the Railway dashboard. The UI showed the resize as successful. However, although the block device was expanded, resize2fs was never run, so the ext4 filesystem inside the volume remained at 434MB. The database had no idea it had more space available.
Postgres had already crashed due to a full disk (ENOSPC while writing a WAL xlogtemp file). After the "successful" resize, Postgres continued crash-looping because the filesystem was still full.
Verified on the currently running container right now:
$ df -h /var/lib/postgresql/data
Filesystem Size Used Avail Use% Mounted on
/dev/zd7136 433M 310M 123M 72% /var/lib/postgresql/data
$ grep zd7136 /proc/partitions
230 7136 4886528 zd7136 ← block device is correctly 4.7GB
The block device (/dev/zd7136, major=230 minor=7136) is 4.7GB as expected. The ext4 filesystem on top of it is still 434MB. resize2fs was never run.
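For anyone who wants to run the same check programmatically, here is a rough Python sketch of the comparison above. It is only an illustration: the helper names (`block_device_bytes`, `filesystem_bytes`, `looks_unexpanded`) and the 10% tolerance are my own assumptions, not anything Railway provides. It relies on /proc/partitions reporting sizes in 1 KiB blocks.

```python
import os
from typing import Optional


def block_device_bytes(partitions_text: str, device: str) -> Optional[int]:
    """Return the size in bytes of `device` as listed in /proc/partitions text."""
    for line in partitions_text.splitlines():
        fields = line.split()
        # Data rows look like: major minor #blocks name (blocks are 1 KiB each).
        if len(fields) >= 4 and fields[3] == device and fields[2].isdigit():
            return int(fields[2]) * 1024
    return None


def filesystem_bytes(mount_point: str) -> int:
    """Return the total size in bytes of the filesystem mounted at `mount_point`."""
    st = os.statvfs(mount_point)
    return st.f_frsize * st.f_blocks


def looks_unexpanded(device_bytes: int, fs_bytes: int, tolerance: float = 0.10) -> bool:
    """True when the filesystem is smaller than the device by more than `tolerance`.

    The tolerance (hypothetical default of 10%) allows for filesystem metadata overhead.
    """
    return fs_bytes < device_bytes * (1 - tolerance)


def check(device: str, mount_point: str) -> Optional[bool]:
    """Compare a device in /proc/partitions with the filesystem mounted on it."""
    with open("/proc/partitions") as f:
        device_bytes = block_device_bytes(f.read(), device)
    if device_bytes is None:
        return None  # device not listed
    return looks_unexpanded(device_bytes, filesystem_bytes(mount_point))
```

In my case this would be called as `check("zd7136", "/var/lib/postgresql/data")` from inside the container.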
Impact:
Because the filesystem wasn't expanded, I had to manually delete WAL files to free enough space to run pg_resetwal -f, which caused data loss: approximately one checkpoint's worth of transactions (data created the previous night) is unrecoverable.
What I need:
Please run resize2fs /dev/zd7136 on the host for this volume. The filesystem should expand from 434MB to the full ~4.7GB. The volume is currently mounted and postgres
is running, so an online resize is preferable, but I can tolerate a brief restart if needed.
Your January 2026 changelog explicitly states live volume resizing "automatically extends the filesystem to utilize the additional space" — that did not happen here.
Status changed to Awaiting Railway Response Railway • about 2 months ago
2 months ago
Hey, sorry about this. You're right that the volume resize didn't fully apply, so your database was still working with the original ~480 MB of space despite the dashboard showing 5 GB.
We've corrected this on our end and redeployed your service. Your Postgres volume now has the full ~4.7 GB of available disk space.
Status changed to Awaiting User Response Railway • about 2 months ago
2 months ago
Hey, thanks for jumping on this and getting the filesystem resized.
Unfortunately after your redeploy, Postgres still isn't starting. Here are the logs from the Postgres service:
2026-04-07 00:13:17.472 UTC [6] LOG: could not open file "postmaster.pid": Input/output error; continuing anyway
2026-04-07 00:13:17.473 UTC [140] FATAL: could not open file "global/pg_filenode.map": Input/output error
To my knowledge, the database was actually running after my manual recovery steps. The issues seem to have started after your resize/redeploy. A few follow-up questions:
1. What exactly happened during the redeploy — did you remount the volume or make any changes to the data directory itself?
2. Can you verify the volume is healthy at the block device level and that the filesystem is readable?
3. Do you have a snapshot of the volume from before your intervention that could be restored?
I'd really like to recover the data if possible. Can you help get this sorted?
Status changed to Awaiting Railway Response Railway • about 2 months ago
Status changed to Awaiting User Response Railway • about 2 months ago
brody
Fixed and confirmed back online. Sorry about that!
a month ago
thank you, appreciate the help. jfyi several / all of my db indexes were corrupted. had to reindex.
```
sqlalchemy.exc.InternalError: (sqlalchemy.dialects.postgresql.asyncpg.InternalServerError) <class 'asyncpg.exceptions.IndexCorruptedError'>: index "pg_toast_16436_index" contains unexpected zero page at block 121
HINT: Please REINDEX it.
```
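In case it helps anyone else, here is a small Python sketch of how the index name can be pulled out of an error like the one above and turned into a REINDEX statement. The helper names are hypothetical, and the regex only matches the "unexpected zero page" wording shown here. TOAST indexes live in the `pg_toast` schema, so that schema is passed explicitly in the usage line. Note that REINDEX only rebuilds the index from the table; it does not repair damaged table or TOAST data.

```python
import re
from typing import Optional

# Matches the message format shown in the error above (assumed stable for this sketch).
_INDEX_ERROR = re.compile(r'index "([^"]+)" contains unexpected zero page')


def corrupted_index_name(message: str) -> Optional[str]:
    """Extract the index name from an 'unexpected zero page' error message."""
    match = _INDEX_ERROR.search(message)
    return match.group(1) if match else None


def quote_ident(name: str) -> str:
    """Quote a SQL identifier by wrapping it in double quotes and doubling any inside."""
    return '"' + name.replace('"', '""') + '"'


def reindex_statement(index: str, schema: Optional[str] = None) -> str:
    """Build a REINDEX INDEX statement, optionally schema-qualified."""
    if schema is None:
        qualified = quote_ident(index)
    else:
        qualified = f"{quote_ident(schema)}.{quote_ident(index)}"
    return f"REINDEX INDEX {qualified};"
```

For the error above, `reindex_statement(corrupted_index_name(msg), "pg_toast")` yields a statement that can be run through psql or any driver.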
Status changed to Awaiting Railway Response Railway • about 2 months ago
a month ago
actually seeing more widespread data corruption.
sqlalchemy.exc.InternalError: (sqlalchemy.dialects.postgresql.asyncpg.InternalServerError) <class 'asyncpg.exceptions.DataCorruptedError'>: missing chunk number 0 for toast value 24626 in pg_toast_16436
a month ago
I deleted my data to recover my environment and restore service. you guys should probably revisit your volume resizing - it resulted in total data loss.
a month ago
The volume resize bug that prevented the filesystem from extending has been fixed, and we apologize for that. Going forward, volume resizes will correctly expand the filesystem automatically. That said, the data loss was a result of the manual recovery steps taken (deleting WAL files and running pg_resetwal), not the resize bug itself. The resize created the disk pressure, but the unrecoverable corruption came from those manual interventions.
Status changed to Awaiting User Response Railway • about 1 month ago
a month ago
Possibly, but you should know the system was working without corruption issues after my manual interventions. Corruption showed up after your resize and redeploy intervention.
Status changed to Awaiting Railway Response Railway • about 1 month ago
a month ago
We understand the timing, but the data corruption is consistent with the expected risks of your manual recovery steps. The resize bug was ours and has been fixed. The subsequent data loss, however, was a result of your manual interventions.
Status changed to Awaiting User Response Railway • about 1 month ago
a month ago
The underlying issue has been resolved. If you are still having a problem with your volume, it may require individual intervention. Please reply here to let us know, and we can try to address it. Thanks!
a month ago
This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!
Status changed to Solved Railway • 30 days ago