Postgres crash loop / Potential Data Loss

knarx

PRO, OP

2 months ago

Postgres service is in a crash loop on redeploy. Container startup is invoking initdb unconditionally, which fails (correctly) because PGDATA is populated. Data appears intact on volume (though I don't see any data in the volume on the dashboard); the container's bootstrap script needs to detect existing PGDATA and skip initdb.

Also, I attempted to create a backup via the dashboard while my Postgres service is in the crash loop described above. The backup completed and the dashboard shows it was created successfully, but its size is identical to my previous backup from 3 weeks ago. Given the volume should have grown materially in that time, I'm concerned the backup operation isn't actually capturing current volume state. Please confirm whether the dashboard backup is reliable while the Postgres service is in a crash loop.

Closed

$20 Bounty

14 Replies

Railway

BOT

2 months ago

This Postgres crash loop is very likely a downstream effect of the [active service disruption](https://status.railway.com/incident/I23M92U0) that began at 02:25 UTC today, which triggered redeployments across the platform. Your service's logs confirm `initdb` is being invoked repeatedly against an already-populated PGDATA directory starting at 06:19 UTC. The incident is now in monitoring status and we are automatically redeploying unhealthy services - if yours doesn't recover on its own, try triggering a fresh redeploy from the dashboard. Regarding your backup concern: volume backups are volume-level snapshots and capture the data on disk regardless of whether the Postgres process is running, so the backup taken during the crash loop should reflect your current volume state. The dashboard may show stale incremental sizes due to frontend caching (sizes are cached for a couple of hours), which could explain why it appears identical to your previous backup.

Status changed to Awaiting User Response Railway • about 2 months ago

knarx

PRO, OP

2 months ago

I've been redeploying and still hit the same issue. Is the volume not actually accessible because of the outage? The fact that I can't see its size in the dashboard is concerning.

Status changed to Awaiting Railway Response Railway • about 2 months ago

Railway

BOT

2 months ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open Railway • about 2 months ago

chrishalaby

PRO

2 months ago

I've been having the same issue on 2 of my projects.

Symptoms:

- Deployment briefly shows “successful” then immediately changes to “crashed”
- Local connections to DATABASE_URL now fail
- No backups/snapshots are available for this volume

Relevant logs:

received fast shutdown request

database system is shut down

Then later:

ERROR (catatonit:2): failed to exec pid1: No such file or directory

Current behavior:

- Service starts for a few seconds
- Then crashes repeatedly

Postgres logs mostly show normal checkpoints plus:

- invalid length of startup packet
- SSL negotiation errors
- no PANIC/WAL corruption/fatal database errors visible

Things already tried:

- Restarting service
- Redeploying
- Attempting local DB connection
- Checking backups (none available)

dixonlesley-coder

HOBBY

2 months ago

same issue, shows successful for 1 second then immediately crashed

chrishalaby

PRO

2 months ago

I just checked, and redeploying actually did work now. Make sure you aren't just restarting the service; you need to redeploy it.

knarx

PRO, OP

2 months ago

yeah I'm redeploying, still crashing

knarx

PRO, OP

2 months ago

Critical update: I now have evidence the data on my volume has been replaced with the database's initial state from when it was first provisioned. The pgvector service was created Feb 23, 2026 around 01:38 UTC. The currently-mounted volume vol_jwwp0ua41m4nkdce contains 47 MB of data with all file mtimes from 01:38-01:44 on Feb 23. The railway database exists but has zero public tables. The Postgres log says 'database system was interrupted; last known up at 2026-02-23 01:43:48 UTC.'

My production database had ~3.7M products and was actively written to as of yesterday before the outage. Either the volume contents were rolled back to a Feb 23 snapshot, or the wrong volume was mounted. I need to know if you can locate the actual recent state of my data. This is potentially complete data loss for me and I need an answer urgently.

Project: trustworthy-forgiveness, service: pgvector.

knarx

PRO, OP

2 months ago

Two additional findings I should have included in the last post:

There may be a recoverable backup.

Both my existing 3-week-old backup and a new backup I attempted to create yesterday show 97 MB in the dashboard. That's notably larger than the 47 MB on the currently-mounted volume.

These can't both be accurate concurrent snapshots of the same volume contents. So either:

- The backup system is reading from a different volume than what's mounted to my service, meaning my data may exist in Railway's storage but isn't attached to my pgvector service right now, or
- The "new" backup didn't actually run and is just the cached prior snapshot, or
- The backup is more space-efficient than on-disk and represents meaningfully more data

Whatever the case, 97 MB of backup data is dramatically better than 47 MB of empty initial-state volume. Even a 3-weeks-stale backup would be a partial recovery I can work with, vs total loss.

Can someone restore the 97 MB backup to a fresh volume — separate from vol_jwwp0ua41m4nkdce — so I can mount it and inspect what's in it? I want to confirm it contains a populated public schema before committing to it as the recovery target. Restoring as a parallel volume rather than overwriting in-place means we don't lose the diagnostic state of the current volume.

There's also a bootstrap path bug worth flagging for after recovery.

Even if the volume contents are restored, the docker-entrypoint will continue to fail on this service. The PGDATA env var is set to /var/lib/postgresql/data, but the actual data layout is in a subdirectory at /var/lib/postgresql/data/pgdata. docker-entrypoint checks $PGDATA/PG_VERSION, doesn't find it at the volume root, assumes uninitialized, and invokes initdb -D /var/lib/postgresql/data. initdb then refuses because the volume root has other contents.

The fix once the data is restored is either to set PGDATA=/var/lib/postgresql/data/pgdata in the service env vars, or restructure the volume layout. Posting this in case it's useful for the wider Postgres template — I assume the path structure was intentional to keep certs/ and lost+found/ separate from PGDATA, but the env var didn't get updated to match.
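For anyone who wants to check a volume for the same mismatch, the `PG_VERSION` lookup described above can be sketched in Python. This is a minimal illustration, not the actual docker-entrypoint logic: `find_pgdata` and `check_pgdata_env` are hypothetical helper names, and the sketch assumes diagnostic shell access to the mounted volume and that the cluster lives either at the mount root or one directory below it.

```python
from pathlib import Path
from typing import Optional


def find_pgdata(mount: Path) -> Optional[Path]:
    """Return the directory under `mount` that holds a non-empty PG_VERSION.

    Checks the mount root first, then its immediate subdirectories.
    Hypothetical helper for illustration only.
    """
    candidates = [mount] + sorted(p for p in mount.iterdir() if p.is_dir())
    for candidate in candidates:
        marker = candidate / "PG_VERSION"
        try:
            if marker.is_file() and marker.stat().st_size > 0:
                return candidate
        except PermissionError:
            # e.g. lost+found/ is often unreadable for non-root users
            continue
    return None


def check_pgdata_env(mount: Path, pgdata_env: str) -> str:
    """Compare the PGDATA setting with where the cluster actually lives."""
    actual = find_pgdata(mount)
    if actual is None:
        return "no initialized cluster found"
    if actual == Path(pgdata_env):
        return "PGDATA matches"
    return f"PGDATA mismatch: set PGDATA={actual}"
```

Run against the layout described in this post, `check_pgdata_env(Path("/var/lib/postgresql/data"), "/var/lib/postgresql/data")` would be expected to report a mismatch pointing at the `pgdata` subdirectory.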

wadepierce

PRO

2 months ago

This seems to have crashed/affected almost every one of my projects. I'm having difficulty getting them to restart

wadepierce

PRO

2 months ago

It seems redeploying has now started to resolve most of the issues. Wow. What a thing to wake up to!

knarx

PRO, OP

2 months ago

Update: The dashboard is now showing the PG volume at 97 MB. I re-mounted the volume diagnostically to verify whether the dashboard's "97 MB" reflected new volume contents. It does not. The volume vol_jwwp0ua41m4nkdce still contains the same Feb 23 initial-bootstrap data as before: 46 MB total, base/16384 (the railway db) is 7.4 MB, and all actual table data files have mtimes from 2026-02-23 01:43. The dashboard size display is misleading: it doesn't reflect the volume contents Postgres would actually serve from.

Whatever the 97 MB backup is reading from, it's a separate location from what's currently mounted. It would be helpful if I could restore it to a fresh volume to inspect it, or if someone could figure out where my most recent database lives, which should be larger than the backup.
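A check like the one described here (total size plus the range of file mtimes) can be reproduced with a short script. The following is a sketch under assumptions: `summarize_tree` is a hypothetical helper, Python is assumed to be available in the diagnostic container, and the total is the sum of apparent file sizes, which can differ somewhat from what `du` or the filesystem reports.

```python
import os
from datetime import datetime, timezone
from pathlib import Path


def summarize_tree(root: Path) -> dict:
    """Sum file sizes under `root` and report the oldest and newest mtime (UTC)."""
    total_bytes = 0
    file_count = 0
    oldest = None
    newest = None
    # os.walk silently skips directories it cannot read (e.g. lost+found/)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            total_bytes += st.st_size
            file_count += 1
            oldest = st.st_mtime if oldest is None else min(oldest, st.st_mtime)
            newest = st.st_mtime if newest is None else max(newest, st.st_mtime)

    def to_utc(ts):
        return None if ts is None else datetime.fromtimestamp(ts, tz=timezone.utc)

    return {
        "files": file_count,
        "total_bytes": total_bytes,
        "oldest_mtime": to_utc(oldest),
        "newest_mtime": to_utc(newest),
    }
```

If `newest_mtime` for the mounted volume falls on the provisioning date, as reported above, no file on that volume has been written since the initial bootstrap.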

knarx

PRO, OP

2 months ago

Additional findings from continued investigation:

Volume size comparison with a duplicate service

I created a duplicate of my pgvector service (pgvector Copy) to compare volume behavior. Confirmed via diagnostic SSH access (custom start command set to sh -c "tail -f /dev/null" on both services).

Original service (pgvector):

Volume ID: vol_jwwp0ua41m4nkdce
Volume name: postgres-volume
Filesystem: 434 MB allocated, 46 MB used
Mount device: /dev/zd8416
Contents: certs/, lost+found/, pgdata/ (with PG_VERSION 17, Feb 23 bootstrap data)
Created: 2026-02-23

Duplicate service (pgvector Copy):

Volume ID: vol_po3vk2sv1c3909yc
Volume name: merry-volume
Filesystem: 46 GB allocated, 2.1 MB used
Mount device: /dev/zd2064
Contents: lost+found/ only (completely empty)
Created: 2026-05-20

The roughly 100x size disparity between my original volume (434 MB) and a freshly-provisioned volume in the same project (46 GB) suggests my service may be on an older or smaller storage template than current Railway defaults.

Expected database size

My production database has been written to continuously for three months. The primary table alone had 3,741,627 rows as of a successful query the day before the incident, plus several million additional rows across related tables. A realistic on-disk size estimate is in the range of 8-12 GB, with an absolute theoretical floor of 3-4 GB once Postgres indexes are accounted for.

Conclusion

The currently-mounted volume (vol_jwwp0ua41m4nkdce, 434 MB allocated, 46 MB used) cannot physically have contained my production data at any point. It is too small by at least an order of magnitude. The data on it (Feb 23 initial-bootstrap state, 7.5 MB railway database, no production tables) is consistent with this being either the original initial-provisioning volume or an early test volume that was reattached during the post-GCP recovery process.

My actual production volume, sized 10-50 GB based on data volume estimates, must exist somewhere in Railway's storage layer and was likely orphaned or unattached during recovery. Please verify:

a) What volume was attached to my pgvector service immediately before the 2026-05-19 incident.

b) Whether that volume still exists in your storage layer.

c) Whether it can be reattached to my pgvector service, or whether its data can be migrated to a new volume mounted to the service.

Project: trustworthy-forgiveness

Project ID: 2497f3fe-fd67-4c1f-b8fb-c872f62c98f9

Service: pgvector

Current volume: vol_jwwp0ua41m4nkdce

knarx

PRO, OP

2 months ago

Combining a few key findings into one clearer narrative for the support team:

Service history

My pgvector service was created on the Railway Hobby plan on 2026-02-23, which at the time provisioned a fixed 500MB Postgres volume with no resize option. I subsequently upgraded to the Pro plan. My production database has been written to continuously since then; as of yesterday morning the primary table alone contained 3,741,627 rows, with several million additional rows across related tables.

Dashboard metrics have been stuck at 98 MB for at least 30 days

The Railway dashboard's storage metric for my Postgres database has shown 98 MB consistently for the entire visible 30-day log window. 98 MB matches the initial-bootstrap state from very early in this service's life.

The mismatch

A database with 3.7M+ rows in the primary table, accumulating continuously for three months, cannot be 98 MB. A reasonable size estimate given the schema is 8-12 GB, with a theoretical floor around 3-4 GB once indexes are accounted for. So either my application's writes were silently failing for a month (they weren't - the application was functioning normally), or Railway's storage metrics for this service have been tracking a different volume than the one actually receiving my writes.
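The lower bound behind this claim can be checked with simple arithmetic. The sketch below does not reproduce the 8-12 GB estimate, which depends on the schema; it only asks how many bytes each row could occupy if the entire 98 MB were spent on the primary table. The per-row overhead is an assumption based on PostgreSQL's documented heap layout on typical 64-bit builds (a 23-byte tuple header padded to 24 bytes, plus a 4-byte line pointer), and the constant and function names are illustrative.

```python
# Assumed per-row overhead of a PostgreSQL heap table on a typical 64-bit build:
# a 23-byte tuple header padded to 24 bytes, plus a 4-byte line pointer per row.
TUPLE_HEADER_BYTES = 24
LINE_POINTER_BYTES = 4
MIN_ROW_OVERHEAD = TUPLE_HEADER_BYTES + LINE_POINTER_BYTES


def bytes_available_per_row(total_mb: float, rows: int) -> float:
    """Bytes each row could occupy if the whole size were spent on one table."""
    return total_mb * 1024 * 1024 / rows


ROWS_PRIMARY_TABLE = 3_741_627  # row count quoted in this thread
DASHBOARD_MB = 98               # size shown by the dashboard

budget = bytes_available_per_row(DASHBOARD_MB, ROWS_PRIMARY_TABLE)
fits = budget >= MIN_ROW_OVERHEAD

print(f"{budget:.1f} bytes per row available; "
      f"{MIN_ROW_OVERHEAD} needed before any column data, indexes, or catalogs")
print("plausible" if fits else "not plausible")
```

Under these assumptions the per-row budget falls below the fixed overhead alone, before counting any column data, indexes, or the other tables.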

The likely sequence

The most plausible explanation is that the Hobby-to-Pro upgrade triggered some form of storage change - an in-place volume migration, a backing-store swap, or a service-level swap. After that point, production writes were landing on the new/larger volume, but Railway's bookkeeping (and dashboard metrics) continued referencing the original pre-upgrade 500MB volume.

The recovery failure mode

When the post-GCP-outage recovery process reattached "my volume" to the pgvector service, it used those stale records and pulled in the original Hobby-era 500MB volume (vol_jwwp0ua41m4nkdce, currently showing 46 MB used with all file mtimes from Feb 23). The actual production volume - wherever it lives in your storage layer - was not reattached.

What I'm asking

Please check your storage records for any volume tied to project trustworthy-forgiveness, service pgvector, that does NOT match the currently-attached vol_jwwp0ua41m4nkdce. Focus on any storage activity around my Hobby-to-Pro plan upgrade, which is the most likely point at which the divergence started. If such a volume exists, please reattach it (or migrate its contents to a fresh volume mounted to the service).

The data should still exist - my writes were succeeding, the application was working, and Railway's backup mechanism reports 97 MB which is itself larger than the currently-mounted 46 MB volume, suggesting some other source of truth exists in your storage layer beyond what's visible to me.

Project: trustworthy-forgiveness

Project ID: 2497f3fe-fd67-4c1f-b8fb-c872f62c98f9

Service: pgvector

Current volume: vol_jwwp0ua41m4nkdce

knarx

PRO, OP

2 months ago

Quick update with sharper evidence + one concrete ask

Pulled the snapshot history off volumeInstanceBackupList and the contrast against a healthy volume on the same project is clear.

pgvector volume 46342e21-1955-4ed4-ae1d-dfeea78b5c76 — both snapshots flat at 97MB:

| Created | externalId | referencedMB |
|---|---|---|
| 2026-04-28 14:23 UTC | vs_1777386215051_f6rn4tf0yn6bzpgk | 97 |
| 2026-05-20 06:23 UTC | vs_1779258184033_e1bolewmd220iwu9 | 97 |

For comparison, my older meilisearch volume 81fb7ffa-ef84-405c-b561-6bf6e5ccaa3b on the same project — normal growth curve:

| Created | Name | referencedMB |
|---|---|---|
| 2026-02-26 | "Online resize to 1000MB" | 452 |
| 2026-03-04 | "Online resize to 5000MB" | 990 |
| 2026-05-05 | "Online resize to 10000MB" | 4152 |

The April 28 pgvector snapshot is three weeks before the outage. The volume metadata clearly stopped tracking real disk usage well before May 19, which is consistent with the Hobby→Pro theory I sketched in my previous reply.
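The contrast between the two snapshot histories can be expressed as a small check. This is an illustrative sketch only: the tuples are copied from the tables above, `is_flat` and `growth_mb` are hypothetical helpers, and nothing here calls Railway's API.

```python
from typing import List, Tuple

Snapshot = Tuple[str, int]  # (created date in ISO format, referencedMB)


def growth_mb(snapshots: List[Snapshot]) -> int:
    """Change in referencedMB between the earliest and latest snapshot."""
    ordered = sorted(snapshots)  # ISO dates sort chronologically as strings
    return ordered[-1][1] - ordered[0][1]


def is_flat(snapshots: List[Snapshot], tolerance_mb: int = 0) -> bool:
    """True when every snapshot reports the same size within the tolerance."""
    sizes = [mb for _, mb in snapshots]
    return max(sizes) - min(sizes) <= tolerance_mb


pgvector = [("2026-04-28", 97), ("2026-05-20", 97)]
meilisearch = [("2026-02-26", 452), ("2026-03-04", 990), ("2026-05-05", 4152)]

for name, history in (("pgvector", pgvector), ("meilisearch", meilisearch)):
    status = "flat" if is_flat(history) else "growing"
    print(f"{name}: {status}, change of {growth_mb(history)} MB")
```

A flat series on a volume that is supposedly receiving continuous writes is the signal being pointed at here; two data points are thin evidence on their own, which is why the comparison with the meilisearch volume matters.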

Concrete ask for anyone with admin GraphQL access:

Could someone on the Railway team run adminVolumeInstancesForVolume(volumeId: "46342e21-1955-4ed4-ae1d-dfeea78b5c76")? If the underlying externalId ever changed during this logical volume's lifetime, the previous externalId should be on the GCP-side disk that was actually backing my writes.

I have a meeting set with Angelo from Railway on Tuesday next week for help, but would appreciate any insights the engineering team can give me in the interim.

Status changed to Closed brody • about 1 month ago
