a day ago
Postgres service is in a crash loop on redeploy. Container startup is invoking initdb unconditionally, which fails (correctly) because PGDATA is populated. Data appears intact on volume (though I don't see any data in the volume on the dashboard); the container's bootstrap script needs to detect existing PGDATA and skip initdb.
Also, I attempted to create a backup via the dashboard while my Postgres service was in the crash loop described above. The backup completed and the dashboard shows it was created successfully, but its size is identical to my previous backup from 3 weeks ago. Given that the volume should have grown materially in that time, I'm concerned the backup operation isn't actually capturing the current volume state. Please confirm whether the dashboard backup is reliable while the Postgres service is in a crash loop.
13 Replies
a day ago
Status changed to Awaiting User Response Railway • 1 day ago
Railway
This Postgres crash loop is very likely a downstream effect of the [active service disruption](https://status.railway.com/incident/I23M92U0) that began at 02:25 UTC today, which triggered redeployments across the platform. Your service's logs confirm `initdb` is being invoked repeatedly against an already-populated PGDATA directory starting at 06:19 UTC. The incident is now in monitoring status and we are automatically redeploying unhealthy services - if yours doesn't recover on its own, try triggering a fresh redeploy from the dashboard. Regarding your backup concern: volume backups are volume-level snapshots and capture the data on disk regardless of whether the Postgres process is running, so the backup taken during the crash loop should reflect your current volume state. The dashboard may show stale incremental sizes due to frontend caching (sizes are cached for a couple of hours), which could explain why it appears identical to your previous backup.
a day ago
I've been trying to redeploy and I'm hitting the same issue. Is the volume not actually accessible because of the outage? The fact that I can't see its size in the dashboard is concerning.
Status changed to Awaiting Railway Response Railway • 1 day ago
a day ago
This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.
Status changed to Open Railway • 1 day ago
a day ago
Been having the same issue on 2 of my projects.
Symptoms:
- Deployment briefly shows “successful” then immediately changes to “crashed”
- Local connections to DATABASE_URL now fail
- No backups/snapshots are available for this volume
Relevant logs:
received fast shutdown request
database system is shut down
Then later:
ERROR (catatonit:2): failed to exec pid1: No such file or directory
Current behavior:
- Service starts for a few seconds
- Then crashes repeatedly
Postgres logs mostly show normal checkpoints plus:
- invalid length of startup packet
- SSL negotiation errors
- no PANIC/WAL corruption/fatal database errors visible
Things already tried:
- Restarting service
- Redeploying
- Attempting local DB connection
- Checking backups (none available)
a day ago
same issue, shows successful for 1 second then immediately crashed
a day ago
I just checked, and redeploying actually did work now. Make sure you're redeploying the service rather than restarting it.
a day ago
yeah I'm redeploying, still crashing
21 hours ago
Critical update: I now have evidence the data on my volume has been replaced with the database's initial-state from when it was first provisioned. The pgvector service was created Feb 23 2026 around 01:38 UTC. The currently-mounted volume vol_jwwp0ua41m4nkdce contains 47MB of data with all file mtimes from 01:38-01:44 on Feb 23. The railway database exists but has zero public tables. Postgres log says 'database system was interrupted; last known up at 2026-02-23 01:43:48 UTC.'
My production database had ~3.7M products and was actively written to as of yesterday before the outage. Either the volume contents were rolled back to a Feb 23 snapshot, or the wrong volume was mounted. I need to know if you can locate the actual recent state of my data. This is potentially complete data loss for me and I need an answer urgently.
Project: trustworthy-forgiveness, service: pgvector.
20 hours ago
Two additional findings I should have included in the last post:
- There may be a recoverable backup.
Both my existing 3-week-old backup and a new backup I attempted to create yesterday show 97 MB in the dashboard. That's notably larger than the 47 MB on the currently-mounted volume.
These can't both be accurate concurrent snapshots of the same volume contents. So either:
- The backup system is reading from a different volume than what's mounted to my service, meaning my data may exist in Railway's storage but isn't attached to my pgvector service right now, or
- The "new" backup didn't actually run and is just the cached prior snapshot, or
- The backup is more space-efficient than on-disk and represents meaningfully more data
Whatever the case, 97 MB of backup data is dramatically better than 47 MB of empty initial-state volume. Even a 3-weeks-stale backup would be a partial recovery I can work with, vs total loss.
Can someone restore the 97 MB backup to a fresh volume — separate from vol_jwwp0ua41m4nkdce — so I can mount it and inspect what's in it? I want to confirm it contains a populated public schema before committing to it as the recovery target. Restoring as a parallel volume rather than overwriting in-place means we don't lose the diagnostic state of the current volume.
- There's also a bootstrap path bug worth flagging for after recovery.
Even if the volume contents are restored, the docker-entrypoint will continue to fail on this service. The PGDATA env var is set to /var/lib/postgresql/data, but the actual data layout is in a subdirectory at /var/lib/postgresql/data/pgdata. docker-entrypoint checks $PGDATA/PG_VERSION, doesn't find it at the volume root, assumes uninitialized, and invokes initdb -D /var/lib/postgresql/data. initdb then refuses because the volume root has other contents.
The fix once the data is restored is either to set PGDATA=/var/lib/postgresql/data/pgdata in the service env vars, or restructure the volume layout. Posting this in case it's useful for the wider Postgres template — I assume the path structure was intentional to keep certs/ and lost+found/ separate from PGDATA, but the env var didn't get updated to match.
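For anyone hitting the same layout, here is a minimal Python sketch of the kind of check described above: look for PG_VERSION at the volume root and in its immediate subdirectories before deciding whether initdb is needed. This is not the actual docker-entrypoint logic; `find_pgdata` and `should_run_initdb` are hypothetical names, and the one-level search depth is an assumption based on the certs/, lost+found/, pgdata/ layout on this volume.

```python
from pathlib import Path
from typing import Optional


def _has_pg_version(directory: Path) -> bool:
    """True if the directory holds a PG_VERSION file (unreadable dirs count as no)."""
    try:
        return (directory / "PG_VERSION").is_file()
    except OSError:
        # e.g. lost+found/ is often readable by root only
        return False


def find_pgdata(volume_root: str) -> Optional[Path]:
    """Return the initialized data directory at or one level below volume_root.

    Hypothetical helper: checks the root first, then each immediate
    subdirectory in sorted order. Returns None if nothing is initialized.
    """
    root = Path(volume_root)
    candidates = [root]
    if root.is_dir():
        candidates += sorted(p for p in root.iterdir() if p.is_dir())
    for candidate in candidates:
        if _has_pg_version(candidate):
            return candidate
    return None


def should_run_initdb(volume_root: str) -> bool:
    """initdb is only appropriate when no existing cluster is found."""
    return find_pgdata(volume_root) is None


if __name__ == "__main__":
    # Mount path taken from the PGDATA value quoted in this post.
    found = find_pgdata("/var/lib/postgresql/data")
    if found is None:
        print("no existing cluster found; initdb would be appropriate")
    else:
        print(f"existing cluster at {found}; point PGDATA here and skip initdb")
```

On a volume laid out like mine, this would report the pgdata/ subdirectory, which is the value PGDATA needs to carry for the entrypoint to skip initdb.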
20 hours ago
wadepierce
This seems to have crashed/affected almost every one of my projects. I'm having difficulty getting them to restart
20 hours ago
It seems redeploying has now started to resolve most of the issues. Wow. What a thing to wake up to!
17 hours ago
Update: The dashboard is now showing the PG volume at 97 MB. I re-mounted the volume diagnostically to verify whether the dashboard's "97 MB" reflected new volume contents. It does not. The volume vol_jwwp0ua41m4nkdce still contains the same Feb 23 initial-bootstrap data as before: 46 MB total, base/16384 (the railway db) is 7.4 MB, and all actual table data files have mtimes from 2026-02-23 01:43. The dashboard size display is misleading: it doesn't reflect the volume contents Postgres would actually serve from.
Whatever the 97 MB backup is reading from, it's a separate location from what's currently mounted. It would help if I could restore it to a fresh volume to inspect it, or find out where my most recent database lives, which should be larger than the backup.
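For others who want to run the same kind of check on their own volume, the inspection above can be approximated with a short Python sketch that walks a directory and reports total file size plus the oldest and newest mtimes. `summarize_tree` is a hypothetical helper name; it sums apparent file sizes, so the total can differ somewhat from what `du` or the dashboard reports.

```python
import os
from datetime import datetime, timezone


def summarize_tree(root):
    """Walk root and return file count, total bytes, and the mtime range (UTC)."""
    total_bytes = 0
    file_count = 0
    oldest = None
    newest = None
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # skip files that vanish or cannot be read
            file_count += 1
            total_bytes += st.st_size
            if oldest is None or st.st_mtime < oldest:
                oldest = st.st_mtime
            if newest is None or st.st_mtime > newest:
                newest = st.st_mtime

    def to_utc(ts):
        return None if ts is None else datetime.fromtimestamp(ts, tz=timezone.utc)

    return {
        "files": file_count,
        "total_bytes": total_bytes,
        "oldest": to_utc(oldest),
        "newest": to_utc(newest),
    }


if __name__ == "__main__":
    # Path assumed from the layout described in this thread.
    summary = summarize_tree("/var/lib/postgresql/data/pgdata")
    print(f"{summary['files']} files, {summary['total_bytes'] / 1e6:.1f} MB")
    print(f"mtimes from {summary['oldest']} to {summary['newest']}")
```

If the newest mtime is months old on a database that was being written to yesterday, the mounted volume is not the one that was receiving writes.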
7 hours ago
Additional findings from continued investigation:
- Volume size comparison with a duplicate service
I created a duplicate of my pgvector service (pgvector Copy) to compare volume behavior. Confirmed via diagnostic SSH access (custom start command set to sh -c "tail -f /dev/null" on both services).
Original service (pgvector):
- Volume ID: vol_jwwp0ua41m4nkdce
- Volume name: postgres-volume
- Filesystem: 434 MB allocated, 46 MB used
- Mount device: /dev/zd8416
- Contents: certs/, lost+found/, pgdata/ (with PG_VERSION 17, Feb 23 bootstrap data)
- Created: 2026-02-23
Duplicate service (pgvector Copy):
- Volume ID: vol_po3vk2sv1c3909yc
- Volume name: merry-volume
- Filesystem: 46 GB allocated, 2.1 MB used
- Mount device: /dev/zd2064
- Contents: lost+found/ only (completely empty)
- Created: 2026-05-20
The 100x size disparity between my original volume (434 MB) and a freshly-provisioned volume in the same project (46 GB) suggests my service may be on an older or smaller storage template than current Railway defaults.
- Expected database size
My production database has been written to continuously for three months. The primary table alone had 3,741,627 rows as of a successful query the day before the incident, plus several million additional rows across related tables. A realistic on-disk size estimate is in the range of 8-12 GB, with an absolute theoretical floor of 3-4 GB once Postgres indexes are accounted for.
- Conclusion
The currently-mounted volume (vol_jwwp0ua41m4nkdce, 434 MB allocated, 46 MB used) cannot physically have contained my production data at any point. It is too small by at least an order of magnitude. The data on it (Feb 23 initial-bootstrap state, 7.5 MB railway database, no production tables) is consistent with this being either the original initial-provisioning volume or an early test volume that was reattached during the post-GCP recovery process.
My actual production volume, sized 10-50 GB based on data volume estimates, must exist somewhere in Railway's storage layer and was likely orphaned or unattached during recovery. Please verify:
a) What volume was attached to my pgvector service immediately before the 2026-05-19 incident.
b) Whether that volume still exists in your storage layer.
c) Whether it can be reattached to my pgvector service, or whether its data can be migrated to a new volume mounted to the service.
Project: trustworthy-forgiveness
Project ID: 2497f3fe-fd67-4c1f-b8fb-c872f62c98f9
Service: pgvector
Current volume: vol_jwwp0ua41m4nkdce
7 hours ago
Combining a few key findings into one clearer narrative for the support team:
- Service history
My pgvector service was created on the Railway Hobby plan on 2026-02-23, which at the time provisioned a fixed 500MB Postgres volume with no resize option. I subsequently upgraded to Pro plan. My production database has been written to continuously since then; as of yesterday morning the primary table alone contained 3,741,627 rows, with several million additional rows across related tables.
- Dashboard metrics have been stuck at 98 MB for at least 30 days
The Railway dashboard's storage metric for my Postgres database has shown 98 MB consistently for the entire visible 30-day log window. 98 MB matches the initial-bootstrap state from very early in this service's life.
- The mismatch
A database with 3.7M+ rows in the primary table, accumulating continuously for three months, cannot be 98 MB. A reasonable size estimate given the schema is 8-12 GB, with a theoretical floor around 3-4 GB once indexes are accounted for. So either my application's writes were silently failing for a month (they weren't - the application was functioning normally), or Railway's storage metrics for this service have been tracking a different volume than the one actually receiving my writes.
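As a rough sanity check on the mismatch, the arithmetic can be sketched in Python: divide the reported size by the row count of the primary table alone and compare the result with the fixed per-row overhead of a Postgres heap table. The overhead constants are assumptions taken from the standard PostgreSQL page layout (a 23-byte tuple header padded to 24 bytes, plus a 4-byte line pointer); `per_row_budget` is a hypothetical helper name, and the calculation ignores indexes, other tables, and system catalogs, all of which would only shrink the budget further.

```python
ROWS_IN_PRIMARY_TABLE = 3_741_627  # row count quoted in this post
REPORTED_MB = 98                   # size shown by the dashboard

# Assumed per-row overhead from the standard PostgreSQL heap page layout.
TUPLE_HEADER_BYTES = 24            # 23-byte header padded to 8-byte alignment
LINE_POINTER_BYTES = 4
MIN_OVERHEAD_PER_ROW = TUPLE_HEADER_BYTES + LINE_POINTER_BYTES


def per_row_budget(size_mb, rows, binary_units=True):
    """Bytes available per row if the whole reported size held only this table."""
    unit = 1024 * 1024 if binary_units else 1_000_000
    return size_mb * unit / rows


if __name__ == "__main__":
    budget = per_row_budget(REPORTED_MB, ROWS_IN_PRIMARY_TABLE)
    print(f"budget per row:   {budget:.1f} bytes")
    print(f"minimum overhead: {MIN_OVERHEAD_PER_ROW} bytes")
    if budget < MIN_OVERHEAD_PER_ROW:
        print("reported size cannot even hold the row headers, let alone column data")
```

Under these assumptions the budget comes out below the per-row overhead whether the dashboard's MB means 10^6 or 2^20 bytes, so the 98 MB figure cannot describe the volume that held the primary table.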
- The likely sequence
The most plausible explanation is that the Hobby-to-Pro upgrade triggered some form of storage change - an in-place volume migration, a backing-store swap, or a service-level swap. After that point, production writes were landing on the new/larger volume, but Railway's bookkeeping (and dashboard metrics) continued referencing the original pre-upgrade 500MB volume.
- The recovery failure mode
When the post-GCP-outage recovery process reattached "my volume" to the pgvector service, it used those stale records and pulled in the original Hobby-era 500MB volume (vol_jwwp0ua41m4nkdce, currently showing 46 MB used with all file mtimes from Feb 23). The actual production volume - wherever it lives in your storage layer - was not reattached.
What I'm asking
Please check your storage records for any volume tied to project trustworthy-forgiveness, service pgvector, that does NOT match the currently-attached vol_jwwp0ua41m4nkdce. Focus on any storage activity around my Hobby-to-Pro plan upgrade, which is the most likely point at which the divergence started. If such a volume exists, please reattach it (or migrate its contents to a fresh volume mounted to the service).
The data should still exist - my writes were succeeding, the application was working, and Railway's backup mechanism reports 97 MB which is itself larger than the currently-mounted 46 MB volume, suggesting some other source of truth exists in your storage layer beyond what's visible to me.
Project: trustworthy-forgiveness
Project ID: 2497f3fe-fd67-4c1f-b8fb-c872f62c98f9
Service: pgvector
Current volume: vol_jwwp0ua41m4nkdce