Production database service replaced with wrong image, data on volume but unreadable

carlosenzos

PRO · OP

2 months ago

Project: broll-system (id 7ac0a455-04c6-4583-b470-75ade9f31f2a)

Service: pgvector-db (id f0889861-dd58-47ce-8ea9-a41f78139ffe)

Volume: pgvector-db-volume (id 7a95edcd-52d2-409c-bbd9-aa49237c04da)

Approximate incident time: 2026-04-30 ~12:00 UTC

What happened:

1. We accidentally ran railway up while the CLI was linked to pgvector-db (a managed Postgres service), deploying a different repo's source code (a FastAPI app) onto it.

2. The deploy replaced the Postgres image with a custom build, breaking the database service.

3. We attempted to roll back to the previous Postgres deployment, but the rollback re-initialized PGDATA into what appears to be a new sub-directory, leaving the original Postgres data files on the volume but unread.

4. The volume currently shows ~1-2 GB of data (Railway dashboard shows volume usage ~1.5 GB).

5. Connecting to pgvector-db via railway connect now returns "No supported database found in service", and a direct psql connection fails with "server closed connection unexpectedly".

What we need:

We believe the original PostgreSQL data files are still on the pgvector-db-volume, but the current container is not reading them. Could you mount the volume to a fresh PostgreSQL container with the correct PGDATA path, attempt to start Postgres on the existing data files, and let us know what's recoverable?

The most recent manual backup (1.08 GB, 16:08 UTC) appears to have been taken AFTER the incident and contains only ~1% of the expected rows, so it is not usable for recovery.

Time-sensitive: the production B-roll pipeline is halted. ~200,000 clip records may be lost without volume-level recovery.

Please confirm receipt and let us know expected response time. Happy to provide any further details.

$20 Bounty

2 Replies


passos

MODERATOR

2 months ago

Hey, I also strongly believe that the data is still in the volume. My guess is that the service got misconfigured (either through invalid environment variables or other issues).

I would recommend creating another pgvector-db, detaching the volume from your current broken pgvector service, and attaching it to the new pgvector-db. That should hopefully resolve the issue.

Additionally, you can SSH into the service to investigate further. Perhaps the data volume was moved or wiped. Confirming via SSH that the data is still there would reassure us that it's just a misconfiguration or missing configuration option.
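
For the SSH check, a read-only survey of where the data actually sits on the volume can help. The following is a rough sketch, not an official Railway tool. It assumes Python is available in the container (or on a machine that has a copy of the volume mounted), and both the VOLUME_MOUNT variable and the default path are hypothetical placeholders for wherever the volume is mounted.

```python
import os
from pathlib import Path


def dir_size_bytes(path: Path) -> int:
    """Sum file sizes under path without following symlinks (read-only)."""
    total = 0
    for root, _dirs, files in os.walk(path, followlinks=False):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # Skip files that vanish or cannot be read; this is only a survey.
                continue
    return total


def usage_by_entry(mount: str) -> dict[str, int]:
    """Map each top-level entry under the mount to its total size in bytes."""
    sizes = {}
    for entry in Path(mount).iterdir():
        if entry.is_dir() and not entry.is_symlink():
            sizes[entry.name] = dir_size_bytes(entry)
        else:
            sizes[entry.name] = entry.lstat().st_size
    return sizes


if __name__ == "__main__":
    # Hypothetical mount path; substitute the path the volume is mounted at.
    mount_path = os.environ.get("VOLUME_MOUNT", "/var/lib/postgresql/data")
    if os.path.isdir(mount_path):
        ranked = sorted(usage_by_entry(mount_path).items(), key=lambda kv: -kv[1])
        for name, size in ranked:
            print(f"{size / 1e6:10.1f} MB  {name}")
    else:
        print(f"{mount_path} is not a directory; set VOLUME_MOUNT to the real mount path")
```

If one entry accounts for most of the ~1.5 GB the dashboard reports, that is a likely place to look for the original cluster.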

Let us know if you hit any roadblocks while trying the suggestions above!

hjh1234521

FREE

2 months ago

I would not try this directly on the original volume first.

Before any recovery attempt, preserve the current volume state. If Railway supports cloning/snapshotting the volume, work from the clone. Avoid any deploy that may run initdb, migrations, or other writes against the original volume.

This sounds like the current container may be using a new PGDATA path while the original PostgreSQL cluster still exists elsewhere on the mounted volume.

A conservative recovery path would be:

1. Confirm the original PostgreSQL major version.

2. Create a fresh PostgreSQL/pgvector service using the same major version.

3. Attach a cloned/snapshotted copy of the volume if possible.

4. Inspect the mounted volume before allowing Postgres to initialize anything.

5. Locate the real data directory. It should contain PG_VERSION, base/, global/, and pg_wal/.

6. If the original cluster is in a nested directory, set PGDATA to that exact path.

7. If it starts successfully, immediately run pg_dump or pg_dumpall and restore into a clean database service.
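
Steps 4 to 6 can be sketched as a read-only scan. This is only an illustration under assumptions: the function name, the depth limit, and the mount path in the usage note are hypothetical, and the script needs to run somewhere Python is available with the (ideally cloned) volume mounted. It looks for directories that contain all four markers listed in step 5 and reports the contents of PG_VERSION, which on PostgreSQL 10 and later is the major version number.

```python
import os
from pathlib import Path

# Entries named in step 5 as markers of a real data directory.
CLUSTER_MARKERS = ("PG_VERSION", "base", "global", "pg_wal")


def find_pg_clusters(mount: str, max_depth: int = 4) -> list[tuple[str, str]]:
    """Return sorted (path, version) pairs for directories that look like clusters.

    Read-only: nothing under the mount is created, modified, or deleted.
    Matches are not pruned, so a cluster nested inside another one is
    still reported as long as it sits within max_depth of the mount.
    """
    found = []
    root = Path(mount)
    for current, dirs, _files in os.walk(root):
        here = Path(current)
        depth = len(here.relative_to(root).parts)
        if all((here / marker).exists() for marker in CLUSTER_MARKERS):
            version = (here / "PG_VERSION").read_text().strip()
            found.append((str(here), version))
        if depth >= max_depth:
            dirs[:] = []  # stop descending past the depth limit
    return sorted(found)
```

Calling find_pg_clusters("/var/lib/postgresql/data") (hypothetical path) may return more than one candidate here, since the rollback apparently initialized a second cluster. Comparing the on-disk size of each candidate should help tell the original from the freshly initialized one.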

If Postgres attempts to initialize a new cluster, stop immediately. That means it is still not pointed at the original data directory.
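
A small pre-start check can catch this before the server runs. The sketch below assumes the service image behaves like the official Docker postgres image, which, as far as I know, initializes a new cluster when PG_VERSION under PGDATA is missing or empty; whether Railway's template does the same is an assumption worth verifying. The function names are hypothetical.

```python
from pathlib import Path


def looks_initialized(pgdata: str) -> bool:
    """True if pgdata holds a non-empty PG_VERSION file, i.e. an existing cluster."""
    version_file = Path(pgdata) / "PG_VERSION"
    return version_file.is_file() and version_file.stat().st_size > 0


def check_before_start(pgdata: str, expected_major: str) -> str:
    """Return 'ok', or a reason not to start Postgres against pgdata."""
    if not looks_initialized(pgdata):
        return "no cluster here: a start would likely run initdb"
    actual = (Path(pgdata) / "PG_VERSION").read_text().strip()
    if actual != expected_major:
        return f"version mismatch: data is {actual}, server is {expected_major}"
    return "ok"
```

Anything other than "ok" means PGDATA or the image version should be corrected before the service is deployed against the volume.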

I would avoid any recovery attempts directly on the only copy of the production volume. Ideally, this should be handled from a clone/snapshot by someone with Railway/Postgres recovery experience.
