Automatic Railway Postgres update wiped the database volumes - Railway Central Station

Automatic Railway Postgres update wiped the database volumes

rsproule

PROOP

a year ago

We see a railway automatic deployment of 2 postgres instances in our environment around 50 minutes ago that completely wiped the volumes. Our service immediately started alarming us with errors that table_xyz does not exist.

This was our staging environment so only our dev teams were impacted but we are very concerned about this happening in production.

Project id: 979d8e07-4c99-427d-a247-04265f375550

EnvironmentId c39c1bd0-9903-48d6-a25f-1f1d02a5eda7

36 Replies

rsproule

PROOP

a year ago

I would like to get some sense of what went on here for me to know how to prevent this in prod

a year ago

Hey, is your database still wiped?

rsproule

PROOP

a year ago

yes it never recovered. a new migration executed on service restart

a year ago

And yes, unfortunately this was Railway's fault.

We've identified that a component in our deployment infrastructure became unresponsive, which caused your deployment to hang before eventually failing. We sincerely apologize for the frustration this caused.

We've now implemented additional monitoring to detect this type of issue immediately, so our team can resolve it much faster if it happens again.

a year ago

Do you want to recover it? I can raise it to the team.

rsproule

PROOP

a year ago

yes is that possible. its unclear to me how hang in the deployment causes a disk wipe. is there more i should be doing in production environment to prevent against this. this would have been very costly to us in production.

a year ago

I'll raise this to team, unfortunately I do not know most of the answers to your question.

cc @Brody

a year ago

Hello,

The issue the two other users faced would not cause any data loss.

a year ago

Oh my bad then, ignore what I said.

a year ago

Should I be looking at tax-postgres?

rsproule

PROOP

a year ago

yes, same thing happened to ponder-db in same env

rsproule

PROOP

a year ago

we did not see the hung deployment. deployments were successful but just all the data was gone

a year ago

Looks like ponder-db was last deployed around 2 months ago? should I be looking at another environment? I am looking at staging right now.

rsproule

PROOP

a year ago

just tax-postgres. ignore previous message about ponder-db

a year ago

The team is looking into this now.

a year ago

Would you mind checking now?

a year ago

I believe it's been re-initialized correctly (happy to cover what happened after and what we'll do to fix it in a sec)

rsproule

PROOP

a year ago

yes looks like data has been restored. reminder to encrypt at rest seeing u guys poke around in the table 😂

rsproule

PROOP

a year ago

looking forward to full diagnostic and some advice on protecting our production services from such issues

a year ago

Everything is encrypted at rest on our side!

rsproule

PROOP

a year ago

but presumably you can view tables the same way we can through the webapp

a year ago

Yes. We have a BAA for enterprise that will make it so that someone internally needs your consent

But, in this case, I actually just went off metrics

rsproule

PROOP

a year ago

this

1385077192303837275

a year ago

What happened, and I have to figure out why is that you had 2 identical volumes (as in, they had the same unique identifier)

However one of them was older. I think it might have picked an older snapshot. Why? Ill dig in

a year ago

Yup

rsproule

PROOP

a year ago

would love a full post mortem here and some advice on protecting ourselves from such events in prod. thanks

a year ago

Yup. We believe this occurs when you have backups and a migration

The backup fires, the migration grabs the new partial as gospel, and then migrates that

a year ago

Will get confirmation for this today

a year ago

!remind me to follow up in 4 hours

a year ago

(Latest)

a year ago

just a heads up that he left the server

rsproule

PROOP

10 months ago

im back

rsproule

PROOP

10 months ago

this is still pretty insane to us that we had a volume just bricked. would like some further explanation and some credits

10 months ago

Oh, interesting. I didn't follow up

10 months ago

This is 100% what happened here. We patched it right after you reported it. It's definitely obscene; 1:1B+ collision racecase

10 months ago

Went ahead and applied some credits to your account for the issue!

Welcome!

Sign in to your Railway account to join the conversation.