Automatic Railway Postgres update wiped the database volumes
rsproule
PRO OP

8 months ago

We saw an automatic Railway deployment of 2 Postgres instances in our environment around 50 minutes ago that completely wiped the volumes. Our service immediately started alerting us with errors that table_xyz does not exist.

This was our staging environment so only our dev teams were impacted but we are very concerned about this happening in production.

Project id: 979d8e07-4c99-427d-a247-04265f375550
Environment id: c39c1bd0-9903-48d6-a25f-1f1d02a5eda7

36 Replies

rsproule
PRO OP

8 months ago

I would like to understand what went on here so I know how to prevent this in prod


passos
MODERATOR

8 months ago

Hey, is your database still wiped?


rsproule
PRO OP

8 months ago

yes it never recovered. a new migration executed on service restart


passos
MODERATOR

8 months ago

And yes, unfortunately this was Railway's fault.

We've identified that a component in our deployment infrastructure became unresponsive, which caused your deployment to hang before eventually failing. We sincerely apologize for the frustration this caused.

We've now implemented additional monitoring to detect this type of issue immediately, so our team can resolve it much faster if it happens again.

passos
MODERATOR

8 months ago

Do you want to recover it? I can raise it to the team.


rsproule
PRO OP

8 months ago

yes, is that possible? it's unclear to me how a hang in the deployment causes a disk wipe. is there more i should be doing in our production environment to protect against this? this would have been very costly to us in production.


passos
MODERATOR

8 months ago

I'll raise this to the team. Unfortunately, I do not know the answers to most of your questions.
cc @Brody


brody
EMPLOYEE

8 months ago

Hello,

The issue the two other users faced would not cause any data loss.


passos
MODERATOR

8 months ago

Oh my bad then, ignore what I said.


brody
EMPLOYEE

8 months ago

Should I be looking at tax-postgres?


rsproule
PRO OP

8 months ago

yes, same thing happened to ponder-db in same env


rsproule
PRO OP

8 months ago

we did not see the hung deployment. deployments were successful but just all the data was gone


brody
EMPLOYEE

8 months ago

Looks like ponder-db was last deployed around 2 months ago? Should I be looking at another environment? I am looking at staging right now.


rsproule
PRO OP

8 months ago

just tax-postgres. ignore previous message about ponder-db


brody
EMPLOYEE

8 months ago

The team is looking into this now.


jake
EMPLOYEE

8 months ago

Would you mind checking now?


jake
EMPLOYEE

8 months ago

I believe it's been re-initialized correctly (happy to cover what happened after and what we'll do to fix it in a sec)


rsproule
PRO OP

8 months ago

yes looks like data has been restored. reminder to encrypt at rest seeing u guys poke around in the table 😂


rsproule
PRO OP

8 months ago

looking forward to full diagnostic and some advice on protecting our production services from such issues


jake
EMPLOYEE

8 months ago

Everything is encrypted at rest on our side!


rsproule
PRO OP

8 months ago

but presumably you can view tables the same way we can through the webapp


jake
EMPLOYEE

8 months ago

Yes. We have a BAA for enterprise that will make it so that someone internally needs your consent

But, in this case, I actually just went off metrics


rsproule
PRO OP

8 months ago

this



jake
EMPLOYEE

8 months ago

What happened (and I still have to figure out why) is that you had 2 identical volumes (as in, they had the same unique identifier)

However, one of them was older. I think it might have picked an older snapshot. Why? I'll dig in


jake
EMPLOYEE

8 months ago

Yup


rsproule
PRO OP

8 months ago

would love a full post mortem here and some advice on protecting ourselves from such events in prod. thanks


jake
EMPLOYEE

8 months ago

Yup. We believe this occurs when you have backups and a migration

The backup fires, the migration grabs the new partial as gospel, and then migrates that
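
The race described above can be pictured with a toy model. This is only a sketch under stated assumptions, not Railway's actual code: the `Snapshot` type, its fields, and the `pick_newest` / `pick_newest_complete` helpers are all hypothetical names made up for illustration. It shows how a migration that trusts "whichever copy is newest" can pick up a backup that is still being written, and how filtering on completeness avoids that.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    volume_id: str    # hypothetical: both copies carry the same volume ID
    created_at: int   # hypothetical: logical creation order
    complete: bool    # hypothetical: False while the backup is still writing


def pick_newest(snapshots):
    """Racy choice: trust the newest copy, even if it is unfinished."""
    return max(snapshots, key=lambda s: s.created_at)


def pick_newest_complete(snapshots):
    """Safer choice: only consider copies whose backup has finished."""
    finished = [s for s in snapshots if s.complete]
    if not finished:
        raise RuntimeError("no complete snapshot to migrate from")
    return max(finished, key=lambda s: s.created_at)


live = Snapshot("vol-1", created_at=1, complete=True)
partial = Snapshot("vol-1", created_at=2, complete=False)  # backup mid-flight

racy = pick_newest([live, partial])           # picks the partial copy
safe = pick_newest_complete([live, partial])  # picks the finished copy
print(racy.complete, safe.complete)  # False True
```

In this model the racy picker returns the partial copy simply because it is newer, which matches the "grabs the new partial as gospel" description. Whether the real fix looks anything like the completeness filter is not stated in this thread.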


jake
EMPLOYEE

8 months ago

Will get confirmation for this today


jake
EMPLOYEE

8 months ago

!remind me to follow up in 4 hours


jake
EMPLOYEE

8 months ago

(Latest)


passos
MODERATOR

8 months ago

just a heads up that he left the server


rsproule
PRO OP

7 months ago

im back


rsproule
PRO OP

7 months ago

this is still pretty insane to us that we had a volume just bricked. would like some further explanation and some credits


jake
EMPLOYEE

7 months ago

Oh, interesting. I didn't follow up


jake
EMPLOYEE

7 months ago

This is 100% what happened here. We patched it right after you reported it. It's definitely obscene: a 1:1B+ collision race case


jake
EMPLOYEE

7 months ago

Went ahead and applied some credits to your account for the issue!

