9 months ago
We saw a Railway automatic deployment of 2 Postgres instances in our environment around 50 minutes ago that completely wiped the volumes. Our service immediately started alerting us with errors that table_xyz does not exist.
This was our staging environment, so only our dev teams were impacted, but we are very concerned about this happening in production.
Project ID: 979d8e07-4c99-427d-a247-04265f375550
Environment ID: c39c1bd0-9903-48d6-a25f-1f1d02a5eda7
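For context, this is the kind of check that surfaced the errors — a rough sketch, not our actual service code. It assumes psycopg (v3) and a DATABASE_URL env var, and table_xyz is a placeholder:

```python
# Hypothetical sketch: a schema sanity check that alerts when expected
# tables go missing. psycopg, DATABASE_URL, and the table name are assumptions.
import os
import sys
import psycopg

EXPECTED_TABLES = {"table_xyz"}  # placeholder, matching the error above

def missing_tables(conn_str: str) -> set[str]:
    with psycopg.connect(conn_str) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT tablename FROM pg_catalog.pg_tables "
                "WHERE schemaname = 'public'"
            )
            present = {row[0] for row in cur.fetchall()}
    return EXPECTED_TABLES - present

if __name__ == "__main__":
    gone = missing_tables(os.environ["DATABASE_URL"])
    if gone:
        # Wire this into real alerting instead of just exiting.
        print(f"ALERT: expected tables missing: {sorted(gone)}", file=sys.stderr)
        sys.exit(1)
```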
I'd like to get some sense of what went on here so I know how to prevent this in prod.
9 months ago
Hey, is your database still wiped?
9 months ago
And yes, unfortunately this was Railway's fault.
We've identified that a component in our deployment infrastructure became unresponsive, which caused your deployment to hang before eventually failing. We sincerely apologize for the frustration this caused.
We've now implemented additional monitoring to detect this type of issue immediately, so our team can resolve it much faster if it happens again.
9 months ago
Do you want to recover it? I can raise it to the team.
Yes, is that possible? It's unclear to me how a hang in the deployment causes a disk wipe. Is there more I should be doing in our production environment to protect against this? This would have been very costly to us in production.
9 months ago
I'll raise this to the team; unfortunately I don't know the answers to most of your questions.
cc @Brody
9 months ago
Hello,
The issue the two other users faced would not cause any data loss.
9 months ago
Oh my bad then, ignore what I said.
9 months ago
Should I be looking at tax-postgres?
We did not see a hung deployment. The deployments were successful, but all the data was gone.
9 months ago
Looks like ponder-db was last deployed around 2 months ago? Should I be looking at another environment? I am looking at staging right now.
9 months ago
The team is looking into this now.
9 months ago
Would you mind checking now?
9 months ago
I believe it's been re-initialized correctly (happy to cover what happened and what we'll do to fix it in a sec).
Yes, looks like the data has been restored. Reminder to encrypt at rest, seeing you guys poke around in the table 😂
Looking forward to a full diagnostic and some advice on protecting our production services from issues like this.
9 months ago
Everything is encrypted at rest on our side!
9 months ago
Yes. We have a BAA for enterprise customers that makes it so anyone internal needs your consent first.
But, in this case, I actually just went off metrics
9 months ago
What happened (and I still have to figure out why) is that you had 2 identical volumes; as in, they had the same unique identifier.
However, one of them was older. I think it might have picked an older snapshot. Why? I'll dig in.
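In the meantime, if you want a safety net that doesn't depend on our volumes at all, a scheduled pg_dump shipped off-platform works. A minimal sketch, assuming pg_dump is installed and DATABASE_URL is set (the output path is a placeholder):

```python
# Hypothetical sketch: off-platform logical backup via pg_dump.
# Assumes pg_dump is on PATH and DATABASE_URL is set; /backups is a placeholder.
import os
import subprocess
import datetime

def dump_database(out_dir: str = "/backups") -> str:
    ts = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out_path = os.path.join(out_dir, f"db-{ts}.dump")
    subprocess.run(
        [
            "pg_dump",
            "--format=custom",    # compressed, restorable with pg_restore
            f"--file={out_path}",
            os.environ["DATABASE_URL"],
        ],
        check=True,
    )
    return out_path

if __name__ == "__main__":
    # Run on a cron schedule and ship the dump to object storage, so a
    # platform-side volume issue can never take the only copy.
    print("wrote", dump_database())
```

Restoring is then just pg_restore against a fresh database; the custom format keeps that cheap.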
9 months ago
Yup
Would love a full post-mortem here and some advice on protecting ourselves from events like this in prod. Thanks.
9 months ago
Yup. We believe this occurs when you have a backup and a migration running at the same time:
The backup fires, the migration grabs the new partial snapshot as gospel, and then migrates that (rough illustration below).
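Roughly the shape we suspect, as a toy illustration (made-up names, not our actual snapshot code):

```python
# Toy illustration of the suspected race: a backup starts writing a new
# snapshot, and a concurrent migration picks "the newest snapshot" while
# it is still half-written, treating the partial copy as the source of truth.
import threading
import time

snapshots = [{"id": 1, "complete": True, "data": "full copy"}]

def backup():
    partial = {"id": 2, "complete": False, "data": "partial copy"}
    snapshots.append(partial)   # snapshot becomes visible before it's done
    time.sleep(0.1)             # still copying...
    partial["data"] = "full copy"
    partial["complete"] = True

def migrate():
    newest = max(snapshots, key=lambda s: s["id"])
    # Bug: no check that the snapshot is complete before using it.
    print("migrating from:", newest["data"])

t = threading.Thread(target=backup)
t.start()
time.sleep(0.05)                # migration fires mid-backup
migrate()                       # -> "migrating from: partial copy"
t.join()
```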
9 months ago
Will get confirmation for this today
9 months ago
!remind me to follow up in 4 hours
9 months ago
9 months ago
Just a heads up that he left the server.
It's still pretty insane to us that a volume was just bricked. We'd like some further explanation, and some credits.
8 months ago
Oh, interesting. I didn't follow up
8 months ago
This is 100% what happened here. We patched it right after you reported it. It's definitely obscene; a one-in-a-billion-plus collision race case.
8 months ago
Went ahead and applied some credits to your account for the issue!
