Regaining trust in railway after GCP outage
shortcircuit3
PRO · OP

24 days ago

Hey all, I'm the founder at a small startup.

I recently moved to Railway from Kubernetes because of the management overhead, and Railway's product is very nice and offers exactly what we need. I personally want to stay with Railway, but I'm having a hard time judging whether there will be four-hour-plus outages again.

Luckily, this isn't our company's busy season. But if an outage like this were to happen again during the busy season, I would need to figure out some other solution.

How will Railway ensure that something like this doesn't happen again? Do they have a plan in place to prevent downtime like this? As app builders, how are you thinking about it and what are you doing?

Thanks again. I really do love Railway's product, and I want them to succeed. I simply can't afford downtime like that moving forward. :(

3 Replies

benpoieszhv
PRO

24 days ago

I thought about this for a little bit as well -- but honestly the onus to prevent something like this is on us...

I know that sounds super lame, but ultimately dependency on a single service provider is a single point of failure. For Railway, GCP disabling their account, if that's really what happened, is pretty crazy and not something they could plan around, other than to have a complete secondary hot instance of AWS or Azure for their entire backend that they can hot cut over to in the event of an outage with GCP. That is entirely unrealistic for a bunch of reasons IMO.

The reason I say it falls to us is that we then need to make sure we aren't selecting single points of failure ourselves, and that we are able to hot cut over to other providers. But in many cases we too will say that's prohibitively expensive and complicated.

In the scheme of things, and as someone who also ran a large infra team, they did a pretty good job all things considered, so I'm not planning on going anywhere (unless they make a habit out of it and the excuses start to get pretty tenuous).


shortcircuit3
PRO · OP

24 days ago

"Other than have a complete secondary hot instance of AWS or Azure for their entire backend that they can hot cut over to in the event of an outage with GCP"

Totally. So are you saying every app should have hot instances at other places as well? I know it's likely on a smaller scale than Railway, but isn't that the same thing?

I agree, though. There's not really a way unless you have a fallback host or something like that. But for other reasons, that's problematic. I don't want to have to double my infra costs if not necessary.

What are you doing to ensure you have a fallback in case there's downtime like this again?


Anonymous
HOBBY

23 days ago

Few services seem to exist that are not built completely on one platform. I can work on Azure and use models from OpenAI (and formerly Anthropic), Mistral, etc. But I can't find many bare-metal DBs, hosted DBs, or services like Railway that don't depend on a single cloud provider. All of these have been having more outages lately, on top of crazy blocking situations like the one that happened with GCP.

That being said, for GCP to block the entire platform in an automated way puts the onus on them! There was no warning, no "cease and desist" action with a required mitigating step, and no grace period; an automated switch was simply turned off. That very day I was talking about a potential switch from Azure to GCP, but this situation with Railway has led me to postpone such a decision for at least 1-1.5 years. I can't figure out precisely why AWS us-east seems to go down more often than other regions, but it does. Other services (Anthropic's Claude Code, GitHub services, and some other things) have gotten flakier. We are all having to inspect our provider and tool choices more deeply to be more resilient. It is tough, and I credit Railway for clearly communicating all they did to mitigate a catastrophic situation. I would suggest legal action, since Railway bears reputational risk because of this, and customers who rely heavily on Railway should join it. This is no way to run any service, much less a cloud platform expected to be highly reliable.

