Meta: Outage Reporting Improvements

5 months ago

Hi everyone,

This is a thread we at Railway wish we didn't have to make, but I think it's worth making because the experience as of late hasn't been up to our standards, let alone yours.

In the past, when Railway was significantly smaller, outages were very binary. Either all of Railway was up or all of it was down, and it was easier to know the state of the system because, well, everyone was affected.

Today, our builds, network, and compute are distributed across multiple regions, multiple AZs, and multiple points of connection. This is good because it introduces resiliency, and as such, any negative impacts are (relatively) isolated from other events.

The drawbacks begin when a (large) subset of reports comes in, which makes it difficult to differentiate reports about our system from reports about user workloads. For context, the support team handles 2,000+ threads of varying urgency across this platform. This was painful during the week of GitHub throttling, when it was unclear which side the issue was on. (Sorry)

Usually the support team then responds in Slack for the Enterprise set of customers, which is fine if you have that relationship but not good if we leave our forums unattended when there is genuine customer impact.

So first, we want to say how sorry we are about the delayed time it has taken for Railway to acknowledge issues on here and Discord. We promise we aren't running around during a fire. ...and second, we want you to know that we are working on improving reporting and investigations on our side.

It's not yet ready, but Chandrika (Support Engineer) has been tuning a classifier model to notice reporting patterns, which will then page us the same way an Enterprise customer does, so we can give everyone here as many real-time updates as possible as we look into issues and triage them.
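As a rough illustration of the kind of reporting pattern that could trigger a page, here is a minimal sketch. It is a plain sliding-window count of distinct reporters, not the classifier model mentioned above, and every name and threshold in it (`ReportSpikeDetector`, `window`, `min_distinct_users`) is hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta


class ReportSpikeDetector:
    """Flags a possible incident when enough distinct users report within a window.

    Hypothetical sketch: a simple heuristic, not a trained classifier.
    """

    def __init__(self, window: timedelta, min_distinct_users: int):
        self.window = window
        self.min_distinct_users = min_distinct_users
        self._reports = deque()  # (timestamp, user_id), oldest first

    def add_report(self, user_id: str, at: datetime) -> bool:
        """Record one report; return True if the on-call should be paged."""
        self._reports.append((at, user_id))
        # Drop reports that have fallen out of the time window.
        cutoff = at - self.window
        while self._reports and self._reports[0][0] < cutoff:
            self._reports.popleft()
        # Count distinct users so one person posting repeatedly cannot page alone.
        distinct_users = {uid for _, uid in self._reports}
        return len(distinct_users) >= self.min_distinct_users
```

Counting distinct users rather than raw reports is one way to separate a single noisy workload from something affecting many customers; the real thresholds would have to be tuned against actual report volume.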

That said, we're gluttons for feedback, so we want to hear what the preferred experience would be. (Ofc, ideally us not having impact in the first place, which the Platform team is working hard on.)

40 Replies

athsport

HOBBY

5 months ago

Thank you, it might be a good idea to allow reporting without logging in. We couldn't even log in last week to report the issue.

hetacharya12

PRO

5 months ago

I 100% agree. I host a small website, and engagement on it is extremely time-sensitive. My website went down today, and I couldn't even log in to get logs from Railway. The status website showed everything was operational. This is the second time in 15 days that I've been affected. Not good.

sojs-coder

PRO

5 months ago

Agreed.

isaiahbuilds

PRO

5 months ago

Maybe put Railway on Downdetector so that if you are down, we have a place to report it that is not down lol

eluchsinger

PRO

5 months ago

Downtime is understandable up to a certain degree - even though as an infrastructure provider, you're essentially selling uptime. Shit happens sometimes.

However, I do expect the status dashboard to be truthful as it's the first place I look if something seems off. If you don't update the status page, it makes the incident twice as bad, because it may kick off whole different processes on our side. I can't stress enough how important that page is.

Anonymous

FREE

5 months ago

100% agreed

joshuadutton

PRO

5 months ago

Thank you for the post and the transparency!

Feedback:

Whatever is happening now should be at the top of the status page. Dashboard is always at the top, but Edge Network should be up there right now because it has the most recent incidents.

I'm also frustrated that the statuses keep saying "resolved after one hour" when the same problem is showing up on multiple consecutive days. You obviously didn't resolve timeouts, DNS issues, or 503 errors the first day that you reported they were solved. Given that some of these errors were still happening early this morning, I still don't feel certain they are resolved. I am expecting another post ASAP outlining the problems, the solutions, and how you know they aren't going to continue happening. With everything I've seen from you so far, I would be surprised if this doesn't happen. Cloudflare also does an excellent job with this.

Attachments

image.png

goleary

PRO

4 months ago

Something that's confusing to me: I see this message that blames current degradation on an upstream provider, but provides no context on why that's assumed to be the case. GitHub's status page reports all green, and I'm having no issues with the GitHub Actions that I'm using directly.

It's opaque (and feels disingenuous) to just say "upstream provider issue" without showing why that's believed to be the case.

Attachments

image.png

1beb

PRO

4 months ago

It happened again today. Got a "max connections reached" from the Railway frontend, my own applications were down and could connect to station.railway.com. The main issue from my standpoint is not the downtime itself - it's that status.railway.com has no mention at all of the downtime.

hetacharya12

PRO

4 months ago

Yes, happened again today, with users on my site complaining. I'll have to port out. Software issues happen, but there should be some accountability. This looks like a company that fudges its data to raise money; there is absolutely no way the status dashboard should show no issues in the US East region.

angelo-railway

EMPLOYEE OP

4 months ago

Update for everyone here: @noah will be working on a new status page as well, so we can tighten the reporting loop.

Appreciate everyone's patience.

aleix10kst

HOBBY

4 months ago

Hi,

Not sure if this is the place to post it, but I'm having issues with my deployments. I had a failed deployment yesterday (because of code), but today, when I fixed it and tried deploying again, I saw lots of errors saying either that my services are misconfigured (missing root directory) or that it can't find the start command. Not sure if something has changed in terms of configuration, but I'm pretty positive it's not on my end.

richmeij

PRO

4 months ago

We are seeing the exact same issue as aleix10kst: deployments/builds are not triggered because of a (seemingly) internal misconfiguration. We see errors/suggestions on root paths, but Railway also seems to be trying to run a build/start command from a different service. Case in point: https://railway.com/project/a72849bb-5f98-4c81-82e9-1851653df64e/service/b6bb7119-0c9e-4e5f-92e7-43586ce3cb60?environmentId=684e5b85-b025-4654-a4dc-eee0869d844f

richmeij

PRO

4 months ago

I guess this was related to the whole dockerfile issue: https://status.railway.com/cmmn957fk00a512dzsoks6v40

Tried to redeploy a failed deployment, and it now works correctly.

zicklag

PRO

4 months ago

I just wanted to mention that I appreciate the status page, and it was great to be able to have it send notifications to the Discord #alerts channel I already had set up for stuff like this. ❤

angelo-railway

EMPLOYEE OP

3 months ago

Update for everyone here, we just rolled out a new page:

https://status.railway.com - we also have more to report on general reliability work that we have planned. However, this should end the "gaslighting" feedback that was mentioned since it took us too long to call said incidents.

More to come here.

rati567

PRO

3 months ago

I have upgraded to pro and my deployment is still queued!

krystian-dev548

FREE

3 months ago

Too many incidents today, and the problem is not the incidents themselves but their duration. We can't be blocked by Railway for HOURS!!

khalilzr

PRO

3 months ago

I've been a paid Railway user for a long time and have recommended it to multiple people, while using it for a lot of my projects. But I think I've reached my limit.

Over the past month alone:

- Multiple downtimes affecting production services

- Deployments regularly taking 20+ minutes

- Wrong availability status

This is a production environment. Real users are hitting my API. When deploys take 20 minutes or services go down with no warning, that directly impacts my business and my users' trust.

I get that the team is shipping fast with new features every week; it's impressive from a product velocity standpoint. But what's the point of new features if the core platform isn't reliable? I'd genuinely rather have zero new features for the next 3 months and rock-solid uptime and fast deploys instead. I will give you my money without hesitation.

I don't want to migrate. The DX here is the best I've used. But reliability isn't optional for production; it's the bare minimum. And right now I'm actively evaluating alternatives because the downgrade in the service is just becoming frustrating...

Please prioritize stability over feature releases. I think a lot of users here would agree.

Anonymous

PRO

3 months ago

I just ran into this issue today. It's crazy they have no way to contact them when their login system is down. I'm just so thankful I did not switch everything over here at once so now I only have to migrate out one database I brought over here. Not fun to do another migration, but at least it's just one! Whew!

Anonymous

PRO

3 months ago

Just realizing this thread is from 2 months ago. This just happened to me today. I'm embarrassed I did not do better research on this company before migrating. I guess the skills required to raise $124 million are not the same skills needed to create a basic support ticket and reporting system independent of the outages that need to be reported. 😕

abdussamadbello

PRO

3 months ago

This same scenario happened again yesterday. I think you need to focus more on support, service reliability, incident detection, and proper monitoring, especially of third-party dependencies such as the GitHub container repository and CDN, or consider moving your repository from GitHub to your own.

isaiahbuilds

PRO

2 months ago

Same boat. I'm getting tired of stuff just randomly breaking. Any alternatives you've found? I reallyyyy don't want to go back to AWS or Azure. I've used both, and they are rock solid, but a huge pita.

ferdinand-soto

FREE

2 months ago

Just so you know, I will be switching platforms. If you cannot handle the current business, then scale back. But you are unreliable and it is not acceptable.

shiho26miyano

PRO

2 months ago

Expecting alerts as soon as the first investigation case is confirmed, and expecting a mobile monitoring app!

injung

PRO

2 months ago

I completely agree with this.

We chose Railway because:

- the DX is excellent

- it allowed us to move fast early on

- and we wanted to support the platform

But at this point, reliability is becoming the main blocker.

We started using Railway earlier this year, and have already experienced multiple major incidents (Feb 12, Mar 25, Mar 31, May 4). During these incidents, there's often little we can do other than report them via Discord or the community — and responses can take hours.

I don't mind paying a bit more if the platform is stable. However, with this level of incident frequency, it's becoming very difficult to trust Railway for production workloads. We're running a real business with real customers, and these incidents have a direct impact, on the order of $10K/day in losses. This level of instability would simply not be acceptable in environments like AWS.

Another concern is how the Enterprise plan is positioned as the solution. From what we've been told, it requires a minimum spend of around $5,000/month, which is more than 10x our current cost. At that point, it becomes hard to justify staying on Railway at all. We would likely accept the loss in DX and migrate to AWS for more predictable performance and reliability. Honestly, before we even reach a scale where an Enterprise contract makes sense, these repeated reliability issues may push us to leave.

Also +1 on your point:

I'd genuinely rather have zero new features for the next 3 months and rock-solid uptime and fast deploys instead.

dimare

HOBBY

2 months ago

"I'd genuinely rather have zero new features for the next 3 months and rock-solid uptime and fast deploys instead." <-- THIS.

dimare

HOBBY

2 months ago

Time to go back to basics and make sure your actual value props are secure:

- quick, simple deploys

- uptime

Nothing else matters if you can't do that consistently well. Such an opportunity ahead of you but you gotta make sure the foundation is built like a tank.

Other notable needed improvements:

- There's no user reporting capability.

- I've been waiting 30+ mins (and counting) for a deploy that previously took ~3 mins. Do you not have a log detection system to sense anomalies, flag them to a user, and allow them to report it if it seems wonky? (Would certainly help build trust with your users)

- Your system status says you're fully functional. but things are not working. <-- get this corrected asap. it's just a terrible DX.
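To make the anomaly-detection ask above concrete, here is a minimal sketch of one way a slow deploy could be flagged. It is purely illustrative and does not describe anything Railway runs; the function name, the 3x factor, and the minimum history length are all assumptions.

```python
from statistics import median


def is_deploy_anomalous(history_secs, current_secs, factor=3.0, min_history=5):
    """Return True if a running deploy has taken far longer than usual.

    history_secs: durations of recent successful deploys for the same service.
    current_secs: how long the current deploy has been running.
    factor and min_history are hypothetical tuning knobs.
    """
    # With too little history there is no reliable baseline, so stay quiet.
    if len(history_secs) < min_history:
        return False
    # The median is less sensitive than the mean to one unusually slow deploy.
    return current_secs > factor * median(history_secs)


# A service that normally deploys in ~3 minutes, now stuck at 30 minutes.
recent = [170, 180, 175, 190, 185]
print(is_deploy_anomalous(recent, 30 * 60))
```

A check like this could surface a "this deploy is slower than usual, report it?" prompt to the user instead of leaving them to guess.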

angelo-railway

EMPLOYEE OP

2 months ago

Heard all. I wanna push on the whole "no new features" note because the source of the reliability issues isn't a new feature rollout, rather, scaling compute.

Sorta pertinent with the xAI news you may have heard, but the challenge here is making sure that we can deal with the 13,000+ new users a day while keeping existing users like you all happy. You may have seen us getting better over the last 3 months, though it's not perfect; the improvement is thankfully due to the firebreaks that we have in place to taper growth and keep the service quality high.

Keep the feedback coming, we read it all.

hetacharya12

PRO

2 months ago

The frustration here, Angelo, isn't that you're making renovations to the ship; it's that there's a big hole that needs immediate attention, and when users point it out, the quality and frequency of your response is less than poor.

Now what we see as users every week is that you boast about new things built, while at the same time things that would work a week before for us would suddenly start failing.

I hope you understand, being a platform that sells scaling and reliability, boasting about 13000 new users joining affecting that reliability isn't sitting well.

angelo-railway

EMPLOYEE OP

2 months ago

Not intended to boast at all. More so, every platform is getting nailed on uptime, and we own ours, but I think people aren't sensitive to the strain that every single system on the internet is under right now. We've made some improvements, and we have more work to do.

The improved reporting and comms are already in place; the root OP was about the feeling of being gaslit when we deal with an incident, and that's now handled. The next step for us is to secure the compute so we can serve demand, simple as. The number of new users isn't a celebration, trust me, if it means that your uptime gets nailed.

hetacharya12

PRO

2 months ago

Appreciate the clarification. Hoping for further reporting improvements.

dimare

HOBBY

2 months ago

Angelo… completely respect the strain the internet is under right now. It’s wild. The ones that know, know.

However, there’s a messaging problem that I think some of us are trying to communicate to you.

If the issue with reliability is securing compute to scale horizontally… adding 13k NEW users/day does not, in any way, help keep your existing (likely/hopefully growing) customers happy as the system becomes more and more unstable.

I think what some of us are hoping to hear from you is a first-principles approach to solving the ever-growing issue (now much more than just reporting downtime and communicating that well).

Possibly… Railway needs to stop adding users + load until you stabilize the available compute for the loads your system is currently handling.

Then, gradually onboard new users based on newly available compute and load capacity.

We all know the compute problem is going nowhere, anytime soon.

So - might as well navigate this only-ongoing/growing issue with as solid a reputation as possible (especially important for infrastructure, as you know) for your earlier, paying customer base, rather than piss so many MORE people off as they onboard into an ever-more-unstable system and, possibly, lose the race entirely.

Just my 2 cents, really don’t mean to offend.

But I think some of us are saying you have to take a much more serious stance to attack the problem, or you risk losing many of us.

angelo-railway

EMPLOYEE OP

2 months ago

Not offended at all, and appreciate you sharing it. I feel the same way, but on the flip side, if we go: "Signups closed sorry" that's a whole other can of worms.

Heard on taking a much more serious stance to attack the problem. I have some other todos to handle, but we can share what we've done and what we plan to do other than handwaving.

dimare

HOBBY

2 months ago

People love to hate the love they feel for a waitlist 😉 wild, especially once it’s an established, well-known product.

To boot (my opinion), clearly explaining to new users “we roll people on when we have compute available, so you never have downtime, never a bad experience” is a really powerful, beautiful, and elevating message from Railway.

Either way, I really love what you’ve built. Timing is 💯. Don’t want to see you miss the window

k-cornererp

PRO

2 months ago

I want to say thanks to everyone on the Railway team for creating such a great platform designed for ease of use. You guys have helped me generate more revenue from my Shopify store with my custom internal apps to manage my customers, and I don't know anything about hosting or coding. Best part is connecting the CLI to my opus. Keep it up!

samkotlove

PRO

2 months ago

To add a note that seems to have been lost in recent messages, the first step has to be communication. There's an ongoing issue right now with builds for at least an hour and multiple threads being posted, but no mention anywhere from Railway within these threads or on the status page.

It's one thing to be having issues, but it's another to not acknowledge them.

abdussamadbello

PRO

2 months ago

It’s been three months since this thread started, and there’s still no improvement in the service’s reliability or uptime. I run an e-invoicing SaaS, and every downtime means losing both money and customers. I’m seriously considering moving critical workloads to another platform with a strict SLA.

angelo-railway

EMPLOYEE OP

2 months ago

Not to come off as defensive, but when we call an outage, it's usually for a platform-wide problem. Builds, although important, aren't something that is make or break for uptime. It's annoying, yes, but those issues are usually debugged at the machine level or the code level (esp. with all of the supply chain attacks we've been mitigating).

Over the last three months, the severity of outages has decreased; now the focus on our side is to have fewer of them. Heard on the communication front: we are going to try our best to make it more obvious when you are affected by an issue, even if it's not platform-wide.

samkotlove

PRO

2 months ago

While I would disagree with you that builds aren't make or break for uptime (imagine I had a bad image up and couldn't deploy a fix), especially since they're core to the Railway product, the problem I'm pointing out is that the status page is set up to acknowledge events like these. Not showing degraded performance today is exactly the type of communication gap that people are complaining about.

Attachments

image.png
