Meta: Outage Reporting Improvements

Hi everyone,

This is a thread we at Railway wish we didn't have to make, but I think it's worth making because the experience as of late hasn't been up to our standards, let alone yours.

In the past, when Railway was significantly smaller, outages were very binary. Either all of Railway was up or all of it was down, and it was easier to know the state of the system because, well, everyone was affected.

Today, our builds, network, and compute are distributed across multiple regions, multiple AZs, and multiple points of connection. This is good because it introduces resiliency, and as such, any negative impacts are (relatively) isolated from other events.

The drawback shows up when a (large) subset of reports comes in at once, which makes it difficult to tell whether the cause is our system or user workloads. For context, the support team handles 2,000+ threads of varying urgency across this platform. This was painful during the week of GitHub throttling, when it was unclear which side the issue was on. (Sorry)

Usually the support team then responds in Slack for the Enterprise set of customers, which is fine if you have that relationship but not good if we leave our forums unattended when there is genuine customer impact.

So first, we want to say how sorry we are about how long it has taken Railway to acknowledge issues on here and Discord. We promise we aren't running around during a fire. And second, we want you to know that we are working on improving reporting and investigations on our side.

It's not yet ready, but Chandrika (Support Engineer) has been tuning a classifier model to notice reporting patterns and then page the team the same way an Enterprise customer report does, so we can give everyone here as many real-time updates as possible while we look into issues and triage them.
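
To make the paging idea concrete, here is a minimal sketch. It is not the classifier described above; it swaps in a much simpler rule (page when enough reports arrive inside a sliding time window) purely for illustration. The class name, `window_seconds`, and `threshold` are all hypothetical and do not reflect anything Railway actually runs.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ReportSpikeDetector:
    """Signals a page when too many reports land inside a sliding window.

    Hypothetical stand-in for the classifier: window_seconds and threshold
    are made-up tuning knobs, not real production values.
    """

    window_seconds: float = 600.0
    threshold: int = 5
    _timestamps: deque = field(default_factory=deque)

    def record(self, timestamp: float) -> bool:
        """Record one report; return True if the window now warrants a page.

        Assumes timestamps (in seconds) arrive in non-decreasing order.
        """
        self._timestamps.append(timestamp)
        # Drop reports that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold
```

A real classifier would also look at what each report says, not just how many arrive, so treat this only as the shape of the "notice a pattern, then page" loop.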

That said, we're gluttons for feedback, so we want to hear what the preferred experience would be. (Besides us not having impact in the first place, of course, which the Platform team is working hard on.)

15 Replies

athsport
HOBBY

a month ago

Thank you, it might be a good idea to allow reporting without logging in. We couldn't even log in last week to report the issue.


hetacharya12
PRO

a month ago

I 100% agree. I host a small website and engagement on it is extremely time sensitive. My website went down today, and I couldn't even log in to get logs from Railway. The status website showed everything was operational. This is twice in 15 days that I have been affected. Not good.


sojs-coder
PRO

25 days ago

Agreed.


isaiahbuilds
PRO

25 days ago

Maybe put Railway on Downdetector so that if you are down, we have a place to report it that is not down lol


25 days ago

Downtime is understandable up to a certain degree - even though as an infrastructure provider, you're essentially selling uptime. Shit happens sometimes.

However, I do expect the status dashboard to be truthful as it's the first place I look if something seems off. If you don't update the status page, it makes the incident twice as bad, because it may kick off whole different processes on our side. I can't stress enough how important that page is.


Anonymous
FREE

23 days ago

100% agreed


joshuadutton
PRO

23 days ago

Thank you for the post and the transparency!

Feedback:

Whatever is happening now should be at the top of the status page. Dashboard is always at the top, but Edge Network should be up there right now because it has the most recent incidents.

I'm also frustrated that the statuses keep saying "resolved after one hour" when the same problem is showing up on multiple consecutive days. You obviously didn't resolve timeouts, DNS issues, or 503 errors the first day that you reported they were solved. Given that some of these errors were still happening early this morning, I still don't feel certain they are resolved. I am expecting another post ASAP outlining the problems, the solutions, and how you know they aren't going to continue happening. With everything I've seen from you so far, I would be surprised if this doesn't happen. Cloudflare also does an excellent job with this.


18 days ago

Something that's confusing to me: I see this message that blames the current degradation on an upstream provider, but provides no context on why that's assumed to be the case. GitHub status reports all green, and I'm having no issues with the GitHub Actions that I'm using directly.
It's opaque (and feels disingenuous) to just say "upstream provider issue" without showing why that's believed to be the case.


1beb
PRO

14 days ago

It happened again today. Got a "max connections reached" error from the Railway frontend, my own applications were down and could connect to station.railway.com. The main issue from my standpoint is not the downtime itself; it's that status.railway.com has no mention at all of the downtime.


hetacharya12
PRO

14 days ago

Yes, happened again today, users on my site complaining. Have to port out. Software issues happen, but there should be some accountability. This looks like a company that fudges their data to raise money; there is absolutely no way the status dashboard should show no issues in the US East region.


Update for everyone here: @noah will be working on a new status page as well, so we can make the reporting loop tighter.

Appreciate everyone's patience.


aleix10kst
HOBBY

4 days ago

Hi,

Not sure if this is the place to post it, but I'm having issues with my deployments. I had a failed deployment yesterday (because of code), but today, when I fixed it and tried deploying again, I saw lots of issues saying either that my services are misconfigured (missing root directory) or that it can't find the start command. Not sure if something has changed in terms of configuration, but I'm pretty positive it's not on my end.


richmeij
PRO

4 days ago

We are seeing the exact same issue as aleix10kst: deployments/builds are not triggered because of a (seemingly) internal misconfiguration. We see errors/suggestions on root paths, but Railway also seems to be trying to run a build/start command from a different service. Case in point: https://railway.com/project/a72849bb-5f98-4c81-82e9-1851653df64e/service/b6bb7119-0c9e-4e5f-92e7-43586ce3cb60?environmentId=684e5b85-b025-4654-a4dc-eee0869d844f


richmeij
PRO

4 days ago

I guess this was related to the whole Dockerfile issue: https://status.railway.com/cmmn957fk00a512dzsoks6v40

Tried to redeploy a failed deployment, and it now works correctly.


3 days ago

I just wanted to mention that I appreciate the status page, and it was great to be able to have it send notifications to the Discord #alerts channel I already had set up for stuff like this. ❤️

