a month ago
Hi everyone,
This is a thread we at Railway wish we didn't have to make, but I think it's worth making because the experience as of late hasn't been up to our standards, let alone yours.
In the past, when Railway was significantly smaller, outages were very binary. Either all of Railway was up or down, and it was easier to know the state of the system because, well, everyone was affected.
Today, our builds, network, and compute are distributed across multiple regions, multiple AZs, and multiple points of connection. This is good because it introduces resiliency, and as such, any negative impacts are (relatively) isolated from other events.
The drawback is that when a (large) subset of reports comes in, it becomes difficult to tell whether they stem from our system or from user workloads. For context, the support team handles 2,000+ threads of varying urgency across this platform. This was painful during the week of GitHub throttling, when it was unclear which side the issue was on. (Sorry)
Usually the support team then responds in Slack for the Enterprise set of customers, which is fine if you have that relationship but not good if we leave our forums unattended when there is genuine customer impact.
So first, we want to say how sorry we are about the time it has taken Railway to ack issues on here and Discord. We promise we aren't running around during a fire. Second, we want you to know that we are working on improving reporting and investigations on our side.
It's not yet ready, but Chandrika (Support Engineer) has been tuning a classifier model to notice reporting patterns, which will then page up the same way an Enterprise customer does, so we can give everyone here as many real-time updates as possible as we look into issues and triage them.
That said, we're gluttons for feedback, so we want to hear what the preferred experience would be. (Other than us not having impact at all, of course, which the Platform team is working hard on.)
15 Replies
a month ago
Thank you. It might be a good idea to allow reporting without logging in. We couldn't even log in last week to report the issue.
a month ago
I 100% agree. I host a small website, and engagement on it is extremely time sensitive. My website went down today, and I couldn't even log in to get logs from Railway. The status website showed everything was operational. This is twice in 15 days that I have been affected. Not good.
25 days ago
Agreed.
25 days ago
Maybe put Railway on Downdetector, so that if you are down, we have a place to report it that is not down lol
25 days ago
Downtime is understandable up to a certain degree - even though as an infrastructure provider, you're essentially selling uptime. Shit happens sometimes.
However, I do expect the status dashboard to be truthful as it's the first place I look if something seems off. If you don't update the status page, it makes the incident twice as bad, because it may kick off whole different processes on our side. I can't stress enough how important that page is.
23 days ago
100% agreed
23 days ago
Thank you for the post and the transparency!
Feedback:
Whatever is happening now should be at the top of the status page. Dashboard is always at the top, but Edge Network should be up there right now because it has the most recent incidents.
I'm also frustrated that the statuses keep saying "resolved after one hour" when the same problem is showing up on multiple consecutive days. You obviously didn't resolve timeouts, DNS issues, or 503 errors the first day that you reported they were solved. Given that some of these errors were still happening early this morning, I still don't feel certain they are resolved. I am expecting another post ASAP outlining the problems, the solutions, and how you know they aren't going to continue happening. With everything I've seen from you so far, I would be surprised if this doesn't happen. Cloudflare also does an excellent job with this.
18 days ago
Something that's confusing to me: I see this message that blames current degradation on an upstream provider, but provides no context on why that's assumed to be the case. GitHub status reports all green, and I'm having no issues with the GitHub Actions that I'm using directly.
It's opaque (and feels disingenuous) to just say "upstream provider issue" without showing why that's believed to be the case.
14 days ago
It happened again today. Got a "max connections reached" error from the Railway frontend, my own applications were down and could connect to station.railway.com. The main issue from my standpoint is not the downtime itself; it's that status.railway.com has no mention at all of the downtime.
14 days ago
Yes, it happened again today, and users on my site are complaining. I have to port out. Software issues happen, but there should be some accountability. This looks like a company that fudges their data to raise money. There is absolutely no way the status dashboard should show no issues in the US East region.
5 days ago
Update for everyone here: @noah will be working on a new status page as well, so we can get the reporting loop to be tighter.
Appreciate everyone's patience.
4 days ago
Hi,
Not sure if this is the place to post it, but I'm having issues with my deployments. I had a failed deployment yesterday (because of code), but today, when I fixed it and tried deploying again, I saw lots of errors saying either that my services are misconfigured (missing root directory) or that it can't find the start command. Not sure if something has changed in terms of configuration, but I'm pretty positive it's not on my end.
4 days ago
We are seeing the exact same issue as aleix10kst: deployments/builds are not triggered because of a (seemingly) internal misconfiguration. We see errors/suggestions on root paths, but Railway also seems to be trying to run a build/start command from a different service. Case in point: https://railway.com/project/a72849bb-5f98-4c81-82e9-1851653df64e/service/b6bb7119-0c9e-4e5f-92e7-43586ce3cb60?environmentId=684e5b85-b025-4654-a4dc-eee0869d844f
4 days ago
I guess this was related to the whole Dockerfile issue: https://status.railway.com/cmmn957fk00a512dzsoks6v40
Tried to redeploy a failed deployment, and it now works correctly.
3 days ago
I just wanted to mention that I appreciate the status page, and it was great to be able to have it send notifications to the Discord #alerts channel I already had set up for stuff like this.


