5 months ago
Hey there!
Last night, between roughly 22:00 and 23:00 CET, our app pretty much died for about 30-45 minutes. We frantically tried to diagnose and fix it, but couldn't figure out what was wrong. The symptoms were:
Consistent, nondescript "Fetch failed" errors whenever our frontend server (a Node application) requested data from our API server, while direct curl requests to the API worked as fast as usual and without issue
Zero logs whatsoever appearing for our API deployment, neither in "Deploy logs", nor "HTTP logs", even though both successful and failed requests were being made at the time
Our API is public, and our frontend server uses the API's public deployment URL to connect to it. We previously tried switching to "internal" networking for this connection, but ran into a bunch of finicky issues with it, so we ended up opting for the public URL. That's a topic for another time, though; for now, we're just trying to understand what happened. We initially assumed that something in our application itself was causing the problem, but a few things now point towards the issue perhaps having been something (network-related?) on Railway's side:
As I mentioned above, at the time of the issue the Railway dashboard was not displaying any HTTP or application logs at all for our API, even while we were actively making (successful and failed) requests
Queries to our API from our frontend server failed, while identical requests made from our local devices via CURL were successful
All metrics for all our services were green within the expected ranges
The issue eventually resolved itself without any action on our end
So my question is: Are you aware of any potential short-lived platform regressions around that time? Or maybe you've seen something similar happen and could help us diagnose what might be wrong? Because so far we're pretty stumped — the "Fetch failed" errors aren't exactly descriptive, and from what we can tell nothing we can control went off the rails.
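For reference, the failing call is essentially just a plain fetch from our SSR server to the API's public deployment URL, along these lines (a simplified sketch; the URL and route are placeholders). We've started logging error.cause as well, since Node's built-in fetch (undici) reports the underlying network error there rather than in the top-level "fetch failed" message:

```js
// Simplified sketch of the frontend → API call (URL and route are placeholders).
// Node 18+ fetch (undici) wraps network failures as `TypeError: fetch failed`
// and attaches the underlying error (e.g. a connect timeout) as `error.cause`.
const API_URL = process.env.API_URL; // the API's public Railway URL

async function fetchFromApi(path) {
  try {
    const res = await fetch(`${API_URL}${path}`);
    if (!res.ok) throw new Error(`API responded with ${res.status}`);
    return await res.json();
  } catch (err) {
    // `cause` holds the real reason (ECONNREFUSED, ETIMEDOUT, …) when fetch fails
    console.error('API request failed:', err.message, err.cause);
    throw err;
  }
}
```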
Thanks!
2 Replies
5 months ago
Hey, I had the exact same issue! I almost thought I was the only one. Most of my frontend services were unstable all day yesterday, with fetch failing for periods of 15-30 minutes, then working again for a brief while, then failing again. This repeated throughout the day.
The code for my apps hadn't been touched for months so I very much assume this problem is with Railway.
5 months ago
Phew, good to know we weren't the only ones! Just checked in with the team, and things sound consistent with your experience: we also saw some seemingly random failed requests throughout the day, even before the extended downtime described above hit us late at night. It would be great if someone could confirm that this was indeed an issue on Railway's side and, if so, shed some light on what happened, why it didn't appear on the status page, why it clearly didn't affect everyone, etc.
Whatever it is, it still seems to be happening today, albeit much less frequently. On a fresh PR-created environment, we again saw our app unable to communicate with the API deployment through its public URL, this time due to connection timeouts. Meanwhile, as before, direct requests from outside Railway to the API's public URL work fine. Just like yesterday, there are no failed requests in the API's error logs. A few hours later, when we re-deployed the app service, the problem disappeared.
In the screenshot, you can see the build step of our app failing; it needs to connect to the API (visible on the left) via its public URL in order to build GQL-related types. It retries this multiple times, so the API deployment has ample time to start up. In this case, the API had been up and ready for requests (according to its logs) since 11:52, yet connecting to it from within the app deployment still failed at 11:58. Something definitely seems fishy here 🤔
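For context, the retry logic in that build step is roughly the following (a rough sketch; the actual script, endpoint, and timings differ):

```js
// Rough sketch of our build-time retry loop (the actual script differs).
// Before generating GQL types, it checks that the API's public URL is
// reachable at all, so a slow API start-up alone shouldn't fail the build.
const API_URL = process.env.API_URL; // public URL of the API deployment

async function waitForApi(attempts = 10, delayMs = 15_000) {
  for (let i = 1; i <= attempts; i++) {
    try {
      const res = await fetch(API_URL); // any HTTP response means it's reachable
      console.log(`Attempt ${i}: API reachable (HTTP ${res.status})`);
      return;
    } catch (err) {
      // "fetch failed" / ETIMEDOUT lands here; the real cause is on err.cause
      console.warn(`Attempt ${i} failed: ${err.message}`, err.cause ?? '');
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error(`API not reachable after ${attempts} attempts`);
}
```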
Spotted another odd thing: Another one of our staging services suddenly died yesterday with another "fetch failed", while trying to call an external API. We've never seen this before in over 1 year of this service making a request once every 5 seconds. It all seems to point towards some issue where services within Railway intermittently appear to be unable to reach hosts outside of Railway?
Hey @Brody, very sorry to tag you directly, but it's been over 24 hours since I created this thread. We're getting a few errors on prod again today and are not at all sure what to do… it affects multiple, completely different applications we have deployed. Every now and then, fetches to external URLs, whether those are public URLs of other services on Railway or external APIs, will just fail with "Fetch failed"
5 months ago
Hello, there's nothing to indicate that there's any networking issue on our side.
Though you should check if you are on the legacy runtime.
Our app is down again now, with fetch failed errors for almost every request going to external APIs… 😵💫 We're at a loss. This is just normal POST requests to an external API which is up and works just fine. It literally never happened before three days or so ago, and it's not reproducible when running our production build containers locally. The only way out I see right now is an emergency migration away from Railway, tbh…
@Brody How can I check that? In the affected service settings I don't see anything related to legacy runtime so I assume we're not on the legacy runtime.
Right now, consistently again, our services cannot connect to this endpoint in the first screenshot, for example. The requests just time out every time from within the container.
Second screenshot is me making a request to the same endpoint from my local machine and it working fine. At this point I don't see any other explanation other than some kind of networking issue…?
So, to be clear: the exact same container, built with the same environment. Running locally, it connects to external APIs just fine, no problem. Built & deployed on Railway, it fails with "Fetch failed" for every request to an external URL.
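In case anyone else wants to narrow this down, a throwaway check along these lines (hostname and port are placeholders) should show whether it's DNS resolution or the TCP connect itself that fails, and for which address family:

```js
// Throwaway diagnostic (hostname/port are placeholders; run as an ES module):
// resolve the target's A and AAAA records, then try a raw TCP connect to each
// address with a short timeout, to see whether DNS or the connect (and which
// IP family) is the thing that fails.
import { promises as dns } from 'node:dns';
import net from 'node:net';

const HOST = 'api.example.com'; // placeholder for one of the failing endpoints
const PORT = 443;

function tryConnect(host, port, timeoutMs = 3000) {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port, timeout: timeoutMs });
    socket.once('connect', () => { socket.destroy(); resolve('ok'); });
    socket.once('timeout', () => { socket.destroy(); resolve('timeout'); });
    socket.once('error', (err) => resolve(err.code ?? err.message));
  });
}

const v4 = await dns.resolve4(HOST).catch((err) => err.code);
const v6 = await dns.resolve6(HOST).catch((err) => err.code);
console.log({ v4, v6 });

for (const addr of [v4, v6].flat().filter((a) => typeof a === 'string' && net.isIP(a))) {
  console.log(addr, await tryConnect(addr, PORT));
}
```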
https://github.com/nodejs/undici/issues/2990#issuecomment-2408883876
☝️ Found this — it suggests increasing --network-family-autoselection-attempt-timeout in NODE_OPTIONS. This is honestly a bit over my head, but we'll give it a try.
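For anyone following along, the two ways we understand it can be set (our reading of the Node docs, so treat this as an assumption) are via NODE_OPTIONS, or programmatically at startup:

```js
// Option A (what we'll try first): set it as a service variable, e.g.
//   NODE_OPTIONS=--network-family-autoselection-attempt-timeout=1000
//
// Option B: set the same default programmatically, as early as possible at
// startup (net.setDefaultAutoSelectFamilyAttemptTimeout exists since
// Node 18.18 / 19.8):
import net from 'node:net';

// Give each address-family connection attempt up to 1000 ms (instead of the
// 250 ms default) before the "happy eyeballs" logic moves on to the next address.
net.setDefaultAutoSelectFamilyAttemptTimeout(1000);
```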
After bumping --network-family-autoselection-attempt-timeout to a whole second, we haven't had any more fetch failed errors! I wonder if what's happening is just something in the network path occasionally slowing down under load enough to hit the low 250 ms default timeout — but this kind of networking stuff is honestly over my head. The good news is that so far the problem appears resolved, but we're continuing to monitor…
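For the "continuing to monitor" part, something like this simple probe is what we have in mind (just a sketch; the URL, interval, and timeout are placeholders):

```js
// Tiny probe sketch (URL/interval/timeout are placeholders): fetch the API's
// public URL every 30 s, log the elapsed time, and log error.cause on failure,
// so slow connects show up before they turn into "fetch failed" again.
const TARGET = process.env.PROBE_URL; // e.g. the API's public health endpoint

setInterval(async () => {
  const start = performance.now();
  try {
    const res = await fetch(TARGET, { signal: AbortSignal.timeout(10_000) });
    console.log(`probe: ${res.status} in ${Math.round(performance.now() - start)} ms`);
  } catch (err) {
    console.error(
      `probe failed after ${Math.round(performance.now() - start)} ms:`,
      err.message,
      err.cause,
    );
  }
}, 30_000);
```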
5 months ago
what external API are you calling?
A few. I can't say for sure at the moment whether the issue affects all of them, but it's definitely more than one. The main ones are https://infura.io and https://alchemy.com, and during the periods when we had the problem, both were consistently unreachable for us.
It also affected API calls to our own APIs hosted within the same Railway environment, though. For example, our frontend SSR server talks to our API via the deployment's public ("external") URL, and those requests also failed with Fetch failed errors. I'm aware we really should be using internal networking, but as I mentioned, we had quite some trouble getting that to work, so we decided to use public URLs for now.
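When we do switch, it should just be a matter of making the base URL configurable per environment, something like this (a sketch; the variable names are made up):

```js
// Sketch (variable names are made up): keep the API base URL configurable so we
// can flip a single service variable between the public deployment URL and the
// internal one (something like http://api.railway.internal:PORT) without code changes.
const API_BASE_URL =
  process.env.API_INTERNAL_URL    // internal networking, once we get it working
  ?? process.env.API_PUBLIC_URL;  // public deployment URL (what we use today)

export function apiUrl(path) {
  return new URL(path, API_BASE_URL).toString();
}
```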
Still no more failed fetches since we bumped that node network timeout btw…
Hey - just wanted to say that I am having the same kind of issues with my project. Not sure why it happens.
Anyway, thanks for the tip with the --network-family… I'll give it a shot
Unfortunately, with the network family timeout set to 1 second, both of our frontend services are down again today. Same symptom: fetch failed 😦
Everything was fine overnight, then the problems just started again. It seems to be at least somewhat cyclical…
Some more things we noticed, though take these with a grain of salt since it's so hard to pin down: when a deploy is triggered for a particular service, the instance that's still online seems a lot more likely to hit fetch failed errors while the new version is building and deploying. And after a new container comes online, it usually seems fine for at least a short while before the fetch failed problem returns.
5 months ago
Total random guess here, but have you seen that Infura had some incidents (a 20-hour outage over 2 days) on the JSON-RPC API, which is what I believe you are using in your Postman requests? https://status.infura.io/
Yeah, I did. They have outages every now and then, which is why we implemented automatic failover to Alchemy, which offers the same services. When our service struggles to connect to Infura, it also fails to connect to Alchemy — both with "Fetch failed", seemingly due to connection timeouts. It also wouldn't explain why we get the same failures to connect to our other services deployed on Railway
Oh, and just to be clear: in Postman I was querying an endpoint on our app that simply proxies requests to Alchemy/Infura and injects an access token. Querying Infura directly at the time worked — so the failure was our app service trying to fetch from Infura.
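For context, that proxy endpoint does little more than the following (heavily simplified sketch; env names, URLs, and error handling differ in the real thing):

```js
// Heavily simplified sketch of the proxy (env names/URLs are placeholders):
// forward the JSON-RPC body to Infura with our access token, and fail over
// to Alchemy if the first provider can't be reached.
const PROVIDERS = [
  { name: 'infura',  url: `https://mainnet.infura.io/v3/${process.env.INFURA_KEY}` },
  { name: 'alchemy', url: `https://eth-mainnet.g.alchemy.com/v2/${process.env.ALCHEMY_KEY}` },
];

export async function proxyRpc(body) {
  let lastError;
  for (const provider of PROVIDERS) {
    try {
      const res = await fetch(provider.url, {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify(body),
        signal: AbortSignal.timeout(10_000),
      });
      if (res.ok) return await res.json();
      lastError = new Error(`${provider.name} responded with ${res.status}`);
    } catch (err) {
      // During the incidents, both providers failed here with "fetch failed".
      lastError = err;
    }
  }
  throw lastError;
}
```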
Here's another example that happened just now on a fresh PR preview environment… The API service is up, but during its build the app cannot connect to it via the public URL due to ETIMEDOUT. Previously when this happened, trying again a few moments later would make it randomly work.
… Yep, just hit "redeploy" on the failed build, and it worked. Our build step tries connecting to the API 10 times… This time it failed 3 times, then randomly worked on the 4th. Nothing changed about the API in the meantime.
Good morning! Unsure if this is directly related, but over the weekend we also saw one of our services drop its connection to a DB due to a timeout error. The service is a Node application using Sequelize as its ORM; the DB is the default Postgres template from Railway. As with the other issues, this never happened before in over a year of running this service with an equivalent configuration. Important to note that this connection was using internal private networking, not a public URL.
Unable to connect to the database: SequelizeConnectionError: connect ETIMEDOUT fd12:aa66:dec6:0:4000:a:164d:9420:5432 in SequelizeConnectionError: connect ETIMEDOUT fd12:aa66:dec6:0:4000:a:164d:9420:5432
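As a stopgap, something along these lines could at least retry the startup connection instead of crashing on the first timeout (a sketch; our real config differs):

```js
// Sketch of a startup-retry stopgap (our real config differs): retry
// sequelize.authenticate() a few times instead of crashing on the first
// ETIMEDOUT over the private-network connection.
import { Sequelize } from 'sequelize';

const sequelize = new Sequelize(process.env.DATABASE_URL, {
  dialect: 'postgres',
  logging: false,
});

export async function connectWithRetry(attempts = 5, delayMs = 5_000) {
  for (let i = 1; i <= attempts; i++) {
    try {
      await sequelize.authenticate();
      console.log('DB connection established');
      return sequelize;
    } catch (err) {
      console.warn(`DB connect attempt ${i} failed: ${err.message}`);
      if (i === attempts) throw err;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```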
… Plot thickens. Two minutes after that (Sunday 2:07 AM CET), another service in another testing environment died because it was suddenly unable to connect to a default Redis template instance. The environments are completely separate and we had no deploys or config changes over the weekend at all.
More and more builds are randomly failing today due to ETIMEDOUT when trying to connect to other services. Most of the time it doesn't work, then suddenly it does. As always, the API it's trying to connect to here is up and responding just fine, and fast, to the exact same query that's made in the build step. Either we're missing something big time or there's a serious problem with networking somewhere… :/ We would really appreciate some help at this point.
5 months ago
I'm really not sure what you want me to say here; there have been no issues on our side.
That request was done over the public network, so please check the HTTP logs for it.