My production apps have been failing to communicate with their backends since today

xytric
PRO

8 months ago

Several services hosted on Railway, which have been running for months, have been failing to communicate with each other since this morning. Weirdly enough, the APIs can be accessed externally, but apparently not from other Railway services. The frontend apps use the public domains of the backend services.

Awaiting User Response

19 Replies

xytric
PRO

8 months ago

Now it works sometimes. But not reliably.


xytric
PRO

8 months ago

Still very unstable!


xytric
PRO

8 months ago

Service-to-service communication using public URLs for some of my projects has been unstable for nearly half a day, failing most of the time. This has caused issues for my customers, and they're not happy about it.

I really like Railway as a platform, but the reliability issues have been frustrating. Can you provide an update?


8 months ago

Could you let us know why you're using public URLs for communicating between services?


Status changed to Awaiting User Response Railway 8 months ago


xytric
PRO

8 months ago

I used public URLs for communication between services because it felt like the most logical choice at the time. Most of my APIs are public-facing anyway so IoT devices can access them, and they’re all authenticated. This setup worked fine for over six months, so I didn’t see a reason to change it.

If switching to internal URLs for service-to-service communication is better and will solve the issue, I’m happy to make that change.


Status changed to Awaiting Railway Response Railway 8 months ago


8 months ago

Some timestamps for the public issues would be great. Generally though, internal will be better since it'll be faster and you won't pay egress fees.
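A minimal sketch of what switching to the internal URL could look like on the calling side, assuming the backend is reachable on Railway's private network under a `*.railway.internal` hostname over plain HTTP on an explicit port. The variable names `BACKEND_PRIVATE_HOST`, `BACKEND_PORT`, and `BACKEND_PUBLIC_URL` and the default port are hypothetical; they are not something Railway defines for you.

```typescript
// Hypothetical env var names; use whatever your project actually defines.
type Env = Record<string, string | undefined>;

// Prefer the private-network address when one is configured,
// otherwise fall back to the public URL.
function backendBaseUrl(env: Env): string {
  const privateHost = env.BACKEND_PRIVATE_HOST; // e.g. "api.railway.internal"
  if (privateHost) {
    // Assumption: internal traffic is plain HTTP and needs the port the
    // backend listens on, since no public proxy sits in front of it.
    const port = env.BACKEND_PORT ?? "3000";
    return `http://${privateHost}:${port}`;
  }
  const publicUrl = env.BACKEND_PUBLIC_URL;
  if (!publicUrl) {
    throw new Error("Neither BACKEND_PRIVATE_HOST nor BACKEND_PUBLIC_URL is set");
  }
  // Strip trailing slashes so callers can append paths consistently.
  return publicUrl.replace(/\/+$/, "");
}
```

In a Next.js server component or server action this could be called as `backendBaseUrl(process.env)`. Keeping the public URL as a fallback makes it easy to roll back with a single variable change.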


Status changed to Awaiting User Response Railway 8 months ago


xytric
PRO

8 months ago

Timestamps? Well, there are plenty of those: pretty much every time fetch failed during the day. For example, for this service:
61ddf80b-2a07-4536-bb3e-bc04ee26d0ab

It started at 7:52 GMT+1. Sometimes a request went through, and then the frontend would be able to fetch data from the backend for 5 minutes before timing out again.

Meanwhile, fetching from an external REST client or from any of the IoT devices was no issue.


Status changed to Awaiting Railway Response Railway 8 months ago



Status changed to Awaiting User Response Railway 8 months ago


xytric
PRO

8 months ago

Thanks for looking into this, but I’m honestly not convinced the issue is with my application. Let me explain why.

I’ve had one service running without issues for over six months, and another one has been live even longer than that. This particular app hasn’t been touched in 1.5 months: no updates, no changes, nothing. The only thing I’ve done just now was try to enable private domains, but that didn’t work out of the box. So, I rolled everything back immediately, leaving things exactly as they were before.

Given all this, I have a hard time believing the problem is on my end. Since it’s a Next.js app fetching all data from the server (using RSCs), the HTTP logs are unlikely to reveal much.

Still, if there’s something specific I should look into, let me know, I’m happy to check! But it feels like the root cause might be elsewhere.

It still seems unstable for now, so I will try to get private domains working in the meantime in the hope that it will fix things.


Status changed to Awaiting Railway Response Railway 8 months ago


xytric
PRO

8 months ago

Ironically, it seems that only my Next.js apps using RSCs are affected. My Remix app works just fine using the exact same method, fetching from a public URL (Remix also fetches server-side).


xytric
PRO

8 months ago

Could this have anything to do with the following line from the changelog on Friday?

• Improved DDoS mitigation measures to our global network infrastructure.

Since today is the first workday after the update, my customers wouldn’t have noticed any issues over the weekend. It makes sense that they’re only reporting problems today. 

It almost seems like my apps are prevented from sending any requests after a while.

At one moment it works for a good while; then, after leaving it polling for new data every x seconds, it just fails to fetch anything from the backend URL and all subsequent data loading fails. Then, after waiting a set period of time, it suddenly works again. Once again, fetching directly from the backend with an HTTP client works fine during this period.
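As a stopgap for intermittent failures like these, the server-side fetch can be wrapped in a small retry with exponential backoff. This is only a sketch under assumptions: the helper name `retry` and its defaults are made up for illustration, and it hides the symptom rather than fixing the cause.

```typescript
// Retry an async operation with exponential backoff.
// `attempts` and `baseDelayMs` are illustrative defaults, not recommendations.
async function retry<T>(
  fn: () => Promise<T>,
  attempts = 5,
  baseDelayMs = 200,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        // Wait 200ms, 400ms, 800ms, ... between attempts (with the defaults).
        const delay = baseDelayMs * 2 ** i;
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError;
}
```

Usage would be something like `await retry(() => fetch(url))`. Note that `fetch` only rejects on network-level failures such as the ones described here; an HTTP error status still resolves normally, so it would not trigger a retry.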


8 months ago

Not so: since the behavior of your app depends on the framework, it doesn't seem to be an issue with the Railway network. A Railway proxy issue would indiscriminately affect your requests, public and private.

We checked our proxy logs internally at those timestamps and found nothing of note. You also have those same logs via the HTTP Logs section in your application.

Were there any changes on the Next.js side that would introduce something like this? (Unlikely, but I want you to have confidence in your app on Railway again.)


Status changed to Awaiting User Response Railway 8 months ago


xytric
PRO

8 months ago

Well, I have not changed the Next.js version, the package versions of any of those apps, or the code for that matter, in the past months up until yesterday, when the only thing I did was change an env variable name and try to log anything useful.

The reason you don't see anything in the proxy logs is that the requests didn't even seem to reach the server. After a while they did, and everything went back to normal. That happened pretty much all day and night yesterday.

Now it seems to work again. I have switched one app over to use the private network, and the other two remain the same with the exact same configuration and code, save for an environment variable rename.


Status changed to Awaiting Railway Response Railway 8 months ago


8 months ago

Understood. If we don't see the request, then logs from the upstream provider, if you have them, can help us as well.

But in the meantime, we'll keep our eyes peeled. Feel free to raise this again as soon as you see it.


Status changed to Awaiting User Response Railway 8 months ago


xytric
PRO

8 months ago

Still happening, unfortunately. This morning, fetch from the frontend failed for 15 minutes, and yesterday it failed a bunch of times too. Then just now it started working again. Deployment in question:
f38cbe6d-e39f-4100-8e1b-c0f971772d75

Looks like I'm not the only one with this issue. Another thread with the same issue was opened yesterday:
https://help.railway.com/questions/last-night-cet-random-fetch-failed-5d71a0b7

I will try to move that service to the internal network to see if that fixes things.


Status changed to Awaiting Railway Response Railway 8 months ago


xytric
PRO

8 months ago

That definitely did not fix things. If anything, it made things even worse: requests only reached the backend after 4-5 retries. I reverted to using the public URL again.

Deployment that kept on failing to fetch: 2e021c4f-a990-4fad-8b1b-e3da2ca91421
And when checking the backend logs, every time a fetch failed it simply failed to reach the backend: there were no HTTP logs or any logs whatsoever on the backend, even over the private network.

I want to clarify that this app worked fine for over a year, with the last dependency update for Next.js being over 2 months ago. So no, I don't think the issue is application related. If it were, then at least fetch would get through to the backend, as I'm not doing anything network or proxy related. All I'm doing is calling fetch to the backend, mostly from Next server actions and RSCs.


8 months ago

Acking the report. However, if the failure also occurs over the private network, then it's not the proxy, as the private network doesn't use a system to forward packets.

It can be a number of things, such as load and other factors. I would try to add more profiling and logs to your application.
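One way to add that kind of logging around server-side calls, sketched under assumptions: the helpers `describeError` and `timed` are hypothetical names, and the error shape assumed is the one Node's built-in `fetch` produces on network failures (a `TypeError` with message `fetch failed` whose `cause` carries the underlying error and, often, a `code` such as `ECONNREFUSED` or `ETIMEDOUT`).

```typescript
// Turn an unknown thrown value into one log-friendly line,
// including the underlying cause when the runtime provides one.
function describeError(err: unknown): string {
  if (!(err instanceof Error)) return String(err);
  const cause = (err as Error & { cause?: unknown }).cause;
  let causeText = "";
  if (cause instanceof Error) {
    const code = (cause as Error & { code?: string }).code ?? cause.name;
    causeText = ` (cause: ${code}: ${cause.message})`;
  }
  return `${err.name}: ${err.message}${causeText}`;
}

// Run an async operation, logging how long it took and why it failed.
async function timed<T>(
  label: string,
  fn: () => Promise<T>,
  log: (line: string) => void = console.log,
): Promise<T> {
  const start = Date.now();
  try {
    const result = await fn();
    log(`${label} ok in ${Date.now() - start}ms`);
    return result;
  } catch (err) {
    log(`${label} FAILED after ${Date.now() - start}ms: ${describeError(err)}`);
    throw err; // rethrow so existing error handling still runs
  }
}
```

Wrapping a call as `await timed("GET /devices", () => fetch(url))` would show in the frontend's deploy logs whether a failure is a DNS error, a refused connection, or a timeout, which is the detail that is missing when the request never reaches the backend. The label and URL here are placeholders.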


Status changed to Awaiting User Response Railway 8 months ago


xytric
PRO

8 months ago

After reviewing a suggestion from Brody in another thread, I discovered that the issues I had stemmed from my services still being on the legacy runtime. I had assumed all apps had already been migrated automatically, which led to some confusion. It's strange that the setup was stable for so long until problems started surfacing this week. I probably should have switched over earlier, as that would have saved me a few days of frustration.

The issue is now resolved, thanks for your efforts in helping troubleshoot!


Status changed to Awaiting Railway Response Railway 8 months ago


8 months ago

Can't wait till that option is gone...

Glad to see it resolved.


Status changed to Awaiting User Response Railway 8 months ago