Internal network limitations
brunohiis
PROOP

7 months ago

Hey!
We are doing some load testing to make sure that our setup is using all the resources it has.
However, we have found an interesting case where the internal networking seems to die in certain cases.
E.g. when we have our backend running in machine A and try to query it from machine B via the internal network (e.g. curl backend:5000), we get "Failed to connect to backend port 5000 after 142 ms: Could not connect to server". (This happens under a load test; it works fine normally.)
However, in machine A (where the backend is running), we are still able to fetch a response from localhost during that time.
Are there any limitations for the internal network that we should be aware of?

34 Replies

brunohiis
PROOP

7 months ago

(attachment)


brody
EMPLOYEE

7 months ago

By "machine" do you mean "service"?


brunohiis
PROOP

7 months ago

yes, sorry


brunohiis
PROOP

7 months ago

(screenshot attachment)


brody
EMPLOYEE

7 months ago

That's for public connections.


brody
EMPLOYEE

7 months ago

What kind of service is service A?


brunohiis
PROOP

7 months ago

It's running a NodeJS backend managed by PM2 in cluster mode
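For reference, a minimal sketch of that kind of setup (the entry point `server.js` and the process name are assumed placeholders, not taken from this thread):

```shell
# Hypothetical PM2 cluster-mode setup: 30 worker processes sharing one port.
pm2 start server.js -i 30 --name backend

# Inspect worker count and per-instance CPU/memory usage.
pm2 list
```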


brunohiis
PROOP

7 months ago

I think it's using like 30 clusters


brody
EMPLOYEE

7 months ago

Aka 30 instances of the application?


brunohiis
PROOP

7 months ago

Yup


brody
EMPLOYEE

7 months ago

What was your vCPU during load testing?


brunohiis
PROOP

7 months ago

very low, it hit like 5 cpu cores max


brunohiis
PROOP

7 months ago

it should use 1 cpu per cluster as the max


brody
EMPLOYEE

7 months ago

Then it seems like you are CPU starved.


brunohiis
PROOP

7 months ago

(screenshot attachment)


brunohiis
PROOP

7 months ago

we have set this as the resource limit tho


brunohiis
PROOP

7 months ago

frontend seems to scale fine up to 25 cores with basically the same setup


brunohiis
PROOP

7 months ago

frontend is running a basic nextjs app in a similar cluster mode as backend


brody
EMPLOYEE

7 months ago

Are you sure pm2 is distributing the incoming requests to all 30 instances?


brunohiis
PROOP

7 months ago

yes, we have the same setup in our current production server and it's working perfectly


brody
EMPLOYEE

7 months ago

Then your backend should be able to vertically scale to about 30 vCPU, so something in the code is the bottleneck.


rohukas
PRO

7 months ago

Hopping in as well.
Our main problem is networking between the two services: ingress-proxy <-> backend.

During load testing ingress-proxy loses connection to backend.

Example:
/ (ingress-proxy) # curl backend.railway.internal:5000
curl: (7) Failed to connect to backend.railway.internal port 5000 after 280 ms: Could not connect to server

The backend instance is perfectly fine and running at that point when running commands in it:
/ (backend) # curl localhost:5000
{"message":"Organization not found","success":false}


rohukas
PRO

7 months ago

It's as if the connection between the two boxes is cut off after a certain limit.


brody
EMPLOYEE

7 months ago

Could you give me a sense of what kind of load you are putting on the backend during this? RPS?


brunohiis
PROOP

7 months ago

(screenshot attachment)


brunohiis
PROOP

7 months ago

This is sent to ingress-proxy service, which forwards most of the stuff to the backend service


brunohiis
PROOP

7 months ago

Should be around 2k rps during the peak of the load test
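For what it's worth, a peak of roughly 2k RPS could be reproduced outside the app with a standalone generator such as hey (the target URL here is a placeholder):

```shell
# Hypothetical reproduction of the ~2k RPS peak:
# 200 concurrent workers, each rate-limited to 10 req/s, for 60 seconds.
hey -z 60s -c 200 -q 10 http://ingress-proxy.example.com/
```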


brody
EMPLOYEE

7 months ago

There's no set RPS limit via the internal network fwiw.


brunohiis
PROOP

7 months ago

Hmm okay, we'll run some additional tests to gather more info


brody
EMPLOYEE

7 months ago

The endpoints that are being hit (not by curl), are they doing db operations?

And admittedly, "could not connect to server" isn't too informative, could you see about getting a far more verbose error?
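A sketch of a more verbose probe (standard curl flags; the explicit timeouts help separate a slow connect from an outright refusal):

```shell
# Verbose connection diagnostics against the internal hostname.
# -v prints DNS resolution and TCP handshake details.
curl -v --connect-timeout 5 --max-time 10 backend.railway.internal:5000
# The exit code narrows down the failure mode:
# 6 = could not resolve host, 7 = failed to connect, 28 = timed out.
echo "curl exit code: $?"
```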


rohukas
PRO

7 months ago

Yeah, gonna run tons of tests. I think maybe the backend VPS has some bad network configs, stuff like available sockets being exhausted, etc.
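A few quick checks for that theory (standard Linux interfaces, run inside the backend container):

```shell
# Signs of socket/file-descriptor exhaustion on the backend host.
ulimit -n                                   # per-process open-file/socket limit
cat /proc/sys/net/core/somaxconn            # listen-backlog cap for pending connections
cat /proc/sys/net/ipv4/ip_local_port_range  # ephemeral port range for outbound connects
cat /proc/net/sockstat                      # in-use TCP socket counts (watch tw/alloc)
```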


brunohiis
PROOP

7 months ago

since the frontend service is working fine under the load, it's 99% something on our end; we'll play around and provide an update soon


passos
MODERATOR

7 months ago

btw, any reason to not use Railway's replica system over pm2?


brody
EMPLOYEE

7 months ago

There is always the possibility it's on our end, but in all my years of doing support for Railway, I've never encountered someone running into any sort of RPS limits on the private network, so please let me know what you find.

