Internal network limitations
brunohiis
PRO OP

a year ago

Hey!

We are doing some load testing to make sure that our setup is using all the resources it has.

However, we have found a case where the internal networking seems to die.

For example, when our backend is running on machine A and we query it from machine B over the internal network (e.g. curl backend:5000), we get "Failed to connect to backend port 5000 after 142 ms: Could not connect to server". (This only happens under a load test; it works fine normally.)

However, on machine A (where the backend is running) we can still fetch a response from localhost during that time.

Are there any limitations for the internal network that we should be aware of?

34 Replies

brunohiis
PRO OP

a year ago

fd1a1c68-5bf0-4d48-afaa-fa810942c096


a year ago

By "machine" do you mean "service"?


brunohiis
PRO OP

a year ago

yes, sorry


brunohiis
PRO OP

a year ago

[attachment not preserved]


a year ago

That's for public connections.


a year ago

What kind of service is service A?


brunohiis
PRO OP

a year ago

It's running a Node.js backend managed by PM2 in cluster mode


brunohiis
PRO OP

a year ago

I think it's using like 30 clusters


a year ago

Aka 30 instances of the application?


brunohiis
PRO OP

a year ago

Yup


a year ago

What was your vCPU during load testing?


brunohiis
PRO OP

a year ago

very low, it hit like 5 cpu cores max


brunohiis
PRO OP

a year ago

it should use 1 cpu per cluster as the max


a year ago

Then it seems like you are CPU starved.


brunohiis
PRO OP

a year ago

[attachment not preserved]


brunohiis
PRO OP

a year ago

we have set this as the resource limit tho


brunohiis
PRO OP

a year ago

frontend seems to scale fine up to 25 cores with basically the same setup


brunohiis
PRO OP

a year ago

frontend is running a basic Next.js app in a similar cluster mode to the backend


a year ago

Are you sure pm2 is distributing the incoming requests to all 30 instances?
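
One way to check this is to have each worker log its instance id once per handled request and then tally the log. A minimal sketch, assuming the id comes from the NODE_APP_INSTANCE environment variable that PM2 sets in cluster mode; the per-request logging and the tallyByInstance helper are hypothetical and not something the thread describes.

```javascript
// tally.js - hypothetical helper, not part of the original thread.
// Assumes each worker logged its instance id (for example
// process.env.NODE_APP_INSTANCE under PM2 cluster mode) once per request.

// Count requests per instance and report how uneven the spread is.
function tallyByInstance(instanceIds, expectedInstances) {
  const counts = new Map();
  for (const id of instanceIds) {
    counts.set(id, (counts.get(id) || 0) + 1);
  }
  const values = [...counts.values()];
  const max = values.length ? Math.max(...values) : 0;
  // Instances that never appear in the log count as zero requests.
  const allSeen = values.length > 0 && counts.size >= expectedInstances;
  const min = allSeen ? Math.min(...values) : 0;
  return {
    counts: Object.fromEntries(counts),
    active: counts.size,
    idle: Math.max(expectedInstances - counts.size, 0),
    min,
    max,
  };
}
```

With 30 instances, a large idle count or a max far above min during the load test would suggest the requests are not being spread evenly.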


brunohiis
PRO OP

a year ago

yes, we have the same setup in our current production server and it's working perfectly


a year ago

Then your backend should be able to vertically scale to about 30 vCPU, so something in the code is the bottleneck.


rohukas
PRO

a year ago

Hopping in as well.

Our main problem is networking between the two services: ingress-proxy <-> backend.

During load testing ingress-proxy loses connection to backend.

Example:

/ (ingress-proxy) # curl backend.railway.internal:5000

curl: (7) Failed to connect to backend.railway.internal port 5000 after 280 ms: Could not connect to server

The backend instance is perfectly fine and running at that point when running commands in it:

/ (backend) # curl localhost:5000

{"message":"Organization not found","success":false}


rohukas
PRO

a year ago

It's as if the connection between the two boxes is cut off after a certain limit.


a year ago

Could you give me a sense of what kind of load you are putting on the backend during this? RPS?


brunohiis
PRO OP

a year ago

[attachment not preserved]


brunohiis
PRO OP

a year ago

This is sent to ingress-proxy service, which forwards most of the stuff to the backend service


brunohiis
PRO OP

a year ago

Should be around 2k rps during the peak of the load test
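
For a rough sense of scale at that rate: if every request opened a fresh TCP connection between the two services (an assumption; the thread does not say whether keep-alive is used), the side that closes connections first would accumulate sockets in TIME_WAIT. The sketch below does that arithmetic with common Linux defaults (a 60-second TIME_WAIT and an ephemeral port range of 32768 to 60999). estimateTimeWait is a hypothetical name, and the real values in the containers may differ.

```javascript
// Back-of-the-envelope estimate, not a measurement.
// Assumed Linux defaults; check the actual container settings.
const TIME_WAIT_SECONDS = 60;
const DEFAULT_PORT_RANGE = [32768, 60999];

// Estimate how many sockets sit in TIME_WAIT at a steady rate of
// new connections, and compare that with the ephemeral port range.
function estimateTimeWait(newConnectionsPerSecond, portRange = DEFAULT_PORT_RANGE) {
  const availablePorts = portRange[1] - portRange[0] + 1;
  const socketsInTimeWait = newConnectionsPerSecond * TIME_WAIT_SECONDS;
  return {
    availablePorts,
    socketsInTimeWait,
    exceedsPortRange: socketsInTimeWait > availablePorts,
  };
}

// 2000 is the peak request rate mentioned above, used as an upper bound.
console.log(estimateTimeWait(2000));
```

If connections are reused, the number of new connections per second is far lower than the request rate and this estimate does not apply.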


a year ago

There's no set RPS limit via the internal network fwiw.


brunohiis
PRO OP

a year ago

Hmm okay, we'll run some additional tests to gather more info


a year ago

The endpoints that are being hit (not by curl), are they doing db operations?

And admittedly, "could not connect to server" isn't too informative, could you see about getting a far more verbose error?
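
Besides rerunning curl with its -v flag, one way to get a more specific error is to open a raw TCP connection from the calling service and print the error code Node.js reports. A minimal sketch, assuming Node.js is available in the ingress-proxy container; probe and describeConnectError are hypothetical helper names, and the host and port in the usage line are the ones quoted earlier in the thread.

```javascript
// probe.js - hypothetical helper, not part of the original thread.
// Usage (from the calling service): node probe.js backend.railway.internal 5000
const net = require('net');

// Map common Node.js socket error codes to a short explanation.
function describeConnectError(code) {
  const hints = {
    ECONNREFUSED: 'host reachable, but nothing accepted the connection on that port',
    ETIMEDOUT: 'no answer before the timeout (dropped packets or an overloaded listener)',
    ENOTFOUND: 'DNS lookup for the hostname failed',
    EHOSTUNREACH: 'no route to the host',
    EADDRNOTAVAIL: 'no local address or port available for the outgoing connection',
  };
  return hints[code] || 'unrecognized error code';
}

// Try one TCP connection; resolve with the outcome instead of throwing.
function probe(host, port, timeoutMs = 2000) {
  return new Promise((resolve) => {
    const started = Date.now();
    const socket = net.connect({ host, port });
    const finish = (result) => {
      socket.destroy();
      resolve({ ...result, ms: Date.now() - started });
    };
    // Our own timeout is reported with the same code the kernel would use.
    socket.setTimeout(timeoutMs, () => finish({ ok: false, code: 'ETIMEDOUT' }));
    socket.on('connect', () => finish({ ok: true, code: null }));
    socket.on('error', (err) => finish({ ok: false, code: err.code }));
  });
}

// Only runs when a host and port are passed on the command line.
const [host, port] = process.argv.slice(2);
if (host && port) {
  probe(host, Number(port)).then((r) => {
    console.log(
      r.ok
        ? `connected in ${r.ms} ms`
        : `${r.code} after ${r.ms} ms: ${describeConnectError(r.code)}`
    );
  });
}
```

Running this in a loop during the load test would show whether the failures are refusals, timeouts, or something else, which narrows down where to look.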


rohukas
PRO

a year ago

Yeah, gonna run tons of tests. I think maybe the backend VPS has some bad network configs. Stuff like available sockets being exhausted etc
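
A quick way to look at socket usage from inside a Linux container is to read /proc/net/sockstat while the load test is running. A minimal sketch, assuming a Linux container with Node.js available; parseSockstat is a hypothetical helper, and the sample line in the comment only illustrates the file's format.

```javascript
// sockstat.js - hypothetical helper, not part of the original thread.
// Usage inside the container: node sockstat.js --print
const fs = require('fs');

// Parse the text of /proc/net/sockstat into { section: { key: number } }.
// Each line looks like: "TCP: inuse 12 orphan 0 tw 5 alloc 15 mem 2"
function parseSockstat(text) {
  const stats = {};
  for (const line of text.split('\n')) {
    const [label, rest] = line.split(':');
    if (!rest) continue; // skip blank or malformed lines
    const tokens = rest.trim().split(/\s+/);
    const section = {};
    for (let i = 0; i + 1 < tokens.length; i += 2) {
      section[tokens[i]] = Number(tokens[i + 1]);
    }
    stats[label.trim()] = section;
  }
  return stats;
}

if (process.argv.includes('--print')) {
  // Linux only; the file does not exist on other platforms.
  console.log(parseSockstat(fs.readFileSync('/proc/net/sockstat', 'utf8')));
}
```

Comparing the TCP counters (for example tw, the sockets in TIME_WAIT, and alloc) between idle and peak load in both services would show whether socket usage climbs sharply when the connections start failing.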


brunohiis
PRO OP

a year ago

since the frontend service is working fine under the load it's 99% some shit on our end, we'll play around and provide an update soon


a year ago

btw, any reason to not use Railway's replica system over pm2?


a year ago

There is always the possibility it's on our end, but in all my years of doing support for Railway, I've never encountered someone running into any sort of RPS limits on the private network, so please let me know what you find.

