Internal network limitations
brunohiis
PROOP

7 months ago

Hey!
We are doing some load testing to make sure that our setup is using all the resources it has.
However, we have found an interesting case where the internal networking seems to die in certain cases.
E.g. when we have our backend running in machine A and try to query it from machine B via the internal network (e.g. curl backend:5000), we get "Failed to connect to backend port 5000 after 142 ms: Could not connect to server". (This happens under a load test; it works fine normally.)
However, in machine A (where the backend is running), we are still able to fetch a response from localhost during that time.
Are there any limitations for the internal network that we should be aware of?

34 Replies

brunohiis
PROOP

7 months ago

(attachment)


brody
EMPLOYEE

7 months ago

By "machine" do you mean "service"?


brunohiis
PROOP

7 months ago

yes, sorry


brunohiis
PROOP

7 months ago

(screenshot attachment)


brody
EMPLOYEE

7 months ago

That's for public connections.


brody
EMPLOYEE

7 months ago

What kind of service is service A?


brunohiis
PROOP

7 months ago

It's running a NodeJS backend managed by PM2 in cluster mode
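For reference, a minimal sketch of that kind of setup (the entry point `server.js` and the process name are assumed placeholders, not taken from this thread):

```shell
# Hypothetical PM2 cluster-mode setup: 30 worker processes sharing one port.
pm2 start server.js -i 30 --name backend

# Inspect worker count and per-instance CPU/memory usage.
pm2 list
```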


brunohiis
PROOP

7 months ago

I think it's using like 30 clusters


brody
EMPLOYEE

7 months ago

Aka 30 instances of the application?


brunohiis
PROOP

7 months ago

Yup


brody
EMPLOYEE

7 months ago

What was your vCPU during load testing?


brunohiis
PROOP

7 months ago

very low, it hit like 5 cpu cores max


brunohiis
PROOP

7 months ago

it should use 1 cpu per cluster as the max


brody
EMPLOYEE

7 months ago

Then it seems like you are CPU starved.


brunohiis
PROOP

7 months ago

(screenshot attachment)


brunohiis
PROOP

7 months ago

we have set this as the resource limit tho


brunohiis
PROOP

7 months ago

frontend seems to scale fine up to 25 cores with basically the same setup


brunohiis
PROOP

7 months ago

frontend is running a basic nextjs app in a similar cluster mode as backend


brody
EMPLOYEE

7 months ago

Are you sure pm2 is distributing the incoming requests to all 30 instances?


brunohiis
PROOP

7 months ago

yes, we have the same setup in our current production server and it's working perfectly


brody
EMPLOYEE

7 months ago

Then your backend should be able to vertically scale to about 30 vCPU, so something in the code is the bottleneck.


rohukas
PRO

7 months ago

Hopping in as well.
Our main problem is networking between the two services: ingress-proxy <-> backend.

During load testing ingress-proxy loses connection to backend.

Example:
/ (ingress-proxy) # curl backend.railway.internal:5000
curl: (7) Failed to connect to backend.railway.internal port 5000 after 280 ms: Could not connect to server

The backend instance is perfectly fine and running at that point when running commands in it:
/ (backend) # curl localhost:5000
{"message":"Organization not found","success":false}


rohukas
PRO

7 months ago

It's as if the connection between the two boxes is cut off after a certain limit.


brody
EMPLOYEE

7 months ago

Could you give me a sense of what kind of load you are putting on the backend during this? RPS?


brunohiis
PROOP

7 months ago

(screenshot attachment)


brunohiis
PROOP

7 months ago

This is sent to ingress-proxy service, which forwards most of the stuff to the backend service


brunohiis
PROOP

7 months ago

Should be around 2k rps during the peak of the load test
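For what it's worth, a peak of roughly 2k RPS could be reproduced outside the app with a standalone generator such as hey (the target URL here is a placeholder):

```shell
# Hypothetical reproduction of the ~2k RPS peak:
# 200 concurrent workers, each rate-limited to 10 req/s, for 60 seconds.
hey -z 60s -c 200 -q 10 http://ingress-proxy.example.com/
```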


brody
EMPLOYEE

7 months ago

There's no set RPS limit via the internal network fwiw.


brunohiis
PROOP

7 months ago

Hmm okay, we'll run some additional tests to gather more info


brody
EMPLOYEE

7 months ago

The endpoints that are being hit (not by curl), are they doing db operations?

And admittedly, "could not connect to server" isn't too informative, could you see about getting a far more verbose error?
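A sketch of a more verbose probe (standard curl flags; the explicit timeouts help separate a slow connect from an outright refusal):

```shell
# Verbose connection diagnostics against the internal hostname.
# -v prints DNS resolution and TCP handshake details.
curl -v --connect-timeout 5 --max-time 10 backend.railway.internal:5000
# The exit code narrows down the failure mode:
# 6 = could not resolve host, 7 = failed to connect, 28 = timed out.
echo "curl exit code: $?"
```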


rohukas
PRO

7 months ago

Yeah, gonna run tons of tests. I think maybe the backend VPS has some bad network configs, stuff like available sockets being exhausted, etc.
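A few quick checks for that theory (standard Linux interfaces, run inside the backend container):

```shell
# Signs of socket/file-descriptor exhaustion on the backend host.
ulimit -n                                   # per-process open-file/socket limit
cat /proc/sys/net/core/somaxconn            # listen-backlog cap for pending connections
cat /proc/sys/net/ipv4/ip_local_port_range  # ephemeral port range for outbound connects
cat /proc/net/sockstat                      # in-use TCP socket counts (watch tw/alloc)
```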


brunohiis
PROOP

7 months ago

since the frontend service is working fine under the load, it's 99% something on our end; we'll play around and provide an update soon


passos
MODERATOR

7 months ago

btw, any reason to not use Railway's replica system over pm2?


brody
EMPLOYEE

7 months ago

There is always the possibility it's on our end, but in all my years of doing support for Railway, I've never encountered someone running into any sort of RPS limits on the private network, so please let me know what you find.

