7 months ago
Hey!
We are doing some load testing to make sure that our setup is using all the resources it has.
However, we have found an interesting case where the internal networking seems to die in certain cases.
E.g. when our backend is running on machine A and we query it from machine B over the internal network (e.g. curl backend:5000), we get "Failed to connect to backend port 5000 after 142 ms: Could not connect to server". (This only happens under a load test; it works fine normally.)
However, on machine A (where the backend is running), we can successfully fetch a response from localhost during that time.
Are there any limitations for the internal network that we should be aware of?
34 Replies
7 months ago
By "machine" do you mean "service"?
7 months ago
That's for public connections.
7 months ago
What kind of service is service A?
7 months ago
Aka 30 instances of the application?
7 months ago
What was your vCPU during load testing?
7 months ago
Then it seems like you are CPU starved.
The frontend is running a basic Next.js app in a similar cluster mode to the backend.
7 months ago
Are you sure pm2 is distributing the incoming requests to all 30 instances?
Yes, we have the same setup on our current production server and it's working perfectly.
7 months ago
Then your backend should be able to vertically scale to about 30 vCPU, so something in code is a bottleneck.
Hopping in as well.
Our main problem is networking between the two services: ingress-proxy <-> backend.
During load testing ingress-proxy loses connection to backend.
Example:
/ (ingress-proxy) # curl backend.railway.internal:5000
curl: (7) Failed to connect to backend.railway.internal port 5000 after 280 ms: Could not connect to server
The backend instance is perfectly fine and running at that point when running commands in it:
/ (backend) # curl localhost:5000
{"message":"Organization not found","success":false}
It's as if the connection between the two boxes is cut off past a certain limit.
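One way to probe that theory is to ramp up concurrent requests and count connect failures; a rough sketch (hostname and port taken from the messages above, the counts 200/50 are arbitrary):

```shell
# Fire 200 requests, 50 in parallel, at the internal hostname and
# tally the resulting HTTP status codes. curl prints "000" when the
# TCP connect itself fails, so a mix of 200s and 000s that only
# appears past some concurrency level supports the "cut off after
# a limit" theory.
seq 200 | xargs -P 50 -I{} \
  curl -s -o /dev/null --connect-timeout 5 \
       -w '%{http_code}\n' http://backend.railway.internal:5000/ \
  | sort | uniq -c
```

Re-running with different `-P` values gives a feel for the concurrency level at which failures start.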
7 months ago
Could you give me a sense of what kind of load you are putting on the backend during this? RPS?
This is sent to the ingress-proxy service, which forwards most of the traffic to the backend service.
7 months ago
There's no set RPS limit via the internal network fwiw.
7 months ago
The endpoints that are being hit (not by curl), are they doing db operations?
And admittedly, "could not connect to server" isn't very informative; could you see about getting a far more verbose error?
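For a more verbose capture, something like this (same internal hostname as in the messages above) shows whether the failure is DNS resolution or the TCP connect itself:

```shell
# Re-run the failing request with verbose output; -v prints each phase
# (name lookup, TCP connect, request/response) so you can see where it stalls.
curl -v --connect-timeout 5 http://backend.railway.internal:5000/

# Check DNS separately: does the private hostname still resolve
# at the moment the connect is failing?
getent hosts backend.railway.internal
```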
Yeah, gonna run a bunch of tests. I think maybe the backend VPS has some bad network configs, e.g. available sockets being exhausted.
Since the frontend service is working fine under the same load, it's 99% something on our end; we'll dig in and provide an update soon.
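If socket exhaustion is the suspicion, a few quick Linux-side checks are easy to run inside the backend container (generic Linux diagnostics, not Railway-specific):

```shell
# Kernel socket accounting: the TCP line shows "inuse" and "tw"
# (TIME-WAIT) counts; a very large tw count points at connection
# churn eating the ephemeral port range.
cat /proc/net/sockstat

# Ephemeral ports available for outbound connections; each open
# proxy->backend connection consumes one from this range.
cat /proc/sys/net/ipv4/ip_local_port_range

# Per-process open-file limit; every socket uses one descriptor.
ulimit -n
```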
7 months ago
btw, any reason not to use Railway's replica system instead of pm2?
7 months ago
There is always the possibility it's on our end, but in all my years of doing support for Railway, I've never encountered someone running into any sort of RPS limits on the private network, so please let me know what you find.