a year ago
I have a service with 40+ replicas, and another service (inside the private network) sends requests to it at around 30-80 rps. Sometimes all requests get handled by a single replica, and sometimes by just a few (2-3 replicas). This is very bad for us. Is there anything we can do at the app level to "force" or improve the load balancing so it spreads across more replicas?
Both the source and destination services use the v2 runtime, if that changes anything.
25 Replies
a year ago
Can you try disabling "Enable new proxy" inside your service and see if that resolves it?
a year ago
(I had already typed a lot when Jake's message came through but I still think the following message is helpful)
How replicas are handled over the private network is very different from how they are handled over the public network, as in you have to handle them.
When you access a service with replicas over the public network Railway's proxy does round-robin load balancing on the incoming requests.
But there is no proxy or anything like that in the private network, so replicas are conveyed as multiple DNS results, as shown here; the hello world service has 5 replicas.
So the app that calls the service with the replicas has to keep a dynamic record of the DNS results and round-robin between them to spread the load across all the replicas. I show that being done with Caddy here, but it can be done in code as well.
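For anyone who wants to do the round-robin part in code rather than with Caddy, here is a minimal sketch of the idea. `ReplicaRotator` and `Resolver` are hypothetical names made up for illustration; they are not part of Railway, Caddy, or axios. The resolver is passed in so that any DNS lookup function returning a list of addresses can be used.

```typescript
// Hypothetical helper for illustration; not a Railway or axios API.
// A Resolver is any function that turns a hostname into a list of addresses.
type Resolver = (hostname: string) => Promise<string[]>;

class ReplicaRotator {
  private addresses: string[] = [];
  private cursor = 0;

  // Replace the known replica addresses, e.g. after a fresh DNS lookup.
  setAddresses(addresses: string[]): void {
    this.addresses = [...addresses];
    // Keep the cursor valid if the replica count shrank.
    if (this.cursor >= this.addresses.length) {
      this.cursor = 0;
    }
  }

  // Return the next address in round-robin order.
  next(): string {
    if (this.addresses.length === 0) {
      throw new Error("no replica addresses known yet");
    }
    const address = this.addresses[this.cursor];
    this.cursor = (this.cursor + 1) % this.addresses.length;
    return address;
  }

  // Re-resolve the hostname so added or redeployed replicas are picked up.
  async refresh(hostname: string, resolve: Resolver): Promise<void> {
    this.setAddresses(await resolve(hostname));
  }
}
```

In Node, `resolve6` from `node:dns/promises` should fit the `Resolver` shape, and calling `refresh` on a timer keeps the list current as replicas come and go. Each outgoing request then asks `next()` which replica to target.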
a year ago
I would also firmly recommend you cut down the number of replicas you have. What you're currently running can run on 2-3 replicas.
a year ago
I would also firmly recommend you cut down the number of replicas you have. What you're currently running can run on 2-3 replicas.
That’s no good for us, we use those services as “workers”, and many of the jobs end up waiting in the queue due to the “lack” of replicas.
I don’t think I understand your DNS answer.
When I set replicas=50 I, as a user, expect the requests to be spread across all replicas.
Btw can we contact sales and request a higher limit of replicas? (Assuming we solve the “sticky” replicas issue discussed here)
a year ago
I thought I did a very good job at clearly explaining the caveat with replicas over the private network, and the solution.
Railway no longer has control over how the requests are sent out to the service with replicas over the private network, that's something your app needs to handle by load balancing dynamically with the multiple IPs returned by the DNS query.
a year ago
That’s no good for us, we use those services as “workers”, and many of the jobs end up waiting in the queue due to the “lack” of replicas.
Could you not bump the number of jobs that can be sent to those instances? Are they only capable of doing one job per instance?
Btw can we contact sales and request a higher limit of replicas? (Assuming we solve the “sticky” replicas issue discussed here)
Absolutely. Wanna grab some time at https://cal.com/team/railway/demonew and we can chat?
a year ago
I thought I did a very good job at clearly explaining the caveat with replicas over the private network, and the solution.
Railway no longer has control over how the requests are sent out to the service with replicas over the private network, that's something your app needs to handle by load balancing dynamically with the multiple IPs returned by the DNS query.
After reading your message a few times and seeing the example I think I got it - so basically a single DNS zone might point to multiple IPs.
This is a long-shot question, since you’re not responsible for it: do you happen to know if axios has any built in mechanism to handle that?
a year ago
That’s no good for us, we use those services as “workers”, and many of the jobs end up waiting in the queue due to the “lack” of replicas.
Could you not bump the number of jobs that can be sent to those instances? Are they only capable of doing one job per instance?
Btw can we contact sales and request a higher limit of replicas? (Assuming we solve the “sticky” replicas issue discussed here)
Absolutely. Wanna grab some time at https://cal.com/team/railway/demonew and we can chat?
We need as much parallelism as possible.
And thank you for the calendar link, we’ll definitely use that once we’re ready to bump the amount of replicas.
Side note: I’m shocked from the super impressive support you guys provide, thank you so much, this service is worth every penny.
a year ago
so basically a single DNS zone might point to multiple IPs
A single DNS query will contain multiple IPs according to how many replicas you have.
do you happen to know if axios has any built in mechanism to handle that?
I've only done this with Caddy, as I showed in my example, but I'm sure you can do something in Node with either axios or something that sits in front of axios. Surely someone has written a module to do what you want. Alternatively, you can stick Caddy in front of the service with replicas and make your requests through Caddy.
a year ago
We need as much parallelism as possible.
And thank you for the calendar link, we’ll definitely use that once we’re ready to bump the amount of replicas.
For sure, but you have 32 vCPU per replica + memory. There's definitely parallelism that's possible.
Regardless, grab some time, lets chat about it live maybe tomorrow or Friday
Side note: I’m shocked from the super impressive support you guys provide, thank you so much, this service is worth every penny.
We try! It's hard but we try. Glad to hear you think we're doing a good job and we will keep it up!
a year ago
We need as much parallelism as possible.
And thank you for the calendar link, we’ll definitely use that once we’re ready to bump the amount of replicas.
For sure, but you have 32 vCPU per replica + memory. There's definitely parallelism that's possible.
Yep, we are using every vCPU and every byte of available RAM.
We are running very resource-intensive OCR jobs.
a year ago
This job right? https://railway.app/project/f4832e31-a99b-4094-b4d4-35af53cfbebe/service/6dd5a5e0-359e-46f5-a725-4c919d31bcd0/metrics
I see it's using like 2GB and hard max 8 vCPU? Like, mostly flat here
With 50 replicas at 32vCPU, you have like, 1600 vCPU and 1.6TB of RAM headroom. Arguably we shouldn't even allow you to allocate that, but hey, we try to be super amenable :laugh:
Let me know if I'm missing something.
a year ago
This job right? https://railway.app/project/f4832e31-a99b-4094-b4d4-35af53cfbebe/service/6dd5a5e0-359e-46f5-a725-4c919d31bcd0/metrics
I see it's using like 2GB and hard max 8 vCPU? Like, mostly flat here
With 50 replicas at 32vCPU, you have like, 1600 vCPU and 1.6TB of RAM headroom. Arguably we shouldn't even allow you to allocate that, but hey, we try to be super amenable :laugh:
Let me know if I'm missing something.
You’re spot on, but we haven’t run intensive jobs recently. Try filtering by the last 7 days (or just take a look at the attached screenshot).
Attachments
a year ago
Gotchya. That's still about 3 replicas worth.
I think Brody kinda mentioned it before, but if you want to be able to do intelligent routing, you can just query DNS and then send each request to one of the returned IPs, chosen at random.
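As a rough sketch of the random-pick approach, assuming the list of IPs has already been fetched from DNS: `pickRandom` is a hypothetical helper name, and the `random` parameter exists only so the choice can be made deterministic in tests.

```typescript
// Hypothetical helper for illustration: pick one replica address at random.
// `random` defaults to Math.random and must return a value in [0, 1).
function pickRandom(
  addresses: string[],
  random: () => number = Math.random
): string {
  if (addresses.length === 0) {
    throw new Error("DNS returned no addresses");
  }
  const index = Math.floor(random() * addresses.length);
  return addresses[index];
}
```

Compared with round-robin, random selection needs no shared counter between requests, and over many requests the load should even out across the replicas.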
We're talking internally about doing this automatically for people, but no timeline on that. Also see your booking for tomorrow. We will confirm that on our side shortly; see ya then!
Cheers,
Jake from Railway
a year ago
For some reason the DNS query is failing:
DNS [FAIL]: process-batch -> Error: queryAny ENOTFOUND process-batch
a year ago
Are you sure you are doing a lookup for AAAA? The IPs for services over the private network are IPv6 only.
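One related detail when requesting a replica directly by its address: an IPv6 literal has to be wrapped in square brackets inside a URL. A minimal sketch, where `replicaUrl` is a hypothetical helper and plain `http` plus the port and path values are assumptions for illustration:

```typescript
// Hypothetical helper for illustration: build a URL that targets one replica
// by its IPv6 address. IPv6 literals must be enclosed in [] inside a URL.
function replicaUrl(ipv6: string, port: number, path: string): string {
  const normalizedPath = path.startsWith("/") ? path : `/${path}`;
  return `http://[${ipv6}]:${port}${normalizedPath}`;
}
```

In Node, `resolve6` from `node:dns/promises` performs the AAAA lookup and returns the addresses as strings, which can then be passed to a helper like this one.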
a year ago
Are you sure you are doing a lookup for AAAA? The IPs for services over the private network are IPv6 only.
Yep, that was it, thank you!
For anyone who finds this thread later: I created an NPM package that modifies an axios instance to support this IP rotation:
https://www.npmjs.com/package/axios-dns-ip-rotation
I'll add docs later, but basically all you need to do is:
import axios from 'axios';
import AxiosRotateIP from 'axios-dns-ip-rotation';
AxiosRotateIP(axios);
BTW, do you happen to know why my services are limited to 8 vCPUs?
Attachments
a year ago
Wow that's definitely going above and beyond! I'll check out that module later!
BTW, do you happen to know why my services are limited to 8 vCPUs?
Are you sure your project containing those services is located within the Pro workspace?
a year ago
Wow that's definitely going above and beyond! I'll check out that module later!
BTW, do you happen to know why my services are limited to 8 vCPUs?
Are you sure your project containing those services is located within the Pro workspace?
It's in the Pro workspace. Other services in the same workspace have the right (32) limit.
a year ago
Where are you seeing this limit?
In the metrics of my service (as can be seen in the previous screenshot)
https://railway.app/project/f4832e31-a99b-4094-b4d4-35af53cfbebe/service/6dd5a5e0-359e-46f5-a725-4c919d31bcd0/metrics
a year ago
Could be a visual bug, but have you re-deployed the service since moving the project to the Pro workspace?
a year ago
Could be a visual bug, but have you re-deployed the service since moving the project to the Pro workspace?
The service+project were always Pro, I haven't touched anything regarding that since I deployed the project a few months ago.
I'll try redeploying the service.
a year ago
Great chatting today and great to hear you got the LB stuff sorted. Will close this out for now as we have the Slack channel
Status changed to Solved Railway • over 1 year ago




