Load balancing issue - Railway Help Station

Load balancing issue

jerryvdp

PROOP

2 years ago

I have a service with 40+ replicas, and another service (within the private network) fires requests at that service at around 30-80 rps. Sometimes all requests get handled by a single replica, sometimes by just a few (2-3 replicas). This is very bad for us. Is there anything we can do at the app level to "force" or improve the load balancing so it spreads across more replicas?

Both the source and destination services are using the v2 runtime, if that changes anything.

Solved

25 Replies

2 years ago

Can you try disabling "Enable new proxy" inside your service and see if that resolves it?

2 years ago

(I had already typed a lot when Jake's message came through but I still think the following message is helpful)

How replicas are handled over the private network is very different from how they are handled over the public network: over the private network, you have to handle the load balancing yourself.

When you access a service with replicas over the public network Railway's proxy does round-robin load balancing on the incoming requests.

But there is no proxy or anything like that in the private network, so replicas are conveyed as multiple DNS results, as shown here (the hello world service has 5 replicas).

So your app that calls the service with the replicas has to keep a dynamically updated record of the DNS results and round-robin between the multiple returned addresses to spread the load across all the replicas. I show that being done with Caddy here, but it can be done in code as well.
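
As a rough illustration of the "in code" option, here is a minimal Node.js sketch, not an official Railway client. The hostname `hello-world.railway.internal` and the helper names `createRoundRobin` and `nextReplica` are assumptions made up for this example. The list is sorted on every update so the rotation stays stable even if the DNS server returns the records in a different order each time.

```javascript
const dns = require('node:dns').promises;

// Hypothetical hostname for illustration; substitute your own service's private DNS name.
const HOST = 'hello-world.railway.internal';

// Build a picker that cycles through whatever address list it was last given.
function createRoundRobin() {
  let addresses = [];
  let index = 0;
  return {
    // Replace the address list, e.g. after a fresh DNS lookup.
    update(next) {
      addresses = [...next].sort();
      index = addresses.length > 0 ? index % addresses.length : 0;
    },
    // Return the next address in rotation, or undefined if the list is empty.
    next() {
      if (addresses.length === 0) return undefined;
      const address = addresses[index];
      index = (index + 1) % addresses.length;
      return address;
    },
  };
}

const replicas = createRoundRobin();

// Re-resolve the AAAA records so replicas that come and go are picked up,
// then hand back the next replica address in rotation.
async function nextReplica() {
  replicas.update(await dns.resolve6(HOST));
  const address = replicas.next();
  if (!address) throw new Error(`no AAAA records for ${HOST}`);
  return address;
}
```

A real client would probably cache the lookup for a few seconds instead of resolving on every request; this sketch resolves each time only to keep the example short.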

2 years ago

I would also firmly recommend you cut down the number of replicas you have. What you're currently running can run on 2-3 replicas.

jerryvdp

PROOP

2 years ago

I would also firmly recommend you cut down the number of replicas you have. What you're currently running can run on 2-3 replicas.

That’s no good for us, we use those services as “workers” and many of the jobs end up waiting in the queue due to the “lack” of replicas.

I don’t think I understand your DNS answer.

When I set replicas=50 I, as a user, expect the requests to be spread across all replicas.

Btw can we contact sales and request a higher limit of replicas? (Assuming we solve the “sticky” replicas issue discussed here)

2 years ago

I thought I did a very good job at clearly explaining the caveat with replicas over the private network, and the solution.

Railway no longer has control over how the requests are sent out to the service with replicas over the private network. That's something your app needs to handle by load balancing dynamically across the multiple IPs returned by the DNS query.

2 years ago

That’s no good for us, we use those services as “workers” and many of the jobs end up waiting in the queue due to the “lack” of replicas.

Could you not bump the number of jobs that can be sent to those instances? Are they only capable of doing one job each?

Btw can we contact sales and request a higher limit of replicas? (Assuming we solve the “sticky” replicas issue discussed here)

Absolutely. Wanna grab some time at https://cal.com/team/railway/demonew and we can chat?

jerryvdp

PROOP

2 years ago

I thought I did a very good job at clearly explaining the caveat with replicas over the private network, and the solution.
Railway no longer has control over how the requests are sent out to the service with replicas over the private network. That's something your app needs to handle by load balancing dynamically across the multiple IPs returned by the DNS query.

After reading your message a few times and seeing the example I think I got it - so basically a single DNS zone might point to multiple IPs.

This is a long-shot question, and you’re not responsible for this, but do you happen to know if axios has any built in mechanism to handle that?

jerryvdp

PROOP

2 years ago

That’s no good for us, we use those services as “workers” and many of the jobs end up waiting in the queue due to the “lack” of replicas.
Could you not bump the number of jobs that can be sent to those instances? Are they only capable of doing one job each?
Btw can we contact sales and request a higher limit of replicas? (Assuming we solve the “sticky” replicas issue discussed here)
Absolutely. Wanna grab some time at https://cal.com/team/railway/demonew and we can chat?

We need as much parallelism as possible.

And thank you for the calendar link, we’ll definitely use that once we’re ready to bump the amount of replicas.

Side note: I’m shocked by the super impressive support you guys provide. Thank you so much, this service is worth every penny.

2 years ago

so basically a single DNS zone might point to multiple IPs

A single DNS query will return multiple IPs, as many as you have replicas.

do you happen to know if axios has any built in mechanism to handle that?

I've only done this with Caddy, as I showed in my example, but I'm sure you can do something in Node with either axios or something that sits in front of axios. Surely someone has written a module that does what you want. Alternatively, you can stick Caddy in front of the service with replicas and make your requests through Caddy.

2 years ago

We need as much parallelism as possible.
And thank you for the calendar link, we’ll definitely use that once we’re ready to bump the amount of replicas.

For sure, but you have 32 vCPU per replica + memory. There's definitely parallelism that's possible.

Regardless, grab some time and let's chat about it live, maybe tomorrow or Friday.

Side note: I’m shocked by the super impressive support you guys provide. Thank you so much, this service is worth every penny.

We try! It's hard but we try. Glad to hear you think we're doing a good job and we will keep it up!

jerryvdp

PROOP

2 years ago

We need as much parallelism as possible.
And thank you for the calendar link, we’ll definitely use that once we’re ready to bump the amount of replicas.
For sure, but you have 32 vCPU per replica + memory. There's definitely parallelism that's possible.

Yep, we are using every vCPU and every byte of available RAM.

We are running very resource-intensive OCR jobs.

2 years ago

This job right? https://railway.app/project/f4832e31-a99b-4094-b4d4-35af53cfbebe/service/6dd5a5e0-359e-46f5-a725-4c919d31bcd0/metrics

I see it's using like 2GB and hard max 8 vCPU? Like, mostly flat here

With 50 replicas at 32vCPU, you have like, 1600 vCPU and 1.6TB of RAM headroom. Arguably we shouldn't even allow you to allocate that, but hey, we try to be super amenable :laugh:

Let me know if I'm missing something.

jerryvdp

PROOP

2 years ago

This job right? https://railway.app/project/f4832e31-a99b-4094-b4d4-35af53cfbebe/service/6dd5a5e0-359e-46f5-a725-4c919d31bcd0/metrics
I see it's using like 2GB and hard max 8 vCPU? Like, mostly flat here
With 50 replicas at 32vCPU, you have like, 1600 vCPU and 1.6TB of RAM headroom. Arguably we shouldn't even allow you to allocate that, but hey, we try to be super amenable :laugh:
Let me know if I'm missing something.

You’re spot on, but we haven’t run intensive jobs recently. Try filtering by the last 7 days (or just take a look at the attached screenshot).

2 years ago

Gotcha. That's still about 3 replicas' worth.

I think Brody kinda mentioned it before, but if you want to be able to do intelligent routing, you can just query DNS and then send each request to one of the returned IPs, picked at random.
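
A minimal Node.js sketch of that random-pick idea follows. The helper names `pickRandom` and `randomReplica` are made up for this example, and the hostname you pass in is whatever private DNS name your service has. The random source is a parameter only so the selection logic can be tested deterministically.

```javascript
const dns = require('node:dns').promises;

// Pick one entry uniformly at random; `random` is injectable to make testing easy.
function pickRandom(addresses, random = Math.random) {
  if (addresses.length === 0) throw new Error('no addresses to choose from');
  return addresses[Math.floor(random() * addresses.length)];
}

// Hypothetical helper: resolve the service's AAAA records and choose one replica.
async function randomReplica(hostname) {
  const addresses = await dns.resolve6(hostname);
  return pickRandom(addresses);
}
```

Random selection does not guarantee a perfectly even spread over a handful of requests, but it needs no shared state between callers, and at 30-80 rps it should even out quickly.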

We're talking internally about doing this automatically for people, but there's no timeline on that. Also, I see your booking for tomorrow. We will confirm that on our side shortly; see ya then!

Cheers,

Jake from Railway

jerryvdp

PROOP

2 years ago

For some reason the DNS query is failing:

DNS [FAIL]: process-batch -> Error: queryAny ENOTFOUND process-batch

2 years ago

Are you sure you are doing a lookup for AAAA records? The IPs for services over the private network are IPv6 only.
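
For reference, a minimal Node.js sketch of an AAAA-only lookup, assuming the failing call was an ANY query (as the `queryAny` in the error suggests). The helper names `resolveReplicas` and `toUrl` are made up for this example. `dns.promises.resolve6` asks for AAAA records specifically, and the second helper shows the square brackets an IPv6 literal needs inside a URL.

```javascript
const dns = require('node:dns').promises;

// Ask specifically for AAAA (IPv6) records rather than ANY or A records.
async function resolveReplicas(hostname) {
  try {
    return await dns.resolve6(hostname);
  } catch (err) {
    // ENOTFOUND / ENODATA: the name has no AAAA records visible from here.
    if (err.code === 'ENOTFOUND' || err.code === 'ENODATA') return [];
    throw err;
  }
}

// IPv6 literals need square brackets when used as the host part of a URL.
function toUrl(address, port, path = '/') {
  return `http://[${address}]:${port}${path}`;
}
```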

jerryvdp

PROOP

2 years ago

Are you sure you are doing a lookup for AAAA records? The IPs for services over the private network are IPv6 only.

Yep, that was it, thank you!

For future searches - I created an NPM package that modifies an axios instance to support this IP rotation:

https://www.npmjs.com/package/axios-dns-ip-rotation

I'll add docs later, but basically all you need to do is:

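import axios from 'axios';
import AxiosRotateIP from 'axios-dns-ip-rotation';

// Patch the axios instance so its requests rotate across the resolved IPs.
AxiosRotateIP(axios);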

BTW, do you happen to know why my services are limited to 8 vCPUs?

2 years ago

Wow that's definitely going above and beyond! I'll check out that module later!

BTW, do you happen to know why my services are limited to 8 vCPUs?

Are you sure your project containing those services is located within the Pro workspace?

jerryvdp

PROOP

2 years ago

Wow that's definitely going above and beyond! I'll check out that module later!
BTW, do you happen to know why my services are limited to 8 vCPUs?
Are you sure your project containing those services is located within the Pro workspace?

It's in the Pro workspace. Other services in the same workspace have the right (32) limit.

2 years ago

Where are you seeing this limit?

jerryvdp

PROOP

2 years ago

Where are you seeing this limit?

In the metrics of my service (as can be seen in the previous screenshot)

https://railway.app/project/f4832e31-a99b-4094-b4d4-35af53cfbebe/service/6dd5a5e0-359e-46f5-a725-4c919d31bcd0/metrics

2 years ago

Could be a visual bug, but have you re-deployed the service since moving the project to the Pro workspace?

jerryvdp

PROOP

2 years ago

Could be a visual bug, but have you re-deployed the service since moving the project to the Pro workspace?

The service+project were always Pro, I haven't touched anything regarding that since I deployed the project a few months ago.

I'll try redeploying the service.

2 years ago

Then I may have to defer to Jake here.

2 years ago

Great chatting today and great to hear you got the LB stuff sorted. Will close this out for now as we have the Slack channel

Status changed to Solved Railway • over 1 year ago