NestJS backend doesn't scale

kipitup
PRO

5 months ago

Our app in production doesn't scale. As soon as we have a few hundred users online, the API hosted on Railway becomes extremely slow (5s+ average response time).

We decided to dig into this and run some load tests on our staging environment to find the issue.

The stack is the following:

  • NestJS 10 with Express underneath

  • Prisma 5

  • Postgres 14 database hosted on neon.tech

We tried the following:

  • Basic user sign-in plus some queries at ~300 TPS: after 5-10 seconds the average response time rose to 5s and peaked at 20s.

  • A single query on a public "maintenance" endpoint (no auth required), doing one DB call on a table with only one row in it. Same latency issue and average response time.

  • We then created a public endpoint with no logic or database query, simply returning "YEP". Same latency issue.

Here is the gist of the version where we removed every module (auth, middleware) except the Prisma module and the common module with the two endpoints we used: https://gist.github.com/Kipitup/7be6623ff8c3cc9828f44d778a716b97
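For context, the no-logic endpoint has roughly this shape; this is a simplified sketch, the real code is in the gist above:

```typescript
import { Controller, Get } from "@nestjs/common";

// Simplified sketch of the no-logic endpoint from the gist:
// no auth, no DB, it just returns a constant string.
@Controller("common")
export class CommonController {
  @Get("maintenance-time")
  getMaintenanceTime(): string {
    return "YEP";
  }
}
```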

We ran the load tests with the Cloudflare proxy disabled, so no rate limiting comes from there. We actually don't know whether we can hit the Railway URL directly, or whether there are rate limits on it (other than Railway's 3k limit).

At this stage we're a bit lost. Here is what we're going to try:

  • Deploy the lite version of our API on an EC2 instance on AWS to see whether Railway is the bottleneck.

  • Deploy a basic boilerplate of a simple Express backend on Railway and run load tests to assess the real limits of Railway and Express.

Have you seen similar issues in the past? What would you recommend we try?

Solved

21 Replies

5 months ago

Heya,

Could you share an example query to run to check for latency issues? I can hit https://staging-api-v2.fantasy.top/yep in 75-250ms from Belgium (over Wi-Fi).

If you're hosting the database outside of Railway, some latency is to be expected, but obviously for a no-DB query that wouldn't make sense.

If you're under heavy load you may also consider replicas, though at your scale I don't expect them to be needed.

Best,

Nico


Status changed to Awaiting User Response railway[bot] 5 months ago


kipitup
PRO

5 months ago

Hello Nico, thanks for your quick answer.

You can try the following endpoint:

curl "https://staging-api-v2.fantasy.top/common/maintenance-time"
No DB call; a simple endpoint that returns "YEP", see here: https://gist.github.com/Kipitup/7be6623ff8c3cc9828f44d778a716b97#file-common-controller-ts-L73

curl "https://staging-api-v2.fantasy.top/common/maintenance"
One DB call, very simple: https://gist.github.com/Kipitup/7be6623ff8c3cc9828f44d778a716b97#file-common-controller-ts-L65

We created and deployed a basic Express.js template on Railway (https://github.com/railwayapp-templates/expressjs/tree/main). I tried the load test with the following configuration via artillery.io, running it locally:

# 1) Warm up for 2 minutes: 10 RPS → ramp to 50 RPS
# 2) Continue ramp for 3 minutes: 50 RPS → ramp to 150 RPS
# 3) Peak load for 5 minutes: 150 RPS → ramp to 250 RPS
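For reference, those phases translate to roughly this artillery config (a sketch; the target URL is a placeholder for the deployed template's Railway URL):

```yaml
config:
  target: "https://<express-template>.up.railway.app"  # placeholder URL
  phases:
    - duration: 120      # 1) warm up: 10 -> 50 RPS over 2 minutes
      arrivalRate: 10
      rampTo: 50
    - duration: 180      # 2) ramp: 50 -> 150 RPS over 3 minutes
      arrivalRate: 50
      rampTo: 150
    - duration: 300      # 3) peak: 150 -> 250 RPS over 5 minutes
      arrivalRate: 150
      rampTo: 250
scenarios:
  - flow:
      - get:
          url: "/"       # the template's hello-world endpoint
```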

The test completely failed during the second phase. See my screenshot for more details.

We did some load testing with the Express.js app running locally and it had no issue sustaining the 250 RPS. This indicates that Railway could be at the very core of our issue. I understand that autoscaling is not trivial on your side, but crashing the server outright is surely not expected behavior.
I saw in the docs that Railway has a limit of 3,000 RPS, so I don't see why we can't do 100 RPS on the most basic Express server that simply returns "Hello World" with no logic or DB access.

We deployed the simple version of our API (see the gist for details) on ECS Fargate on AWS with 2 vCPUs and 4 GiB of RAM, and we were able to sustain up to 500 RPS with a response time below 60 ms and a high success rate, indicating once again that our issue is somehow coming from Railway.

Attachments


Status changed to Awaiting Railway Response railway[bot] 5 months ago


5 months ago

Heya,

I'm not seeing any issues hitting either endpoint in ~75ms from Belgium, peaking at ~450ms during a small load test (100 RPS).

As for real-world load, it looks like you're connecting to both Redis and Postgres over the public network. Swapping those to the private network would greatly improve performance.
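In practice that usually just means swapping the hostnames in your connection URLs to the private-network ones. A sketch (the service names are placeholders; the real hostnames are in your service's variables):

```typescript
import { PrismaClient } from "@prisma/client";
import Redis from "ioredis";

// Sketch: point the clients at the private-network hostnames instead of the
// public proxy. The hostnames below are placeholders for your service names:
// DATABASE_URL=postgresql://user:pass@postgres.railway.internal:5432/railway
// REDIS_URL=redis://default:pass@redis.railway.internal:6379
const prisma = new PrismaClient({
  datasources: { db: { url: process.env.DATABASE_URL } },
});
const redis = new Redis(process.env.REDIS_URL!);
```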

Best,

Nico


Status changed to Awaiting User Response railway[bot] 5 months ago


0xmikado
PRO

5 months ago

Hey Nico,

Thanks for testing this out. Under 100 RPS everything is indeed fine, but issues arise soon after and quickly shut down the API. For example, I'm running the same setup kipit sent in his gist: https://gist.github.com/Kipitup/7be6623ff8c3cc9828f44d778a716b97#file-common-controller-ts-L73

The first screenshot shows the result of the artillery test on the Railway instance, calling the maintenance-time endpoint (no DB call, no Redis):

  • A first warm-up load at 50 rps, with a constant 120ms response time.

  • Then we go from 50 to 450 rps, and as soon as we cross 150 rps, we have a complete shutdown of the API, where the response time ends up at 0 because we can't reach the instance anymore. Even after the test, we can't reach it, and we need to redeploy to get access to it again.

The second screenshot shows the same API code running locally on a recent MacBook Pro with 24 GB of RAM. We call the same endpoint, returning just the word 'Yep', with no DB or Redis call. In this setup the response time starts to increase at 500 rps, but it's only at 750 rps that errors start arriving (bottom orange-ish curve). Response time peaks at 450 ms, then rps gradually comes down, as does the response time.

So what we're seeing in terms of scale seems normal for a local setup with no load balancing, but the Railway behavior is very odd.

We're trying to run the exact same API instance on an AWS EC2 instance, and there we start to see issues at around 400-500 rps, with no load balancing and on a much smaller instance than what we have set up on Railway.

Also, for information, we aim to sustain a minimum of 1k rps, and ideally 2-3k rps, for the product release we have in mind. Using Railway, we have trouble seeing how we could get there. We can add more instances, we also have Cloudflare caching and internal Redis caching, and we can switch our endpoints to the internal network instead of the public network; all of that will help, but not being able to reach 1k rps on a basic API endpoint, with no network call, on an instance scaled to its max, is troubling.

Maybe we're doing something wrong?

Best,

Mikado


Status changed to Awaiting Railway Response railway[bot] 5 months ago


0xmikado
PRO

5 months ago

Also, I've candidly tried adding more instances of the current simple NestJS backend deployed on Railway. I added 3 more instances in the same region (4 total), but it doesn't seem to change anything, and it still freezes at 160 rps. The docs say load balancing is handled automatically; should I change something in my setup?

Attachments


5 months ago

Can you share the instance type you are using on AWS? Also, what version of Node are you using?


Status changed to Awaiting User Response railway[bot] 5 months ago


5 months ago

There are a few things I can think of that would directly impact perf, one of them being... well, the machine type we have on GCP. I am curious whether you'd also be able to get the throughput you want if you tried a metal location. The only thing here that is a bit suspicious is that no matter how many replicas you have, it hangs, which would imply another issue.

But if you can give me that data, it will help me diagnose your performance issues from there.


kipitup
PRO

5 months ago

To deploy on Railway we use Node.js 20, running on Alpine Linux 3.19 as the base image.

We are now using Elastic Container Service with AWS Fargate. We haven't tried a metal location, but we did some tests locally on our Mac, and it handled the load much better (see 0xmikado's second screenshot above).


Status changed to Awaiting Railway Response railway[bot] 5 months ago


0xmikado
PRO

5 months ago

Hey Angelo, thanks for answering. I wanted to share some more tests we've done. When I talk about the NestJS setup, I mean a simple NestJS API with a 'Hello world' endpoint, so no network/DB/Redis call. When I talk about the Express setup, I mean a deployed instance of your template (railwayapp-templates/expressjs), also with a simple hello world endpoint.
Everything is deployed and tested in the Fantasy Api - staging environment.

  • NestJS, single instance in EU, no metal:

  • NestJS, 4 instances in the same region (EU), no metal:

  • NestJS, 4 instances in the same region (EU), no metal, cluster enabled in the API

  • NestJS, 2 instances in EU, 2 instances metal in US

  • NestJS, 1 instance metal in US

  • Express, 1 instance in EU, no metal

  • Express, 4 instances in EU, no metal

  • Express, 2 instances in EU, 2 instances metal in US

Whichever test we try, we get the same result: a constant response time that suddenly drops to 0 because the API can't be reached anymore once we get to ~100 RPS, sometimes 150. Given the basic setup we have and all the different tests we've tried, the only logical conclusion I see right now is that there is an issue in our Railway environment, one I'm not sure we're controlling.

Under normal high load, the response time should start to climb before errors appear, and once the load is gone the endpoint should return to normal. Here we have a complete shutdown of the API (at only 100 rps), and once that happens we need to manually restart it for it to work again; otherwise it stays down/unreachable.

Adding instances, metal or not, doesn't solve anything, and adding clustering to our API doesn't change a thing either (see the sketch just below for what we mean by clustering).
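By clustering I mean the standard Node cluster wrapper around the NestJS bootstrap, roughly like this (a simplified sketch of what we enabled, not our exact code):

```typescript
import cluster from "node:cluster";
import os from "node:os";
import { NestFactory } from "@nestjs/core";
import { AppModule } from "./app.module";

// Simplified sketch: fork one worker per CPU so requests are spread across
// processes instead of a single event loop. Workers share the listen socket.
async function bootstrap() {
  const app = await NestFactory.create(AppModule);
  await app.listen(process.env.PORT ?? 3000);
}

if (cluster.isPrimary) {
  for (let i = 0; i < os.cpus().length; i++) cluster.fork();
  cluster.on("exit", () => cluster.fork()); // replace a crashed worker
} else {
  bootstrap();
}
```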

As kipit said above, to deploy on Railway we use Node.js 20, running on Alpine Linux 3.19 as the base image.

We are now using Elastic Container Service with AWS Fargate. With that setup we can reach 500 rps without a response-time spike on machines with 4 GB of RAM and 4 vCPUs. Adding load balancing allowed us to scale to 3k rps easily, with no spike in response time once stabilized.

Therefore, we are in the process of transitioning our infra from Railway to AWS. If we were to find a solution for scaling on Railway we'd be happy to stay, but we are under time pressure and don't have the luxury of waiting too long. We would need to figure out why Railway shuts down at 100 rps, and why neither adding instances nor clustering helps. It's fine if Railway uses less powerful machines than what we get on AWS, but there is a clear scaling issue here: regardless of the type of service (NestJS or Express), we hit the same limit.

I can stay available for a call or to answer questions, but I would appreciate a deeper analysis on your end as to why this is happening.

Best,

Nicolas


5 months ago

Hello,

First off, thank you for all the test data and for covering multiple scenarios and frameworks. This is very, very strange. I have some ideas about what could be going on, but before jumping to any conclusions publicly, could you please share your full artillery.io configurations so we can attempt to reproduce this on our side?

And fwiw we do have people sustaining much more than 100 rps, so something is definitely amiss here.

Best,

Brody


Status changed to Awaiting User Response railway[bot] 5 months ago


5 months ago

Had to jump, but I'm here if you are willing to share your Artillery config.


Status changed to Awaiting User Response railway[bot] 5 months ago


kipitup
PRO

5 months ago

[Reposting here for Mikado, as it seems he's not allowed to post anymore. He can see the conversation but his replies are not going through.]

Hey, sorry for missing your Google Meet link, I didn't catch the notification.

Attached are the two artillery setups we use; both work locally and also on Lambda.

maintenance.yml calls the NestJS setup with a hello-world-type endpoint (named maintenance-time), while simple-express calls the basic hello world endpoint.

https://gist.github.com/0xMikado/da8817a53110f85779123c1e6386fdf0

By the way, those artillery scripts run fine against the same NestJS or Express setup when it runs locally rather than on Railway, and the same goes for the NestJS setup on AWS.

I'm available for the next 3h, and you should be able to book a time here: https://calendly.com/0xmikado/30min otherwise we can schedule this for later.

Thanks for being responsive!


Status changed to Awaiting Railway Response railway[bot] 5 months ago


5 months ago

I am free in 30 minutes, but I don't see a slot on there :\


Status changed to Awaiting User Response railway[bot] 5 months ago


Status changed to Awaiting Railway Response railway[bot] 5 months ago


kipitup
PRO

5 months ago

Could you allow my team member to see and answer in this thread, please?
Thank you.


5 months ago

Bad time to have lunch; requesting to join.


Status changed to Awaiting User Response railway[bot] 5 months ago


5 months ago

Also, since you raised the thread, there isn't an easy way to get shared visibility. The best way we can do this is via Slack Connect.

Attachments


5 months ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open angelo 5 months ago


5 months ago

Hey, quick update: I've run the artillery test from another container in the same project, set up to call the internal domain, and the entire test passed with flying colors!

So I am fully confident that with distributed organic traffic, a few replicas, and setting up the API to connect to Postgres and Redis via the private network, you will have no problem serving the traffic you expect to get.
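For reference, the only change for that run was the target: pointing the same artillery scenarios at the service's private-network domain from a container inside the same project. Roughly (the service name and port are placeholders):

```yaml
config:
  # Private-network domain of the API service; name and port are placeholders.
  target: "http://fantasy-api.railway.internal:3000"
```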


0xmikado
PRO

5 months ago

Hey Angelo and Brody,

I confirm that load testing from multiple IPs with Lambda avoids the WAF and lets us fully test multiple instances as well as clustering. We're reaching 2.5k successful RPS with this setup.

However, we're facing a new issue with the deployment. This morning I saw that metal EU was available as a region; I enabled it and the deployment was successful. The option has since been removed, but the EU metal instance still seems to be selected as a region to deploy onto. So right now, even though only 1 instance (EU, no metal) is selected, the frontend says 2 instances (europe-west4, europe-west4-drams3a). It feels like we're still trying to deploy on the EU metal instance even though it's not available anymore. I tried redeploying and changing the instance count, but that doesn't fix it. Any idea how to resolve this?


5 months ago

Hello!

That's awesome news! Very happy to hear that!

And yep, the option for EU metal accidentally made its way into production and we quickly removed it. I've just reverted your region change back to EU GCP.

This does bring up the topic of having services on and off metal: since you have databases, and services with volumes can't be on metal yet, I recommend keeping everything off metal until metal supports volumes; then you can move everything to metal.


0xmikado
PRO

5 months ago

Ok awesome, thanks!

Looking forward to metal supporting volumes; if that gives even more scale, it sounds exciting!

Thanks again for the support, very much appreciated.


5 months ago

We are happy to help!

If you have any more questions, feel free to reach out to us in your Slack channel, or here, whatever works best for you!

Best,
Brody


Status changed to Solved brody 5 months ago