Major slowdown with all non-cached requests pending

jclaveau
PRO

a year ago

Since around 3 pm (Paris time), our users have been encountering major slowdowns on our app. All requests to our API service remain in "pending" status for a while, and the page needs several minutes to load fully.

  • This is happening on a day when we have a lot of new users and existing users are starting to use our app intensively (as they are back to school)

  • My API service serves a Directus app with PRESSURE_LIMITER_ENABLED, but in our case we do not get any 503 errors like the ones the pressure limiter throws when overloaded: https://docs.directus.io/self-hosted/config-options.html#pressure-based-rate-limiter

  • The metrics don't seem to be saturated (see attachment)

  • I have no unusual errors in my logs

  • My other service Grafana works like a charm querying the same Postgres DB

  • I use the v2 runtime since last night on all my services

  • Our project is hosted in Amsterdam mostly for french users

  • We subscribed to a pro plan

It looks like there is some bottleneck / throttle somewhere on the network that I cannot access. So, after checking everything I could, I need your help.

Thanks in advance!!

Video of the user experience https://youtu.be/e8wVv_bSaXM

Project Id: 65aff0db-6586-4be0-8420-b2e67ae4378d


a year ago

hello, do you have any idea on how many RPS you may be seeing?
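For readers who need to answer this question themselves, here is a minimal sketch of estimating RPS from request timestamps (for example, pulled from access logs). The function name and the sample timestamps are hypothetical and are not taken from this thread.

```python
from collections import Counter
from datetime import datetime

def requests_per_second(timestamps):
    """Return (average_rps, peak_rps) for a list of ISO-8601 request timestamps."""
    if not timestamps:
        return 0.0, 0
    times = [datetime.fromisoformat(t) for t in timestamps]
    # Bucket requests by whole second to find the busiest second.
    per_second = Counter(t.replace(microsecond=0) for t in times)
    span = (max(times) - min(times)).total_seconds()
    # Avoid dividing by zero when every request falls in the same instant.
    average = len(times) / span if span > 0 else float(len(times))
    return average, max(per_second.values())

# Hypothetical sample data, for illustration only.
sample = [
    "2024-09-03T15:00:00.100",
    "2024-09-03T15:00:00.600",
    "2024-09-03T15:00:01.200",
    "2024-09-03T15:00:02.100",
]
avg, peak = requests_per_second(sample)
print(f"average: {avg:.2f} rps, peak: {peak} rps")
```

The peak matters as much as the average here, since short bursts can exhaust a service even when the average looks modest.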


jclaveau
PRO

a year ago

Presently not but I will dig


jclaveau
PRO

a year ago

I may have it soon


a year ago

perfect!


a year ago

backend is the directus service right?


jclaveau
PRO

a year ago

Yes


jclaveau
PRO

a year ago

I will have the request number soon


a year ago

perfect


jclaveau
PRO

a year ago

You can get them here https://api.hiphiphip.app/status


jclaveau
PRO

a year ago

As the service just restarted some users may have been logged out


jclaveau
PRO

a year ago

Btw, presently the experience is smooth


jclaveau
PRO

a year ago

(First time in 5 hours)


a year ago

the current RPS that is being reported would be lower than our RPS limit, so you aren't running into any kind of platform limitations at the moment.

keep an eye on this RPS number when / if you see issues again and feel free to ping me with that info


jclaveau
PRO

a year ago

(some requests still stay pending, but not as long as before the redeploys)


a year ago

at this time, id have to say this is an application level issue


a year ago

maybe you could try something like increase the postgres pool count?


jclaveau
PRO

a year ago

But Directus has no query limit by default, they shouldn't be in pending mode, right?


a year ago

im sure there are more factors at play here, can you help me to understand your infra more?


jclaveau
PRO

a year ago

Can you give me the RPS limit so I know whether it is at the app level or not, please?


jclaveau
PRO

a year ago

Sure, you don't have access to my project?


a year ago

i dont know if i can give out the current values for that, sorry, but you are currently well under the limit


a year ago

i do, but id like to understand how it all works together


a year ago

for example, im now seeing the rps for the api, but you said requests to directus are pending, not the api ?


jclaveau
PRO

a year ago

Yes I guess, I'm very surprised by this issue 🙂


jclaveau
PRO

a year ago

Request to api.hiphiphip.app are pending


jclaveau
PRO

a year ago

the backend service is private


a year ago

the api calls directus via the private network?


jclaveau
PRO

a year ago

  • bo.hiphiphip.app is a Directus instance with the admin enabled.

  • api.hiphiphip.app is the same Directus (cloned) without the admin panel enabled

  • www.hiphiphip.app calls api.hiphiphip.app (never the bo directly)

  • the 2 directus services, bo and api access Postgres and Redis through the private network only

  • there is also a grafana service using Postgres and Redis (via the private network) and a last service backing up Postgres to AWS at 5am


jclaveau
PRO

a year ago

[Attachment]


a year ago

are you absolutely positive you are doing all the communication that you can over the private network?


jclaveau
PRO

a year ago

yes


jclaveau
PRO

a year ago

it cost us too much 🙂


jclaveau
PRO

a year ago

I changed all this in the past week


a year ago

haha yeah that can happen


jclaveau
PRO

a year ago

[Attachment]


jclaveau
PRO

a year ago

As you can see, the last 3 remaining services having egress are Frontend (green), api (yellow) and grafana (red which had metrics published on our blog until today)


jclaveau
PRO

a year ago

Postgres is violet (with the backups every day) and redis is red


jclaveau
PRO

a year ago

If you look closely you can see the backup being sent to AWS every night in blue


a year ago

gotcha, thank you for the rundown


a year ago

is there anywhere i could go to see these pending requests?


jclaveau
PRO

a year ago

I'll create you a demo account


a year ago

thanks!


jclaveau
PRO

a year ago

The account is being created but the app is quite slow again… :/


a year ago

not seeing anything that would indicate an issue on our side of things, perhaps you could give the api more replicas?


a year ago

start with 3


jclaveau
PRO

a year ago


a year ago

we do not lol, but kudos for trying 😆


jclaveau
PRO

a year ago

Thank you, my brain is totally out of use presently


a year ago

can you go ahead and add 3 replicas to the api?


a year ago

if one of your api services is not able to handle your volume of traffic, 3 might be able to
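The reasoning behind adding replicas can be sketched with some back-of-the-envelope arithmetic. This is a simplified model with hypothetical numbers (not measurements from this project): it assumes each replica completes a fixed number of requests per second and that traffic is spread evenly.

```python
def backlog_after(seconds, arrival_rps, per_replica_rps, replicas):
    """Requests still pending after `seconds` of steady traffic.

    Simplified model: each replica completes at most `per_replica_rps`
    requests per second and the load is spread evenly across replicas.
    """
    capacity = per_replica_rps * replicas
    # A backlog only builds when requests arrive faster than they complete.
    return max(0, (arrival_rps - capacity) * seconds)

# Hypothetical numbers, not measurements from this thread.
for replicas in (1, 2, 3):
    pending = backlog_after(seconds=60, arrival_rps=50, per_replica_rps=30, replicas=replicas)
    print(f"{replicas} replica(s): {pending} requests pending after 60 s")
```

In this model a single replica falls further behind every second, while two or more replicas keep up, which would match requests sitting in "pending" until capacity is added.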


jclaveau
PRO

a year ago

also can i disable the legacy proxy?


a year ago

you would want that off, yes


a year ago

off on everything, the new proxy is far superior


a year ago

at the same time go ahead and add those 3 replicas


jclaveau
PRO

a year ago

it's deploying


a year ago

2 is close enough to 3 haha


jclaveau
PRO

a year ago

^^


a year ago

can you disable the legacy proxy on your other services too please


jclaveau
PRO

a year ago

hmm, railway seems to be buggy: it doesn't offer to deploy when i disable the legacy proxy


a year ago

thats normal, that change is not a part of the staged changes


jclaveau
PRO

a year ago

ok so we're good


jclaveau
PRO

a year ago

is there a way to display the equivalent of https://api.hiphiphip.app/status but for each replica?


a year ago

nope, you'd only ever see the page for one replica since incoming requests are round robin
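As an illustration of round-robin distribution in general (not Railway's actual proxy code), the sketch below rotates requests across replicas. The replica names are hypothetical.

```python
from collections import Counter
from itertools import cycle

def assign_round_robin(request_ids, replicas):
    """Map each request to a replica, rotating through the replicas in order."""
    rotation = cycle(replicas)
    return {request_id: next(rotation) for request_id in request_ids}

# Hypothetical replica names, for illustration only.
replicas = ["api-1", "api-2", "api-3"]
assignment = assign_round_robin(range(7), replicas)
print(assignment)
print(Counter(assignment.values()))
```

Under this scheme each request to a page such as /status is answered by whichever replica is next in the rotation, so a single response only ever reflects one replica.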


jclaveau
PRO

a year ago

Yes! The service is perfect now!


jclaveau
PRO

a year ago

understood


a year ago

okay cool so it seems directus was just a little stressed out is all


jclaveau
PRO

a year ago

Yeah, thank you very much!


jclaveau
PRO

a year ago

The odd point is that it doesn't return the expected 503
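One possible explanation, offered as an assumption rather than a diagnosis: a pressure-based limiter reacts to process-level signals such as event loop utilization, event loop delay and memory use. If requests are slow because they wait on something external, those signals might stay under their thresholds and no 503 would be sent. The sketch below is a hypothetical model of that decision, not Directus's implementation, and the thresholds are illustrative values only.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    event_loop_utilization: float  # fraction of time the loop was busy, 0..1
    event_loop_delay_ms: float
    heap_used_mb: float

@dataclass
class Limits:
    max_event_loop_utilization: float
    max_event_loop_delay_ms: float
    max_heap_used_mb: float

def status_for(sample, limits):
    """Return 503 if any sampled metric exceeds its limit, else 200."""
    over = (
        sample.event_loop_utilization > limits.max_event_loop_utilization
        or sample.event_loop_delay_ms > limits.max_event_loop_delay_ms
        or sample.heap_used_mb > limits.max_heap_used_mb
    )
    return 503 if over else 200

limits = Limits(0.99, 500, 1024)        # illustrative thresholds (assumption)
waiting_on_io = Sample(0.20, 5, 300)    # mostly idle process waiting on I/O
cpu_bound = Sample(1.00, 900, 300)      # saturated event loop
print(status_for(waiting_on_io, limits))
print(status_for(cpu_bound, limits))
```

In this model the process that is merely waiting never trips the limiter, even though its clients see long pending requests.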


a year ago

if you gain more userbase, you can always add another replica!


jclaveau
PRO

a year ago

We'll probably need to


jclaveau
PRO

a year ago

I didn't expect that to come so early 🙂


a year ago

happy it was an easy fix!


jclaveau
PRO

a year ago

yes!


jclaveau
PRO

a year ago

I full a little dumb actually


jclaveau
PRO

a year ago

*feel


a year ago

nah dont worry about it, it took me until now to suggest it too lol


jclaveau
PRO

a year ago

hahaha


jclaveau
PRO

a year ago

Let's stop consuming your time, or we risk having to pay some other fees :p


jclaveau
PRO

a year ago

Thank you very much for your fast and clever support


a year ago

happy to help! i wish you all the best with your service and its growth!


jclaveau
PRO

a year ago

:salute:


jclaveau
PRO

a year ago

Just a question: is there an autoscaling feature in the pipeline? Scaling with the time of day could save resources and money.


a year ago

we do not have any immediate plans for auto h-scaling


jclaveau
PRO

a year ago

Ok, thank you!