p95 at 30s
angus-lau
PRO · OP

22 days ago

We are randomly getting spikes where our P95 sits at around 30 seconds, but there is no information on why that is happening. Our CPU usage also drops to 0 periodically, but I'm assuming that's because it's serverless and idling. We've tried optimizing all of our endpoints, but we're still getting random P95 spikes.

$30 Bounty

8 Replies

Are the P95 spikes happening randomly during active traffic, or only after periods of low or no traffic?
If Serverless is enabled, that's your P95 spike. The CPU dropping to 0 is the service sleeping, and the 30s spike may be the cold start waking it up.

https://docs.railway.com/deployments/serverless


angus-lau
PRO · OP

19 days ago

Serverless is not enabled. The spikes are happening randomly. We currently have low traffic because we haven't released the app yet. I've attached a picture of the metrics for some more context. Thanks.


This helps a lot.
Most requests are fast, but a small number are extremely slow. You also said traffic is low, right? With so few requests, one slow request can dominate p95. The 4xx bar is also huge compared to the others, which may be client errors or endpoints being called incorrectly during testing. CPU dropping to 0 here just means no compute is happening. Can you check the HTTP logs during this period for paths and errors?
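To make the "one slow request can dominate p95" point concrete, here is a minimal sketch using the nearest-rank percentile definition. How the platform actually computes P95 is not stated in this thread, so the method, the window sizes, and the latency values below are all assumptions for illustration.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # 1-based rank; multiply before dividing to keep the arithmetic exact
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical low-traffic window: 9 fast requests and 1 that hung for 30 s.
low_traffic = [0.12] * 9 + [30.0]

# Hypothetical busier window: the same single 30 s request among 100.
busy = [0.12] * 99 + [30.0]

if __name__ == "__main__":
    print("low traffic p50:", percentile(low_traffic, 50))
    print("low traffic p95:", percentile(low_traffic, 95))
    print("busy p95:", percentile(busy, 95))
```

With only 10 requests in the window, the single 30 s request is the P95, while the median stays fast. With 100 requests, the same outlier no longer reaches P95.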


dharmateja

This helps a lot. Most requests are fast, but a small number are extremely slow. You also said traffic is low, right? With so few requests, one slow request can dominate p95. The 4xx bar is also huge compared to the others, which may be client errors or endpoints being called incorrectly during testing. CPU dropping to 0 here just means no compute is happening. Can you check the HTTP logs during this period for paths and errors?

angus-lau
PRO · OP

19 days ago

Yes, low traffic. This is our V2 production environment that we're testing right now, and we have 6 active users, all of them developers. We do pull information on-chain, or from relay or Helius, which can sometimes take 5-8s. The issue with one slow request dominating p95 is that once p95 spikes, the entire app freezes and we can't even load it. I'm thinking it has something to do with the CORS middleware or something else that runs before all our endpoints.

If we were to call endpoints incorrectly during testing, I assume the FE team would tell me that they're not getting any information right? Is no compute happening a bad thing?

We had a spike from 6:30 to 6:42, and this endpoint was getting spammed during it, but it gets spammed even during low-latency periods. This is also why our 4xx count is high.
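Since the upstream lookups mentioned above can take 5-8s, one way to keep a slow dependency from holding a request open is to put a hard timeout around each upstream call. The sketch below uses Python's `asyncio.wait_for`. The backend stack is not stated in this thread, and `slow_upstream`, the timeout value, and the fallback payload are hypothetical stand-ins, so treat this as an illustration of the idea rather than a drop-in fix.

```python
import asyncio

async def fetch_with_timeout(call, timeout_s, fallback=None):
    """Await call(), but give up after timeout_s seconds and return fallback."""
    try:
        return await asyncio.wait_for(call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The pending upstream call is cancelled by wait_for at this point.
        return fallback

async def slow_upstream():
    # Hypothetical stand-in for an on-chain or RPC lookup that is slow.
    await asyncio.sleep(5)
    return {"balance": 1}

async def fast_upstream():
    # Hypothetical stand-in for a lookup that answers promptly.
    return {"balance": 2}

if __name__ == "__main__":
    print(asyncio.run(fetch_with_timeout(slow_upstream, 0.1, {"stale": True})))
    print(asyncio.run(fetch_with_timeout(fast_upstream, 0.1, {"stale": True})))
```

The slow call is cut off after 0.1s and the fallback is returned, while the fast call returns its real result. Whether a fallback, a cached value, or an error response is appropriate depends on the endpoint.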


Most of your “error rate” spike is coming from /api/events getting spammed and returning 429 (rate limited) and some 403, judging from the logs. Those responses can still make the app feel “frozen” if the client is retrying aggressively or if the endpoint is consuming shared resources (DB/RPC connection pool, upstream calls). CPU dropping to 0 usually just means no CPU work is being done, and that's completely normal. CORS is unlikely here: if it were CORS, you'd see consistent browser console errors and lots of failing requests, not occasional 30s spikes. Can you paste the HTTP logs from during the 30s spikes?
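If aggressive client retries against the rate-limited endpoint turn out to be part of the problem, a common mitigation is exponential backoff with jitter, deferring to the server's `Retry-After` header when one is sent. The sketch below only computes the delay. The function name and default values are hypothetical, it handles only the seconds form of `Retry-After` (not the HTTP-date form), and whether this API sends that header at all is an assumption.

```python
import random

def retry_delay(attempt, retry_after=None, base=0.5, cap=30.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based).

    retry_after: value of a Retry-After header in seconds, if the server sent one.
    base, cap:   hypothetical tuning values, not taken from the thread.
    rng:         returns a float in [0, 1); injectable so tests are deterministic.
    """
    if retry_after is not None:
        # Defer to the server's own hint when it provides one.
        return float(retry_after)
    # Exponential growth, capped, with "full jitter" so clients spread out.
    ceiling = min(cap, base * 2 ** attempt)
    return rng() * ceiling

if __name__ == "__main__":
    for attempt in range(5):
        print(attempt, round(retry_delay(attempt), 2))
```

A client would sleep for `retry_delay(attempt)` after each 429 and stop after a small, fixed number of attempts instead of retrying immediately.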


angus-lau
PRO · OP

19 days ago

During the 30s spikes, this is what gets posted to the HTTP logs. However, there are also other examples, such as relay, Solana, and Helius calls, that sometimes show 1s as the latency.


19 days ago

Same issue, getting latency spike notifications.


haayhappen
PRO

19 days ago

same

