Sudden high response times
alex-bytes
PROOP

2 months ago

Our Laravel backend, in both the staging and production environments, is seeing response times of up to 30s today, whereas this was rarely ever the case previously.

Yes, all services are in the same region, and yes, we are using an internal DB URL. Request times were fine for the very same codebase yesterday.

Is there anything wrong with the servers today?

Solved

37 Replies

2 months ago

Hey, could I get a link to your project?


alex-bytes
PROOP

2 months ago

cdf1815d-d3de-4820-866b-7fea03248e92


alex-bytes
PROOP

2 months ago

What I think might be happening is that one specific, unoptimized query is being amplified due to whatever issues are occurring right now.


alex-bytes
PROOP

2 months ago

I'll investigate and see if that's the case by optimizing the query.


alex-bytes
PROOP

2 months ago

Optimized the query and brought it down from ~25-30s to 15s in the production environment. Meanwhile, the exact same query on the exact same dataset takes 430ms in staging.


alex-bytes
PROOP

2 months ago

Yeah, staging is fine again... production is taking ages.

We had this happen back in June 2025; it lasted 1-2 days, then stopped.


2 months ago

Are the datasets similar? If it's a larger table (or tables), it could be a missing index or the like.


2 months ago

Probably not but just the first thing that came to mind


alex-bytes
PROOP

2 months ago

The data is identical in both environments


alex-bytes
PROOP

2 months ago

It might not even be the Laravel backend itself, but the Redis cache instead. Most queries are running a bit slower, but fine. This one query does many Redis lookups. I'll see if I can batch them and fix it that way. Then we'll know it's actually Redis being slow.
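The batching idea can be sketched like this (a Python stand-in, not the actual Laravel/PHP code; the `FakeRedis` class and key names are invented for illustration): instead of paying one GET round trip per entity, collect the keys up front and issue a single MGET.

```python
# Hypothetical sketch of the batching fix. A real Laravel app would use
# phpredis/Predis; this just demonstrates the round-trip count difference.

class FakeRedis:
    """Stand-in for a Redis client: get() costs one round trip,
    mget() costs one round trip for any number of keys."""
    def __init__(self, data):
        self.data = data
        self.round_trips = 0

    def get(self, key):
        self.round_trips += 1
        return self.data.get(key)

    def mget(self, keys):
        self.round_trips += 1  # one round trip for the whole batch
        return [self.data.get(k) for k in keys]

keys = [f"entity:{i}" for i in range(90)]
redis = FakeRedis({k: k.upper() for k in keys})

# Before: one GET per entity -> 90 round trips
before = [redis.get(k) for k in keys]
trips_before = redis.round_trips

# After: one MGET for all entities -> 1 round trip
redis.round_trips = 0
after = redis.mget(keys)

assert before == after
print(trips_before, redis.round_trips)  # 90 vs 1
```

The results are identical either way; only the number of network round trips changes, which is what matters once there is any real latency between the app and Redis.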


alex-bytes
PROOP

2 months ago

But again, this is a SaaS with plenty of users and it's been running without issues for months.


2 months ago

What are your redis metrics?


alex-bytes
PROOP

2 months ago

Not much to see there. Barely any load... looks clean. No differences between staging and prod.


alex-bytes
PROOP

2 months ago

Huh, this was it. This one query grew with the database as more entities were added. We'd query Redis once per entity instead of batching (we grew more than anticipated), and batching in production now seems to have reduced the response time to what it was in staging.

Again, staging had the same number of entities since the data is identical, but apparently in production (and for some periods in staging) there was an issue with querying Redis this often. It was 90 calls for one response, to be exact.
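Back-of-the-envelope on why 90 sequential round trips per response amplifies any extra network latency so badly (the RTT values below are hypothetical, not measured in this thread):

```python
calls = 90  # Redis calls per response, per the thread

# Hypothetical round-trip times: sub-millisecond inside one facility vs.
# a slow cross-provider hop. Illustrative numbers only.
for label, rtt_ms in [("fast hop (0.5 ms RTT)", 0.5),
                      ("slow hop (300 ms RTT)", 300.0)]:
    total_ms = calls * rtt_ms
    print(f"{label}: ~{total_ms / 1000:.2f} s of network wait per request")
```

With a fast hop the 90 calls are invisible; with a slow one they alone account for tens of seconds, which is why the same code and data can behave so differently between environments.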


2 months ago

Yeah that'll do it. Glad you found the issue. 🙂


alex-bytes
PROOP

2 months ago

It shouldn't take 30s to run this query, and it usually never did. So while this was not optimized, clearly there's slow infra that exacerbated it to the point of being noticeable, when it usually wasn't?


alex-bytes
PROOP

2 months ago

Feels like I found a fix for a slow query, and I'm glad I did. But that slow query was usually still totally fine.


2 months ago

Do you have an exact time when the slow queries start, more exact than what the Response Time graph shows?


2 months ago

If you've not seen an increase in response times for other types of requests, I'd say it's unlikely to be slow infra.


alex-bytes
PROOP

2 months ago

Sadly not, @Brody, sorry. Around Jan 4th, apparently. But that's way too vague. I don't log response times for GraphQL queries in production, so I can't look it up in the logs.


2 months ago

It's not slow infrastructure; from what I see, it's the physical distance between the backend and Redis/Postgres.

The backend is on GCP and the databases are on Metal.


2 months ago

GraphQL :PepeHands:


alex-bytes
PROOP

2 months ago

Other, super light queries would take roughly 2-3x as long in prod as in staging. So I did see an increase in overall latency when testing between environments today. That one query was simply affected disproportionately worse.


alex-bytes
PROOP

2 months ago

What's wrong with GraphQL? 😄


alex-bytes
PROOP

2 months ago

Nevermind, let's not go there haha


alex-bytes
PROOP

2 months ago

Well, that's not ideal, but I don't see any option beyond the region to choose. I remember we were able to choose between metal and non-metal, but that's no longer visible to me.


2 months ago

It's automatic now, based on availability.


alex-bytes
PROOP

2 months ago

I see..


alex-bytes
PROOP

2 months ago

Well, how much latency are we talking per-request for GCP/Metal setups?


alex-bytes
PROOP

2 months ago

There's no way for us to see which service is located where, right?


2 months ago

You can see the region; we just don't provide visibility into which data center.

I don't have data on GCP <--> Metal latency.


alex-bytes
PROOP

2 months ago

Well, it should be a non-issue even if it's not in exactly the same place, I assume. So what's happening exactly? Hard to pinpoint, I assume? It's just that, with no change in payloads or code, querying Redis shouldn't show such strong variance.

I'll chalk this one up to my unoptimized query and stop taking up more of your time. I'll keep monitoring this and come back if anything changes. It looks good for now.


2 months ago

When you weren't seeing issues, your backend was on metal. When you started seeing issues, your backend was on GCP.

So this was just an unoptimized query (your words, not mine) that got amplified by the extra latency from the backend being in GCP and your databases being on metal.


alex-bytes
PROOP

2 months ago

So our deployments are basically "load-balanced" between metal/GCP?


2 months ago

In a way, yes.


alex-bytes
PROOP

2 months ago

@Brody Checks out. We just got an email from one of our services that our API keys were used from a new IP address: Google Cloud (AS396982 Google LLC) on the 6th of January, 2 PM, in The Dalles, Oregon, US. That's one of the Google data centers.


2 months ago

Yep, that checks out.


Status changed to Solved brody 2 months ago

