2 months ago
Our Laravel backend, in both the staging and production environments, is seeing response times of up to 30s today, whereas this was rarely ever the case previously.
Yes, all services are in the same region, and yes, we are using an internal DB URL. Request times were fine for the very same codebase yesterday.
Is there anything wrong with the servers today?
37 Replies
2 months ago
Hey, could I get a link to your project?
What I think might be happening is that one specific, unoptimized query is being amplified due to whatever issues are occurring right now.
Optimized the query and brought it down from ~25-30s to 15s in production. Meanwhile, the exact same query on the exact same dataset takes 430ms in staging.
Yeah, staging is fine again... production is taking ages.
We've had this happen back in June 2025 and it lasted for 1-2 days, then stopped.
2 months ago
Are the datasets similar? If it's a larger table (or tables), it could be a missing index or the like.
2 months ago
Probably not but just the first thing that came to mind
It might not even be the Laravel backend itself, but the Redis cache instead. Most queries are running a bit slower, but fine. This one query does many Redis lookups. I'll see if I can batch them and fix it that way; then we'd know it's actually Redis being slow.
But again, this is a SaaS with plenty of users and it's been running without issues for months.
2 months ago
What are your redis metrics?
Not much to see there. Barely any load.. Looks clean. No differences between staging / prod
Huh, this was it... This one query grew with the database as more entities were added. We'd query Redis once per entity instead of batching (we grew more than anticipated), and batching in production now seems to have reduced the response time to what it was in staging.
Again, staging had the same number of entities, as the data is identical, but apparently in production (and for some periods in staging) there was an issue with querying Redis this often. It was 90 Redis calls for one response, to be exact.
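For anyone hitting something similar: what's described above is the classic N+1 round-trip pattern. Here's a minimal sketch of the before/after; the actual backend is Laravel/PHP, so this Python version with a fake in-memory client is purely illustrative (with a real redis-py client, the batched call would be `mget` or a pipeline):

```python
# FakeRedis stands in for a real server so this snippet runs anywhere.
# The key names and values are made up for illustration.
class FakeRedis:
    def __init__(self, data):
        self._data = data

    def get(self, key):
        # One network round trip per call against a real server
        return self._data.get(key)

    def mget(self, keys):
        # A single round trip for all keys against a real server
        return [self._data.get(k) for k in keys]


r = FakeRedis({f"entity:{i}": f"value-{i}" for i in range(90)})
keys = [f"entity:{i}" for i in range(90)]

# Before: one Redis round trip per entity (90 round trips per response)
values_slow = [r.get(k) for k in keys]

# After: one batched MGET round trip for all 90 keys
values_fast = r.mget(keys)

assert values_slow == values_fast  # same results, far fewer round trips
```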
2 months ago
Yeah that'll do it. Glad you found the issue. 🙂
It shouldn't take 30s to run this query, and it never used to. So while this query was unoptimized, clearly there's slow infra that amplified it to the point of being noticeable, when it usually wasn't?
Feels like I found a fix for a slow query, and I'm glad I did. But that slow query was usually still totally fine.
2 months ago
Do you have an exact time when the slow queries start, more exact than what the Response Time graph shows?
2 months ago
If you've not seen an increase on response times for other types of requests, I'd say it's unlikely to be slow infra.
Sadly not, @Brody, sorry. Around Jan 4th, apparently, but that's way too vague. I don't log response times for GraphQL queries in production, so I can't look it up in the logs.
2 months ago
It's not slow infrastructure; from what I see, it's the physical distance between the backend and Redis/Postgres.
The backend is on GCP and the databases are on Metal.
2 months ago
GraphQL :PepeHands:
Other, super light queries would take roughly 2-3x as long in prod as in staging, so I did see an increase in overall latency when testing between environments today. That one query was simply affected disproportionately because of its many round trips.
Well, that's not ideal but I don't see any option past the regions to choose. I remember we were able to choose between metal/non-metal but that's no longer visible to me
2 months ago
It's automatic now, based on availability.
2 months ago
You can see the region; we just don't provide visibility into which data center.
I don't have data on GCP <--> Metal latency.
Well, I assume it should be a non-issue even if it's not in exactly the same place. So what's happening, exactly? Hard to pinpoint, I assume? It's just that with no change in payloads or code, querying Redis shouldn't vary this much.
I'll chalk this one up to my unoptimized query and stop taking up more of your time. Will keep monitoring this and come back if anything changes. It looks good for now
2 months ago
When you weren't seeing issues, your backend was on metal. When you started seeing issues, your backend was on GCP.
So this was just an unoptimized query (your words, not mine) that got amplified by the extra latency from the backend being in GCP and your databases being on metal.
2 months ago
In a way, yes.
@Brody Checks out. We just got an email from one of our services that our API keys were used from a new IP address: Google Cloud (AS396982 Google LLC), on January 6th at 2 PM, in The Dalles, Oregon, US. That's one of the Google data centers.
2 months ago
Yep, that checks out.
Status changed to Solved brody • 2 months ago