6 months ago
Project ID: 02edd57b-a5c8-43a8-9e48-93212c876e12
Region: US-West Metal (both database and application)
I'm experiencing random spikes in write performance degradation with a volume-backed Postgres instance.
To rule out a database-specific issue I ran some load-testing benchmarks against a fresh backup of the database restored on my computer, but I can't seem to replicate it.
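For reference, the local check was roughly along these lines (not the exact script; the table name and connection details here are placeholders):

```python
# rough sketch of the local write-latency check (table name and DSN are placeholders)
import statistics
import time

import psycopg2

conn = psycopg2.connect("dbname=restored_backup")  # local restore of the backup
conn.autocommit = True

latencies = []
with conn.cursor() as cur:
    for i in range(1000):
        t0 = time.perf_counter()
        cur.execute("INSERT INTO load_test (payload) VALUES (%s)", (f"row-{i}",))
        latencies.append(time.perf_counter() - t0)

print(f"p50={statistics.median(latencies) * 1000:.2f}ms "
      f"p99={statistics.quantiles(latencies, n=100)[98] * 1000:.2f}ms")
```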
I've noticed there's a possibly related incident that was resolved a couple of days ago; in my case, however, it seems to only impact writes
I can also confirm it's not an issue with write volume (though I don't have any metrics to back this up): the API only exposes 2 write operations and both were running at a pace of about 1 per minute at that moment
Some more info: extra traces with overall duration (rightmost column) to show the variance, details of an overall well-performing trace, and details of the worst-performing trace of that time period
6 months ago
So this is still ongoing as of right now?
i'm standing up a staging environment to confirm, as writes in production depend entirely on some live streams being active, which won't happen until friday 😅
but the latest i can confirm it being ongoing is earlier today at 00:44 UTC-3
6 months ago
Gotcha, please keep me updated.
load testing results look fine on staging, still some high-latency cases but nowhere near as bad as last night. i could attribute these to natural performance variance

i'll monitor again this weekend just in case, but for now i suppose this can be considered solved
6 months ago
I'll keep it open, so please report back on any issues!
gotcha, thanks! i'll also try to set up some postgres metrics ingest over the week, just in case the extra metrics help
6 months ago
Thanks!
Volume performance seems fine, but there might be an issue with networking now? It seems that opening new connections resolves slowly, but any subsequent database commands run fine. At the very least, Redis stats don't seem to indicate a particular latency issue (p99 for XADD is in the microseconds order of magnitude)

not sure if it's related, but sometimes project logs take almost a minute to load on the dashboard as well
5 months ago
resolves slowly, as in dns lookup?
5 months ago
also, what timezone is that timestamp?
UTC-3, sorry! and i'm unsure about where exactly it could be failing. let me see if i can get some more detailed metrics
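something like this should split dns time from tcp connect time (just a sketch; host and port are placeholders for the private database hostname):

```python
# quick probe to separate dns lookup time from tcp connect time
# (host/port are placeholders for the private hostname)
import socket
import time

def probe(host, port):
    t0 = time.perf_counter()
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM)[0]
    t1 = time.perf_counter()
    with socket.socket(family, socktype, proto) as s:
        s.connect(sockaddr)
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1  # (dns seconds, connect seconds)

dns_s, connect_s = probe("db.internal", 5432)
print(f"dns={dns_s * 1000:.1f}ms connect={connect_s * 1000:.1f}ms")
```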
5 months ago
It could be DNS, we saw a massive spike in lookup time from 2025-06-27T23:07:35.863Z to 2025-06-28T00:02:43.649Z (UTC)
5 months ago
is there any scenario in your code where an XADD would open a new connection (and thus resolve the private domain again)?
5 months ago
because when looking at data for RTT minus dns lookup times, the actual transport times are fine
that's what i'm trying to figure out. going just by the code, i'm using redis connection pooling, and the prometheus metrics don't show a fluctuation in the amount of open connections over time
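for reference, the open-connections gauge is wired up roughly like this (the metric name here is made up, and it reads a private redis-py counter, so treat it as a best-effort sketch):

```python
# rough equivalent of the pool gauge (metric name is made up here, and
# _created_connections is a private redis-py counter, so best-effort only)
import redis
from prometheus_client import Gauge

POOL = redis.ConnectionPool(host="redis.internal", port=6379)  # placeholder host
OPEN_CONNS = Gauge("redis_pool_open_connections",
                   "connections currently held by the redis pool")

def sample_pool():
    OPEN_CONNS.set(POOL._created_connections)
```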
5 months ago
i think it's a safe bet to say it was dns lookup time
5 months ago
and in that regard, ive spoken to one of our infra engineers and they identified the issue
makes sense. i think i identified my part of the issue: the redis python client has multiple ways of handling connection pools, one of them auto-closing connections after usage 😅
so handling it differently would've lessened the user impact of this, since my postgres config already uses long-lived connections
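roughly the difference (a sketch only; the hostname is a placeholder and my real code is structured differently):

```python
import redis

# the pattern that bit me (roughly): a fresh client per call means the pool
# and its connections get torn down after each use, so every XADD can
# re-resolve the hostname and open a new connection
def xadd_per_call(host, stream, fields):
    r = redis.Redis(host=host)  # new pool each call
    try:
        r.xadd(stream, fields)
    finally:
        r.connection_pool.disconnect()  # connections dropped; next call resolves dns again

# long-lived alternative: one shared pool per process, connections reused
SHARED_POOL = redis.ConnectionPool(host="redis.internal", port=6379)  # placeholder host
CLIENT = redis.Redis(connection_pool=SHARED_POOL)

def xadd_shared(stream, fields):
    CLIENT.xadd(stream, fields)  # reuses an idle connection when available
```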
5 months ago
yeah, a pool min greater than zero would have helped. even if there isn't increased latency, dns lookup times aren't the best they could be
i'll keep it in mind for future improvements. turns out the python redis lib doesn't expose an obvious way to initialize a pool with a minimum amount of connections. i have a vague idea of how to do it manually, but i'd rather sleep through the headache beforehand lol
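for the record, the vague idea is something like this (untested sketch; it leans on redis-py's public get_connection/release, and the count is arbitrary):

```python
# untested sketch: grab N connections from the pool up front, ping them to
# verify they're usable, then hand them back so they stay open for reuse
import redis

def prewarm(pool: redis.ConnectionPool, count: int = 4) -> None:
    conns = [pool.get_connection("PING") for _ in range(count)]
    for conn in conns:
        conn.send_command("PING")
        conn.read_response()
    for conn in conns:
        pool.release(conn)  # connection stays open, ready for the next command

pool = redis.ConnectionPool(host="redis.internal", port=6379)  # placeholder host
prewarm(pool)
client = redis.Redis(connection_pool=pool)
```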
5 months ago
sounds good!