Potential volume write degradation
skyaviatour
HOBBYOP

6 months ago

Project ID: 02edd57b-a5c8-43a8-9e48-93212c876e12
Region: US-West Metal (both database and application)

I'm experiencing random spikes of degraded write performance on a volume-backed Postgres instance.

To rule out a database-specific issue, I ran some load-testing benchmarks against a fresh backup of the database on my local machine, but I can't seem to replicate the problem there.
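For reference, the local check was roughly along these lines (a minimal sketch, assuming psycopg2 and a throwaway bench table; the DSN and table name are placeholders, and the real benchmark runs against the restored backup's actual schema):

```python
import statistics
import time

import psycopg2

# Placeholder DSN for the locally restored backup.
DSN = "dbname=app_backup user=postgres host=localhost"


def time_writes(n: int = 500) -> None:
    """Time n small INSERTs and report p50/p99 write latency in milliseconds."""
    conn = psycopg2.connect(DSN)
    conn.autocommit = True
    latencies = []
    with conn.cursor() as cur:
        cur.execute(
            "CREATE TABLE IF NOT EXISTS write_bench (id serial PRIMARY KEY, payload text)"
        )
        for i in range(n):
            start = time.perf_counter()
            cur.execute("INSERT INTO write_bench (payload) VALUES (%s)", (f"row-{i}",))
            latencies.append((time.perf_counter() - start) * 1000)
    conn.close()
    latencies.sort()
    print(f"p50={statistics.median(latencies):.2f}ms  p99={latencies[int(n * 0.99)]:.2f}ms")


if __name__ == "__main__":
    time_writes()
```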

32 Replies

skyaviatour
HOBBYOP

6 months ago

I've noticed an incident that was resolved a couple of days ago which might be related; however, in my case it seems to only impact writes.


skyaviatour
HOBBYOP

6 months ago

I can also confirm it's not an issue with write volume (even though I don't have any metrics to back this up): the API only exposes 2 write operations, and both were running at a pace of about 1 per minute at that moment.


skyaviatour
HOBBYOP

6 months ago

some more info: extra traces with overall duration (rightmost column) to show the variance, details of an overall well-performing trace, and details of the worst-performing trace in that time period

[attachments: trace list with overall durations, a well-performing trace, the worst-performing trace]


brody
EMPLOYEE

6 months ago

So this is still ongoing as of right now?


skyaviatour
HOBBYOP

6 months ago

i'm standing up a staging environment to confirm, as writes in production depend entirely on some live streams being active, which won't happen until friday 😅


skyaviatour
HOBBYOP

6 months ago

but the latest i can confirm it was still ongoing is earlier today at 00:44 UTC-3


brody
EMPLOYEE

6 months ago

Gotcha, please keep me updated.


skyaviatour
HOBBYOP

6 months ago

load testing results look fine on staging, still some high-latency cases but nowhere near as bad as last night. i could chalk these up to natural performance variance

[attachment: staging load-test results]


skyaviatour
HOBBYOP

6 months ago

i'll monitor again this weekend just in case, but for now i suppose this can be considered solved


brody
EMPLOYEE

6 months ago

I'll keep it open, so please report back on any issues!


skyaviatour
HOBBYOP

6 months ago

gotcha, thanks! i'll also try to set up some postgres metrics ingest over the week, just in case the extra metrics help


brody
EMPLOYEE

6 months ago

Thanks!


skyaviatour
HOBBYOP

5 months ago

Volume performance seems fine, but there might be an issue with networking now? It seems that opening new connections resolves slowly, but any subsequent database commands run fine. At the very least, Redis stats don't seem to indicate a particular latency issue (p99 for XADD is on the order of microseconds).

[image attachment]
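For what it's worth, this is the kind of probe I've been using to split connection-setup cost from command cost (rough sketch; REDIS_HOST/REDIS_PORT are placeholders for the private-network hostname and port):

```python
import socket
import time

import redis

# Placeholders for the private-network hostname and port.
REDIS_HOST = "redis.internal"
REDIS_PORT = 6379


def probe() -> None:
    """Time DNS resolution, the first (cold) command, and a warm command
    separately, so slow name resolution doesn't get blamed on the datastore."""
    t0 = time.perf_counter()
    socket.getaddrinfo(REDIS_HOST, REDIS_PORT)
    dns_ms = (time.perf_counter() - t0) * 1000

    t1 = time.perf_counter()
    client = redis.Redis(host=REDIS_HOST, port=REDIS_PORT)
    client.ping()  # opens the first pooled connection: resolve + connect + command
    cold_ms = (time.perf_counter() - t1) * 1000

    t2 = time.perf_counter()
    client.xadd("probe-stream", {"ts": time.time()})  # reuses the warm connection
    warm_ms = (time.perf_counter() - t2) * 1000

    print(f"dns={dns_ms:.1f}ms  cold_command={cold_ms:.1f}ms  warm_xadd={warm_ms:.2f}ms")


if __name__ == "__main__":
    probe()
```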


skyaviatour
HOBBYOP

5 months ago

not sure if it's related, but sometimes project logs also take almost a minute to load on the dashboard


skyaviatour
HOBBYOP

5 months ago

[image attachment]


brody
EMPLOYEE

5 months ago

resolves slowly, as in dns lookup?


brody
EMPLOYEE

5 months ago

also, what timezone is that timestamp?


skyaviatour
HOBBYOP

5 months ago

UTC-3, sorry! and i'm unsure about where exactly it could be failing. let me see if i can get some more detailed metrics


skyaviatour
HOBBYOP

5 months ago

it doesn't seem to be happening anymore, at least


brody
EMPLOYEE

5 months ago

It could be DNS; we saw a massive spike in lookup times from 2025-06-27T23:07:35.863Z to 2025-06-28T00:02:43.649Z (UTC)


skyaviatour
HOBBYOP

5 months ago

that would track, i have these logs from around 23:12 (UTC) onward

[attachment: logs from ~23:12 UTC onward]


skyaviatour
HOBBYOP

5 months ago

which correlate with traces showing long networking pauses


brody
EMPLOYEE

5 months ago

is there any scenario in your code where an XADD would open a new connection (and thus resolve the private domain again)?


brody
EMPLOYEE

5 months ago

because when looking at RTT minus DNS lookup times, the actual transport times are fine


skyaviatour
HOBBYOP

5 months ago

that's what i'm trying to figure out. going just by the code, i'm using redis connection pooling, and the prometheus metrics don't show any fluctuation in the number of open connections over time
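to be clear about what i mean by pool metrics, the gauges come from something like this (rough sketch: it reads private redis-py pool attributes, so it's a best-effort probe rather than an official API, and the hostname is a placeholder):

```python
import redis
from prometheus_client import Gauge, start_http_server

# Placeholder host; in the app this is the single shared pool.
POOL = redis.ConnectionPool(host="redis.internal", port=6379, max_connections=20)
CLIENT = redis.Redis(connection_pool=POOL)

# NOTE: _created_connections and _in_use_connections are private redis-py
# internals, so treat these gauges as a best-effort probe that may break
# between library versions.
POOL_CREATED = Gauge("redis_pool_created_connections", "Connections created by the pool")
POOL_IN_USE = Gauge("redis_pool_in_use_connections", "Connections currently checked out")
POOL_CREATED.set_function(lambda: POOL._created_connections)
POOL_IN_USE.set_function(lambda: len(POOL._in_use_connections))

start_http_server(9100)  # scrape endpoint; the port here is arbitrary
```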


brody
EMPLOYEE

5 months ago

i think it's a safe bet to say it was DNS lookup time


brody
EMPLOYEE

5 months ago

and in that regard, i've spoken to one of our infra engineers and they identified the issue


skyaviatour
HOBBYOP

5 months ago

makes sense. i think i've identified my part of the issue: the redis python client has multiple ways of handling connection pools, and one of them auto-closes connections after use 😅
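not my exact code, but the gist of the two patterns is something like this (simplified sketch with the sync redis-py client and a placeholder hostname):

```python
import redis

REDIS_HOST = "redis.internal"  # placeholder for the private-network hostname


# What the code was effectively doing: a throwaway client (and pool) per
# operation, so every call starts cold and pays DNS lookup + TCP connect again.
def publish_short_lived(event: dict) -> None:
    r = redis.Redis(host=REDIS_HOST, port=6379)
    r.xadd("events", event)
    r.close()  # nothing from this one-off pool gets reused


# What it should do: one long-lived shared pool, so warm connections (and
# their already-resolved addresses) are reused across calls.
POOL = redis.ConnectionPool(host=REDIS_HOST, port=6379, max_connections=20)
CLIENT = redis.Redis(connection_pool=POOL)


def publish_pooled(event: dict) -> None:
    CLIENT.xadd("events", event)
```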


skyaviatour
HOBBYOP

5 months ago

so handling the pool differently would've lessened the user impact of this, since my postgres config already uses long-lived connections


brody
EMPLOYEE

5 months ago

yeah, a pool minimum greater than zero would have helped. even when there isn't increased latency, DNS lookup times aren't the best they could be


skyaviatour
HOBBYOP

5 months ago

i'll keep it in mind for future improvements. turns out the python redis lib doesn't expose an obvious way to initialize a pool with a minimum number of connections. i have a vague idea of how to do it manually (sketched below), but i'd rather sleep through the headache beforehand lol
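the vague idea, for posterity (rough sketch; get_connection/release are redis-py ConnectionPool methods, though their exact signatures vary a bit between versions, and the hostname is a placeholder):

```python
import redis

# Placeholder host; in practice this is the application's shared pool.
POOL = redis.ConnectionPool(host="redis.internal", port=6379, max_connections=20)
CLIENT = redis.Redis(connection_pool=POOL)


def prewarm(pool: redis.ConnectionPool, minimum: int = 4) -> None:
    """Check out `minimum` connections and hand them straight back, so the pool
    already holds warm, connected sockets before real traffic arrives."""
    conns = [pool.get_connection("PING") for _ in range(minimum)]
    for conn in conns:
        pool.release(conn)


prewarm(POOL)  # run once at startup
```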


brody
EMPLOYEE

5 months ago

sounds good!

