Potential volume write degradation
skyaviatour
HOBBYOP

6 months ago

Project ID: 02edd57b-a5c8-43a8-9e48-93212c876e12
Region: US-West Metal (both database and application)

I'm experiencing random spikes of degraded write performance on a volume-backed Postgres instance.

To rule out a database-specific issue, I ran some load-testing benchmarks against a fresh backup of the database on my local machine, but I can't seem to replicate the problem there.
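For reference, the local check was roughly along these lines (a minimal sketch, assuming psycopg2 and a throwaway bench table; the DSN and table name are placeholders, and the real benchmark runs against the restored backup's actual schema):

```python
import statistics
import time

import psycopg2

# Placeholder DSN for the locally restored backup.
DSN = "dbname=app_backup user=postgres host=localhost"


def time_writes(n: int = 500) -> None:
    """Time n small INSERTs and report p50/p99 write latency in milliseconds."""
    conn = psycopg2.connect(DSN)
    conn.autocommit = True
    latencies = []
    with conn.cursor() as cur:
        cur.execute(
            "CREATE TABLE IF NOT EXISTS write_bench (id serial PRIMARY KEY, payload text)"
        )
        for i in range(n):
            start = time.perf_counter()
            cur.execute("INSERT INTO write_bench (payload) VALUES (%s)", (f"row-{i}",))
            latencies.append((time.perf_counter() - start) * 1000)
    conn.close()
    latencies.sort()
    print(f"p50={statistics.median(latencies):.2f}ms  p99={latencies[int(n * 0.99)]:.2f}ms")


if __name__ == "__main__":
    time_writes()
```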

32 Replies

skyaviatour
HOBBYOP

6 months ago

I've noticed an incident that was resolved a couple of days ago which might be related; however, in my case it seems to only impact writes.


skyaviatour
HOBBYOP

6 months ago

I can also confirm it's not an issue with write volume (even though I don't have any metrics to back this up): the API only exposes 2 write operations, and both were running at a pace of about 1 per minute at that moment.


skyaviatour
HOBBYOP

6 months ago

some more info: extra traces with overall duration (rightmost column) to show the variance, details of an overall well-performing trace, and details of the worst-performing trace in that time period

[attachments: trace list with overall durations, a well-performing trace, the worst-performing trace]


brody
EMPLOYEE

6 months ago

So this is still ongoing as of right now?


skyaviatour
HOBBYOP

6 months ago

i'm standing up a staging environment to confirm, as writes in production depend entirely on some live streams being active, which won't happen until friday 😅


skyaviatour
HOBBYOP

6 months ago

but the latest i can confirm it was still ongoing is earlier today at 00:44 UTC-3


brody
EMPLOYEE

6 months ago

Gotcha, please keep me updated.


skyaviatour
HOBBYOP

6 months ago

load testing results look fine on staging, still some high-latency cases but nowhere near as bad as last night. i could chalk these up to natural performance variance

[attachment: staging load-test results]


skyaviatour
HOBBYOP

6 months ago

i'll monitor again this weekend just in case, but for now i suppose this can be considered solved


brody
EMPLOYEE

6 months ago

I'll keep it open, so please report back on any issues!


skyaviatour
HOBBYOP

6 months ago

gotcha, thanks! i'll also try to set up some postgres metrics ingest over the week, just in case the extra metrics help


brody
EMPLOYEE

6 months ago

Thanks!


skyaviatour
HOBBYOP

5 months ago

Volume performance seems fine, but there might be an issue with networking now? It seems that opening new connections resolves slowly, but any subsequent database commands run fine. At the very least, Redis stats don't seem to indicate a particular latency issue (p99 for XADD is on the order of microseconds).

[image attachment]
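For what it's worth, this is the kind of probe I've been using to split connection-setup cost from command cost (rough sketch; REDIS_HOST/REDIS_PORT are placeholders for the private-network hostname and port):

```python
import socket
import time

import redis

# Placeholders for the private-network hostname and port.
REDIS_HOST = "redis.internal"
REDIS_PORT = 6379


def probe() -> None:
    """Time DNS resolution, the first (cold) command, and a warm command
    separately, so slow name resolution doesn't get blamed on the datastore."""
    t0 = time.perf_counter()
    socket.getaddrinfo(REDIS_HOST, REDIS_PORT)
    dns_ms = (time.perf_counter() - t0) * 1000

    t1 = time.perf_counter()
    client = redis.Redis(host=REDIS_HOST, port=REDIS_PORT)
    client.ping()  # opens the first pooled connection: resolve + connect + command
    cold_ms = (time.perf_counter() - t1) * 1000

    t2 = time.perf_counter()
    client.xadd("probe-stream", {"ts": time.time()})  # reuses the warm connection
    warm_ms = (time.perf_counter() - t2) * 1000

    print(f"dns={dns_ms:.1f}ms  cold_command={cold_ms:.1f}ms  warm_xadd={warm_ms:.2f}ms")


if __name__ == "__main__":
    probe()
```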


skyaviatour
HOBBYOP

5 months ago

not sure if it's related, but sometimes project logs also take almost a minute to load on the dashboard


skyaviatour
HOBBYOP

5 months ago

[image attachment]


brody
EMPLOYEE

5 months ago

resolves slowly, as in dns lookup?


brody
EMPLOYEE

5 months ago

also, what timezone is that timestamp?


skyaviatour
HOBBYOP

5 months ago

UTC-3, sorry! and i'm unsure about where exactly it could be failing. let me see if i can get some more detailed metrics


skyaviatour
HOBBYOP

5 months ago

it doesn't seem to be happening anymore, at least


brody
EMPLOYEE

5 months ago

It could be DNS; we saw a massive spike in lookup times from 2025-06-27T23:07:35.863Z to 2025-06-28T00:02:43.649Z (UTC)


skyaviatour
HOBBYOP

5 months ago

that would track, i have these logs from around 23:12 (UTC) onward

[attachment: logs from ~23:12 UTC onward]


skyaviatour
HOBBYOP

5 months ago

which correlate with traces showing long networking pauses


brody
EMPLOYEE

5 months ago

is there any scenario in your code where an XADD would open a new connection (and thus resolve the private domain again)?


brody
EMPLOYEE

5 months ago

because when looking at RTT minus DNS lookup times, the actual transport times are fine


skyaviatour
HOBBYOP

5 months ago

that's what i'm trying to figure out. going just by the code, i'm using redis connection pooling, and the prometheus metrics don't show any fluctuation in the number of open connections over time
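to be clear about what i mean by pool metrics, the gauges come from something like this (rough sketch: it reads private redis-py pool attributes, so it's a best-effort probe rather than an official API, and the hostname is a placeholder):

```python
import redis
from prometheus_client import Gauge, start_http_server

# Placeholder host; in the app this is the single shared pool.
POOL = redis.ConnectionPool(host="redis.internal", port=6379, max_connections=20)
CLIENT = redis.Redis(connection_pool=POOL)

# NOTE: _created_connections and _in_use_connections are private redis-py
# internals, so treat these gauges as a best-effort probe that may break
# between library versions.
POOL_CREATED = Gauge("redis_pool_created_connections", "Connections created by the pool")
POOL_IN_USE = Gauge("redis_pool_in_use_connections", "Connections currently checked out")
POOL_CREATED.set_function(lambda: POOL._created_connections)
POOL_IN_USE.set_function(lambda: len(POOL._in_use_connections))

start_http_server(9100)  # scrape endpoint; the port here is arbitrary
```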


brody
EMPLOYEE

5 months ago

i think it's a safe bet to say it was DNS lookup time


brody
EMPLOYEE

5 months ago

and in that regard, i've spoken to one of our infra engineers and they identified the issue


skyaviatour
HOBBYOP

5 months ago

makes sense. i think i've identified my part of the issue: the redis python client has multiple ways of handling connection pools, and one of them auto-closes connections after use 😅
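not my exact code, but the gist of the two patterns is something like this (simplified sketch with the sync redis-py client and a placeholder hostname):

```python
import redis

REDIS_HOST = "redis.internal"  # placeholder for the private-network hostname


# What the code was effectively doing: a throwaway client (and pool) per
# operation, so every call starts cold and pays DNS lookup + TCP connect again.
def publish_short_lived(event: dict) -> None:
    r = redis.Redis(host=REDIS_HOST, port=6379)
    r.xadd("events", event)
    r.close()  # nothing from this one-off pool gets reused


# What it should do: one long-lived shared pool, so warm connections (and
# their already-resolved addresses) are reused across calls.
POOL = redis.ConnectionPool(host=REDIS_HOST, port=6379, max_connections=20)
CLIENT = redis.Redis(connection_pool=POOL)


def publish_pooled(event: dict) -> None:
    CLIENT.xadd("events", event)
```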


skyaviatour
HOBBYOP

5 months ago

so handling the pool differently would've lessened the user impact of this, since my postgres config already uses long-lived connections


brody
EMPLOYEE

5 months ago

yeah, a pool minimum greater than zero would have helped. even when there isn't increased latency, DNS lookup times aren't the best they could be


skyaviatour
HOBBYOP

5 months ago

i'll keep it in mind for future improvements. turns out the python redis lib doesn't expose an obvious way to initialize a pool with a minimum number of connections. i have a vague idea of how to do it manually (sketched below), but i'd rather sleep through the headache beforehand lol
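the vague idea, for posterity (rough sketch; get_connection/release are redis-py ConnectionPool methods, though their exact signatures vary a bit between versions, and the hostname is a placeholder):

```python
import redis

# Placeholder host; in practice this is the application's shared pool.
POOL = redis.ConnectionPool(host="redis.internal", port=6379, max_connections=20)
CLIENT = redis.Redis(connection_pool=POOL)


def prewarm(pool: redis.ConnectionPool, minimum: int = 4) -> None:
    """Check out `minimum` connections and hand them straight back, so the pool
    already holds warm, connected sockets before real traffic arrives."""
    conns = [pool.get_connection("PING") for _ in range(minimum)]
    for conn in conns:
        pool.release(conn)


prewarm(POOL)  # run once at startup
```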


brody
EMPLOYEE

5 months ago

sounds good!

