Help Diagnosing Nightly Response Time Spike
pedrommcarrasco
PROOP

a year ago

Good evening,

I’m reaching out with a somewhat vague issue in the hope that someone here has experienced something similar and can offer insights or advice.

To get straight to the point, we have a PostgreSQL database connected to a Node.js API, with Redis handling background task management. Every night, starting at 11:30 PM UTC and lasting for approximately 30 minutes, we encounter a significant response time spike. During this period, our API’s average response time jumps from milliseconds to over 60 seconds, often resulting in timeouts.

This recurring issue is severely impacting performance, and I’m wondering if anyone has encountered a similar scenario. Could this be related to a PostgreSQL default configuration or some overlooked process running in the background?

Any suggestions or shared experiences would be greatly appreciated!

Thank you!

Solved

18 Replies

pedrommcarrasco
PROOP

a year ago

871738a3-42f1-4feb-89e8-cd361440240e


mikexie360
PRO

a year ago

If it is happening at the same time every night, it could be a CRON job or something weird with the datetime. I would check there.
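For what it's worth, scheduled jobs can also live inside Postgres itself rather than the OS crontab. A minimal sketch, assuming the pg_cron extension happens to be installed (it may not be on this setup):

```sql
-- List jobs scheduled inside Postgres via pg_cron.
-- These will NOT appear in the operating system's crontab.
SELECT jobid, schedule, command, active
FROM cron.job;
```

If the extension isn't installed, this query errors out with "relation cron.job does not exist", which itself rules out pg_cron as the culprit.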


pedrommcarrasco
PROOP

a year ago

As far as we can tell, there isn’t a CRON job scheduled to run at those times—at least, not one that we configured. Could you clarify what you mean by ‘something weird with the datetime’? A bit more detail would help us understand and investigate further.


maddsua
HOBBY

a year ago

That was probably a typo and they meant a database


maddsua
HOBBY

a year ago

Anyhow, have you noticed any usage metrics deviations during these slowdown periods?


pedrommcarrasco
PROOP

a year ago

According to Railway’s dashboard (covering the last 7 days) and Axiom’s user-level usage data, there doesn’t appear to be any correlation. I’ve also attached a report from Axiom showing the average elapsed time during these spikes for further reference.

p.s.: Red is the Postgres database on Railway

[image attachments: Axiom average-elapsed-time report]


brody
EMPLOYEE

a year ago

looks like there is definitely some kind of concentrated activity on two of the metrics there


pedrommcarrasco
PROOP

a year ago

Nothing unusual on Railway; however, for reference, it’s expected for our product to see increased activity during those hours, as the majority of our users are based in the US.

[image attachment]


brody
EMPLOYEE

a year ago

do you have any tracing for debugging in your app?


pedrommcarrasco
PROOP

a year ago

It’s a mobile app designed specifically for iOS and macOS. From our logs, nothing unusual stands out. However, the fact that this issue consistently occurs around the same time suggests it’s unlikely to be user-driven, especially since activity spikes to extreme levels within just 60 seconds. For example, last night, everything was functioning as expected at 11:20, but by 11:30, things had suddenly gone haywire.
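Since the window is predictable, one cheap thing to try is snapshotting what the database is actually doing while the spike is in progress. A sketch of the kind of query you could run against the Postgres instance around 11:30 PM UTC:

```sql
-- Run during the slowdown window: shows every non-idle backend,
-- how long its current query has been running, and what it is waiting on.
-- Autovacuum workers show up here with queries like 'autovacuum: VACUUM ...'.
SELECT pid,
       now() - query_start AS runtime,
       state,
       wait_event_type,
       wait_event,
       left(query, 80)    AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC;
```

If the top rows are all waiting on locks or I/O, the wait_event columns usually point straight at the cause.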


brody
EMPLOYEE

a year ago

I shall take that as a no.

I would highly recommend adding tracing to your backend application; it will give you far more insight and help you more than anything anyone here could say


pedrommcarrasco
PROOP

a year ago

@rubenamorim might be in a better position to answer that, I'm sorry but I'm mostly focused on the client side of it 😅


pedrommcarrasco
PROOP

a year ago

Closing this post as we believe we've found the source of this issue. Regardless, thank you for mentioning tracing, we'll add it soon.


brody
EMPLOYEE

a year ago

well please do tell us the source of the issue!?


pedrommcarrasco
PROOP

a year ago

It seemed to be a mix of autovacuum and unoptimized queries. Once we optimized them with indexes, things seem to be working out better than ever.
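For anyone landing here with the same symptoms, a rough sketch of how you might confirm this diagnosis. The table and column names below (events, created_at) are purely hypothetical examples, not from the original poster's schema:

```sql
-- See which tables accumulate the most dead tuples and when
-- autovacuum last ran on them: a nightly pile-up here fits the symptoms.
SELECT relname, n_dead_tup, n_live_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

-- Hypothetical fix: index a column used in a hot WHERE clause.
-- CONCURRENTLY builds the index without blocking writes.
CREATE INDEX CONCURRENTLY idx_events_created_at
    ON events (created_at);
```

Pairing this with EXPLAIN (ANALYZE) on the slowest queries usually shows whether a sequential scan was the real cost.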


brody
EMPLOYEE

a year ago

ah gotcha, thank you for sharing!


brody
EMPLOYEE

a year ago

I shall mark as solved


brody
EMPLOYEE

a year ago

!s


Status changed to Solved by brody about 1 year ago

