3 months ago
Over the past few days, we've been observing that some database queries which typically execute in less than 10ms occasionally take upwards of 10s.
We haven't been able to identify anything in our application or database config that has led to this new behavior, so we're wondering if there have been any changes in Railway's infrastructure that may be relevant.
For context, we were experiencing issues with Metal about a month ago, and since reverting back to the legacy GCP servers, our typical DB performance has been and continues to be good. The performance issue that we're now experiencing seems to be sporadic, starting yesterday morning, and the patterns are similar across instances of our application in multiple Railway environments.
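For concreteness, a check along these lines against pg_stat_statements is the kind of thing we could run on our side to confirm whether the slowness is happening inside Postgres itself - just a sketch, assuming the extension is enabled; the DSN is a placeholder and the column names are for Postgres 13+ (older versions use mean_time/max_time):

```python
# Sketch: look at server-side statement timings via pg_stat_statements.
# Assumes the extension is enabled; columns shown are for Postgres 13+.
import psycopg2

DSN = "postgresql://user:password@host:5432/railway"  # placeholder, not our real DSN

SQL = """
    SELECT query, calls, mean_exec_time, max_exec_time
    FROM pg_stat_statements
    ORDER BY max_exec_time DESC
    LIMIT 20
"""

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur:
        cur.execute(SQL)
        for stmt, calls, mean_ms, max_ms in cur.fetchall():
            print(f"{max_ms:9.1f} ms max | {mean_ms:8.1f} ms mean | {calls:6d} calls | {stmt[:60]}")
```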
Have there been any recent changes on Railway's end that could be causing this?
10 Replies
3 months ago
Hey there! We've found the following might help you get unblocked faster:
If you find the answer from one of these, please let us know by solving the thread!
3 months ago
Hello,
That is extremely odd. I had a look at the metrics for the last few days for the host your Postgres database is on, and that host is extremely cold while also being stable on all metrics. There are very few other containers on the host as well, so unfortunately, no smoking guns from our side that I could help you with.
But with that said, since you last talked with Angelo we have significantly improved performance on Metal. Metal hosts are now shown to outperform Legacy by a wide margin, which, for the record, was far from the case back then, so I would honestly recommend switching back to Metal now.
Status changed to Awaiting User Response Railway • 3 months ago
3 months ago
Thanks for checking, and thanks for the update about Metal - we'll try switching back at some point soon.
It's likely that whatever is going on with these queries is on our end so we'll keep investigating.
Status changed to Awaiting Railway Response Railway • 3 months ago
3 months ago
Sounds good, let me know how your Metal experience is when you switch back!
Status changed to Awaiting User Response Railway • 3 months ago
3 months ago
We've continued to investigate the database issues and have still been unable to resolve them.
We do intend to migrate back to Metal soon, but we'd like to keep investigating this first in an effort to truly understand the root cause.
One strange thing about the issue is that the slow queries spike at the same time across all of our environments (prod, staging, and dev), despite the three environments having different enough usage patterns that we wouldn't expect them to be so tightly correlated.
Our recent spikes occurred on:
Aug 27 between 09:00 and 12:00 PST
Aug 28 between 07:00 and 11:00 PST
Sep 2 between 12:00 and 20:00 PST
Just to check again, is there anything on Railway's end that might be relevant here? Or have you seen similar reports from other customers?
Status changed to Awaiting Railway Response Railway • 3 months ago
3 months ago
Hello,
Could you link me to the database services in the environments you mentioned so that I can check the host metrics for the given timestamps to see if I can find any possible correlations on our end?
As for similar reports, no, we haven't seen any. You basically have the legacy hosts to yourself, as they are currently operating at 1% of their typical capacity.
Best,
Brody
Status changed to Awaiting User Response Railway • 3 months ago
3 months ago
Hi Brody, I'm one of James's coworkers. We've observed these correlated slowdowns from:
- dev database (environment id 1bbd16ba-7683-4b89-9dba-87873e04fb07, service id bd2f05ba-5b70-4e38-abd1-eb14018fb136)
- staging database (environment id 7e771c7e-0944-464f-b372-eb9cc03bb3d7, service id bd2f05ba-5b70-4e38-abd1-eb14018fb136)
- production database (environment id fcc2d6e8-7c56-4b9e-8e05-5b8b0cb180dc, service id bd2f05ba-5b70-4e38-abd1-eb14018fb136)
Our data comes from looking at Sentry-generated spans in our backend server; we've since enabled slow query logging in each of those Postgres instances to get more precise data, but haven't seen a slowdown since doing so. That means the times we're seeing include network round trips as well as the query execution time within Postgres itself.

To our knowledge, we didn't change anything relevant in the lead-up to these slowdowns, and they've been correlated across environments that share very little in common (different usage patterns, different deploy cadences, etc.). It makes sense that the host isn't seeing high CPU/memory utilization if not much is running on it - is it possible there are intermittent network or disk latencies involved?
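The slow query logging we enabled is roughly the standard log_min_duration_statement approach, something like the sketch below (the 100ms threshold and connection string are illustrative, not necessarily our exact values):

```python
# Sketch of enabling Postgres slow query logging via log_min_duration_statement.
# Threshold and DSN are illustrative placeholders.
import psycopg2

conn = psycopg2.connect("postgresql://user:password@host:5432/railway")
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
with conn.cursor() as cur:
    # Log any statement whose server-side execution exceeds 100 ms
    cur.execute("ALTER SYSTEM SET log_min_duration_statement = '100ms'")
    # Reload config so the setting takes effect without a restart
    cur.execute("SELECT pg_reload_conf()")
conn.close()
```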
Any advice or direction you could give us to help understand what we're seeing would be helpful.
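In the meantime, one idea we have for ruling network latency in or out is a lightweight probe that repeatedly times a trivial SELECT 1 round trip from the app container - a sketch along these lines (interval, threshold, and DSN are arbitrary placeholders):

```python
# Sketch: periodically time a trivial round trip to the database to catch
# intermittent network latency independently of real query execution time.
import time
import psycopg2

DSN = "postgresql://user:password@host:5432/railway"  # placeholder

conn = psycopg2.connect(DSN)
conn.autocommit = True
with conn.cursor() as cur:
    while True:
        start = time.monotonic()
        cur.execute("SELECT 1")
        cur.fetchone()
        elapsed_ms = (time.monotonic() - start) * 1000
        if elapsed_ms > 100:  # well above a normal in-region round trip
            print(f"slow round trip: {elapsed_ms:.1f} ms")
        time.sleep(5)
```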
Status changed to Awaiting Railway Response Railway • 3 months ago
3 months ago
I spent a good amount of time staring at dozens of metrics for the host your database is on, and there were no out of the ordinary spikes for your given timestamps. For that matter, there are no out of the ordinary spikes on that host for the entire last month either.
I would like to do all I can for you. I just don't know what else I could look at on my side. Everything I look at is telling me that the host is doing basically nothing on every metric point.
Status changed to Awaiting User Response Railway • 3 months ago
3 months ago
Thanks for continuing to dig into this with us. We're going to try moving our pre-production environments back over to Metal and continue to monitor the DB performance. I'll post more updates here if we have more findings on our end.
Status changed to Awaiting Railway Response Railway • 3 months ago
Status changed to Awaiting User Response Railway • 3 months ago
3 months ago
This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!
Status changed to Solved Railway • 3 months ago