3 days ago
Hi Railway Team,
I am experiencing severe, intermittent latency spikes on my API service that do not correlate with resource exhaustion. My p99 response times are reaching 10s to 30s, while the p50 remains stable and low.
Observed Behavior:
Low Resource Usage: Both API and Database metrics show CPU and RAM usage at less than 5% during the spikes.
Randomness: The spikes are not isolated to a specific endpoint; they occur across the entire API surface.
p99 vs p50 Divergence: While the median response time is healthy, the tail latency (p99) suggests requests are being queued or stalled before reaching the application logic.
Database Calmness: Database logs show no long-running locks or high load during these periods.
I spent 2 days on this but found nothing on my side. I believe this is a Railway infrastructure problem.
Attachments
11 Replies
Status changed to Open Railway • 3 days ago
3 days ago
Hello,
Check your HTTP logs in Railway and compare totalDuration vs upstreamRqDuration on one of the slow requests. If totalDuration is high but upstreamRqDuration is low, the stall is happening inside Railway's proxy before the request ever reaches your app.
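For example, roughly something like this to scan an exported batch of HTTP logs (just a sketch; it assumes you export the logs as JSON lines, and the timestamp/path field names plus the 1-second threshold are placeholders):

```ts
// check-proxy-gap.ts -- rough sketch, assuming Railway HTTP logs exported as
// JSON lines with totalDuration / upstreamRqDuration fields in milliseconds.
import { readFileSync } from "node:fs";

const lines = readFileSync("http-logs.json", "utf8").split("\n").filter(Boolean);

for (const line of lines) {
  const entry = JSON.parse(line);
  const total = Number(entry.totalDuration ?? 0);
  const upstream = Number(entry.upstreamRqDuration ?? 0);
  // A large gap means the time was spent before the request reached your app
  // (edge/proxy); a small gap means the stall is inside the app itself.
  if (total - upstream > 1000) {
    console.log(
      `${entry.timestamp} ${entry.path}: total=${total}ms upstream=${upstream}ms gap=${total - upstream}ms`
    );
  }
}
```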
3 days ago
I am also seeing the same issue, and this started happening out of the blue.
Attachments
3 days ago
I have the same problem rn
3 days ago
I think there is some problem with their internal proxy
3 days ago
Yes, I can confirm: I connected my Node.js server to the database's public URL instead of the internal URL, and it is working fine now.
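For anyone else trying this, the swap is roughly the following with node-postgres (just a sketch; the DATABASE_PUBLIC_URL / DATABASE_URL variable names are assumptions, use whatever your own project exposes):

```ts
// sketch: point the pool at the public endpoint instead of the internal one.
import { Pool } from "pg";

const pool = new Pool({
  // Fall back to the internal URL if no public one is set.
  connectionString: process.env.DATABASE_PUBLIC_URL ?? process.env.DATABASE_URL,
  max: 50, // keep your existing pool size
  // Public endpoints typically need TLS; adjust to your setup.
  ssl: { rejectUnauthorized: false },
});
```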
3 days ago
My totalDuration = upstreamRqDuration, and as you can see in row #149 in the new screenshot, one request is randomly nuking the p99 at 37 seconds while everything else around it is flying at ~20ms.
It’s the same API and the same logic, so I can't see how this is a code bottleneck. I checked my cronjobs and queues to see if something was hogging the DB connections at that exact time, but there's nothing special going on.
I’m running 2 machines for the API and 1 for the worker/cron. I suspect there might be a DB connectivity issue or something at the infra level.
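To rule out the process itself stalling, I'm adding something like this event-loop lag monitor on the API machines (a rough sketch using Node's built-in perf_hooks; the 500ms threshold and 10s interval are arbitrary):

```ts
// sketch: log event-loop stalls so a 37s request can be correlated with a
// blocked event loop rather than a slow query.
import { monitorEventLoopDelay } from "node:perf_hooks";

const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

setInterval(() => {
  const maxMs = histogram.max / 1e6; // histogram values are in nanoseconds
  if (maxMs > 500) {
    console.warn(`[loop-lag] event loop blocked for up to ${maxMs.toFixed(0)}ms in the last 10s`);
  }
  histogram.reset();
}, 10_000);
```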
Attachments
3 days ago
Okay, so totalDuration = upstreamRqDuration confirms this is happening inside your app; Railway's proxy is clean.
Since the spike hits randomly on the same endpoints with no resource pressure, before assuming anything, can you check two things? First, look at your DB slow query logs at exactly 2026-05-10 16:06:35 and see if anything was running at that moment. Second, what DB and connection pool settings are you using?
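If you don't have slow query logging enabled, you could also poll pg_stat_activity around the spike window with something like this (rough sketch, assumes Postgres + node-postgres; the 1-second cutoff and 5-second interval are arbitrary):

```ts
// sketch: snapshot long-running / waiting queries so a spike can be matched
// to what the DB was doing at that exact second.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

setInterval(async () => {
  const { rows } = await pool.query(`
    SELECT pid, state, wait_event_type, wait_event,
           now() - query_start AS runtime, left(query, 80) AS query
    FROM pg_stat_activity
    WHERE state <> 'idle' AND now() - query_start > interval '1 second'
  `);
  if (rows.length) console.log(new Date().toISOString(), rows);
}, 5_000);
```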
2 days ago
Hey, thanks for the detailed breakdown. Connection pool was my first thought on this problem too, but after 2-3 days of deep research and tracking, I don't think that's the cause.
To address your points first:
Regions & Networking: Everything (DB, API, Workers) is in the Railway Singapore region using internal networking
Connection Pool: We tracked the exact timestamps of the spikes. As you can see in the screenshot, during a 60-second global API freeze (the 56892ms and 62925ms durations), there were literally only 2 active requests hitting the API. Our API connection pool has 50 slots and the DB allows 300, so it definitively wasn't pool starvation (see the pool-logging sketch at the end of this reply).
Architectural context:
I have 2 dedicated machines just for the API (zero third-party network calls happen here).
I have 1 separate machine for Cron jobs and Consumer logic (this handles third-party calls if needed).
I've verified multiple times that the Cron and Consumer do not lock the DB. This problem only showed up recently (we have been using this architecture for more than 6 months on Railway), and it paralyzes all API endpoints, even those that have absolutely nothing to do with the Cron jobs or Consumers.
I think this is the same problem here:
https://station.railway.com/questions/latency-spike-1b5123e7
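For completeness, this is roughly how we log the pool during the freezes to back that up (a sketch with node-postgres; totalCount/idleCount/waitingCount are the pool's built-in counters, and the timeout value is just what we picked):

```ts
// sketch: log pool stats every few seconds so a freeze can be matched against
// pool saturation (or ruled out, as in our case).
import { Pool } from "pg";

export const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 50,                        // matches our 50-slot pool
  connectionTimeoutMillis: 5_000, // fail fast instead of waiting forever for a slot
});

setInterval(() => {
  console.log(
    `[pool] total=${pool.totalCount} idle=${pool.idleCount} waiting=${pool.waitingCount}`
  );
}, 5_000);
```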
2 days ago
I forgot the screenshot
Attachments
2 days ago
2 hours ago
Hey Railway team, could someone please follow up on this issue? We’ve been seeing major API response time spikes lately, and it’s starting to affect our service reliability.
We’re a small startup with a few active clients, so building trust and having a reliable platform is really important for us right now.
If this issue can’t be resolved, we’ll have to move to another platform. I’ve also noticed that quite a few other users seem to be facing the same problem, but I haven’t seen any clear fix or updates yet.
Would really appreciate any help or update on this. Thanks!