3 days ago
Hi Railway Team,
I am experiencing severe, intermittent latency spikes on my API service that do not correlate with resource exhaustion. My p99 response times are reaching 10s to 30s, while the p50 remains stable and low.
Observed Behavior:
Low Resource Usage: Both API and Database metrics show CPU and RAM usage at less than 5% during the spikes.
Randomness: The spikes are not isolated to a specific endpoint; they occur across the entire API surface.
p99 vs p50 Divergence: While the median response time is healthy, the tail latency (p99) suggests requests are being queued or stalled before reaching the application logic.
Database Calmness: Database logs show no long-running locks or high load during these periods.
I spent 2 days on this but found nothing on my side. I believe this is a Railway infrastructure problem.
Attachments
11 Replies
Status changed to Open Railway • 3 days ago
3 days ago
Hello,
Check your HTTP logs in Railway and compare totalDuration vs upstreamRqDuration on one of the slow requests. If totalDuration is high but upstreamRqDuration is low, the stall is happening inside Railway's proxy before the request ever reaches your app.
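For example, roughly something like this to scan an exported batch of HTTP logs (just a sketch; it assumes you export the logs as JSON lines, and the timestamp/path field names plus the 1-second threshold are placeholders):

```ts
// check-proxy-gap.ts -- rough sketch, assuming Railway HTTP logs exported as
// JSON lines with totalDuration / upstreamRqDuration fields in milliseconds.
import { readFileSync } from "node:fs";

const lines = readFileSync("http-logs.json", "utf8").split("\n").filter(Boolean);

for (const line of lines) {
  const entry = JSON.parse(line);
  const total = Number(entry.totalDuration ?? 0);
  const upstream = Number(entry.upstreamRqDuration ?? 0);
  // A large gap means the time was spent before the request reached your app
  // (edge/proxy); a small gap means the stall is inside the app itself.
  if (total - upstream > 1000) {
    console.log(
      `${entry.timestamp} ${entry.path}: total=${total}ms upstream=${upstream}ms gap=${total - upstream}ms`
    );
  }
}
```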
3 days ago
I am also seeing the same issue, and this started happening out of the blue.
Attachments
3 days ago
I have the same problem rn
3 days ago
I think there is some problem with their internal proxy
3 days ago
Yes, I can confirm: I connected my Node.js server to the database's public URL instead of the internal URL, and it is working fine now.
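For anyone else trying this, the swap is roughly the following with node-postgres (just a sketch; the DATABASE_PUBLIC_URL / DATABASE_URL variable names are assumptions, use whatever your own project exposes):

```ts
// sketch: point the pool at the public endpoint instead of the internal one.
import { Pool } from "pg";

const pool = new Pool({
  // Fall back to the internal URL if no public one is set.
  connectionString: process.env.DATABASE_PUBLIC_URL ?? process.env.DATABASE_URL,
  max: 50, // keep your existing pool size
  // Public endpoints typically need TLS; adjust to your setup.
  ssl: { rejectUnauthorized: false },
});
```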
3 days ago
My totalDuration = upstreamRqDuration, and as you can see in row #149 in the new screenshot, one request is randomly nuking the p99 at 37 seconds while everything else around it is flying at ~20ms.
It’s the same API and the same logic, so I can't see how this is a code bottleneck. I checked my cronjobs and queues to see if something was hogging the DB connections at that exact time, but there's nothing special going on.
I’m running 2 machines for the API and 1 for the worker/cron. I suspect there might be a DB connectivity issue or something at the infra level.
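To rule out the process itself stalling, I'm adding something like this event-loop lag monitor on the API machines (a rough sketch using Node's built-in perf_hooks; the 500ms threshold and 10s interval are arbitrary):

```ts
// sketch: log event-loop stalls so a 37s request can be correlated with a
// blocked event loop rather than a slow query.
import { monitorEventLoopDelay } from "node:perf_hooks";

const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

setInterval(() => {
  const maxMs = histogram.max / 1e6; // histogram values are in nanoseconds
  if (maxMs > 500) {
    console.warn(`[loop-lag] event loop blocked for up to ${maxMs.toFixed(0)}ms in the last 10s`);
  }
  histogram.reset();
}, 10_000);
```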
Attachments
3 days ago
Okay, so totalDuration = upstreamRqDuration confirms this is happening inside your app; Railway's proxy is clean.
Since the spike hits randomly on the same endpoints with no resource pressure, before assuming anything, can you check two things? First, look at your DB slow query logs at exactly 2026-05-10 16:06:35 and see if anything was running at that moment. Second, what DB and connection pool settings are you using?
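If you don't have slow query logging enabled, you could also poll pg_stat_activity around the spike window with something like this (rough sketch, assumes Postgres + node-postgres; the 1-second cutoff and 5-second interval are arbitrary):

```ts
// sketch: snapshot long-running / waiting queries so a spike can be matched
// to what the DB was doing at that exact second.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

setInterval(async () => {
  const { rows } = await pool.query(`
    SELECT pid, state, wait_event_type, wait_event,
           now() - query_start AS runtime, left(query, 80) AS query
    FROM pg_stat_activity
    WHERE state <> 'idle' AND now() - query_start > interval '1 second'
  `);
  if (rows.length) console.log(new Date().toISOString(), rows);
}, 5_000);
```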
2 days ago
Hey, thanks for the detailed breakdown. Connection pool was my first thought on this problem too, but after 2-3 days of deep research and tracking, I don't think that's the cause.
To address your points first:
Regions & Networking: Everything (DB, API, Workers) is in the Railway Singapore region using internal networking
Connection Pool: We tracked the exact timestamps of the spikes. As you can see in the screenshot, during a 60-second global API freeze (the 56892ms and 62925ms durations), there were literally only 2 active requests hitting the API. Our API connection pool has 50 slots and the DB allows 300, so it definitively wasn't pool starvation (see the pool-logging sketch at the end of this reply).
Architectural context:
I have 2 dedicated machines just for the API (zero third-party network calls happen here).
I have 1 separate machine for Cron jobs and Consumer logic (this handles third-party calls if needed).
I've verified multiple times that the Cron and Consumer do not lock the DB. This problem only showed up recently (we have been using this architecture for more than 6 months on Railway), and it paralyzes all API endpoints, even those that have absolutely nothing to do with the Cron jobs or Consumers.
I think this is the same problem here:
https://station.railway.com/questions/latency-spike-1b5123e7
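For completeness, this is roughly how we log the pool during the freezes to back that up (a sketch with node-postgres; totalCount/idleCount/waitingCount are the pool's built-in counters, and the timeout value is just what we picked):

```ts
// sketch: log pool stats every few seconds so a freeze can be matched against
// pool saturation (or ruled out, as in our case).
import { Pool } from "pg";

export const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 50,                        // matches our 50-slot pool
  connectionTimeoutMillis: 5_000, // fail fast instead of waiting forever for a slot
});

setInterval(() => {
  console.log(
    `[pool] total=${pool.totalCount} idle=${pool.idleCount} waiting=${pool.waitingCount}`
  );
}, 5_000);
```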
2 days ago
I forgot the screenshot
Attachments
2 days ago
2 hours ago
Hey Railway team, could someone please follow up on this issue? We’ve been seeing major API response time spikes lately, and it’s starting to affect our service reliability.
We’re a small startup with a few active clients, so building trust and having a reliable platform is really important for us right now.
If this issue can’t be resolved, we’ll have to move to another platform. I’ve also noticed that quite a few other users seem to be facing the same problem, but I haven’t seen any clear fix or updates yet.
Would really appreciate any help or update on this. Thanks!