Private networking latency spikes causing long request stalls
jbccc
PRO OP

3 months ago

Hi Railway team, we’re investigating severe latency/stalls that appear correlated with private networking tonight. This has a direct impact on customers and is severely damaging our brand.

Project / environment:

Project: aee72b99-ade1-4bd5-97b3-a391923eec04

Environment: fc002695-6c6e-4890-bd9c-2577b2407b2c

Server service: d0f41034-db0f-43ad-b897-41cc4c98795c

Chatbot service: 5ceef4ce-6f9e-49b5-9f8d-05f76e5b0980

Redis service: 5ea4a93f-98fa-4c83-9eac-920ef21140ab

Both app services connect to this Redis over Railway private networking.

What we observed:

Repeated high p99 response-time spikes (many requests ~25–30s), while p50 remains low.

In isolated traces, some requests remain open for very long durations before eventually returning success (not failing fast).

CPU and memory on services stay mostly flat during incidents (suggesting I/O wait/stall, not app saturation).

Pattern persisted for a large part of the night.
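
To illustrate how a low p50 can coexist with a 25–30s p99, here is a minimal sketch using made-up numbers (1,000 requests, 2% of them stalled at 27s). The sample data and the `percentile` helper are hypothetical and are not taken from the traces in this thread.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list (pct in 1..100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical data: 980 healthy requests at 40 ms, 20 stalled at 27 s.
latencies_ms = [40] * 980 + [27_000] * 20

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"p50={p50} ms, p99={p99} ms")
```

A small fraction of stalled requests is enough to dominate p99 while leaving p50 untouched, which is consistent with the flat CPU and memory described above.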

Why we suspect platform/network path:

The issue appears simultaneously across multiple services sharing the same private network path / Redis dependency.

No corresponding CPU/RAM pressure or app-level crash loops on our side.

Could you please check:

private networking latency/packet loss/jitter for this project+env during the impacted window,

any Redis internal networking issues (connection stalls, routing instability, degraded node),

whether there were incidents/deployments affecting private network performance in this region.

If useful, we can provide exact UTC timestamps and trace IDs for the worst spikes.

Thanks.

Screenshot_2026-02-27_at_01.54.19.png

Attachments

$30 Bounty

5 Replies

andreahlert
PRO

2 months ago

Hello JBCCC!

This looks like an internal infrastructure issue that only Railway's engineering team can investigate, since it involves private networking diagnostics, packet loss metrics, and Redis node health on their side.

I'd suggest reaching out directly to Railway support or opening a priority ticket rather than posting it as a public bounty, as this requires platform-level access and tooling that the community won't have visibility into. In the meantime, running redis-cli --latency-history from within your services and collecting SLOWLOG entries during the spike windows could give the team useful data points to speed up their investigation.
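
If running redis-cli inside the service container is awkward, a rough equivalent can be sketched in application code. This is only an illustration: `sample_latency` and `summarize` are hypothetical helper names, and `ping` can be any zero-argument callable that does one round trip to Redis.

```python
import time


def sample_latency(ping, samples=100, pause_s=0.0):
    """Call ping() repeatedly and return each round-trip time in ms."""
    rtts_ms = []
    for _ in range(samples):
        start = time.perf_counter()
        ping()
        rtts_ms.append((time.perf_counter() - start) * 1000)
        if pause_s:
            time.sleep(pause_s)
    return rtts_ms


def summarize(rtts_ms):
    """Min/max/avg summary, similar in spirit to redis-cli's latency output."""
    return {
        "min": min(rtts_ms),
        "max": max(rtts_ms),
        "avg": sum(rtts_ms) / len(rtts_ms),
        "samples": len(rtts_ms),
    }
```

Assuming redis-py is the client, something like `summarize(sample_latency(client.ping))` logged periodically during a spike window would show whether round trips to Redis themselves are stalling.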

Hope I was useful somehow.


brody

2 months ago

This was made a bounty because we have confirmed that the latency is not on our end, it is within the user's stack.



andreahlert
PRO

2 months ago

Thanks brody!


domehane
FREE

2 months ago

Hello jbccc,

since brody confirmed it's on your end and not their infra, the symptoms you described point directly to your redis client config, specifically missing or too-high timeouts and possibly connection pool issues. when p50 is fine but p99 hits 25-30s with cpu flat, it means most requests are fine but some are just sitting there waiting: not crashing, not hitting compute limits, just stalling. that's a client-side config problem, not a network problem.

the first thing to check is whether you have a command timeout set on your redis client. if you don't, a stalled connection will just hang for as long as the os allows, which matches exactly what you're seeing. set a command timeout to something like 3-5 seconds so it fails fast instead of stalling 30s.
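
As a sketch of what that could look like, assuming the client is redis-py (the thread does not say which library is in use): `socket_timeout` and `socket_connect_timeout` are redis-py connection options, while the helper names and the 3s/2s values are illustrative only.

```python
def redis_timeout_kwargs(command_timeout_s=3.0, connect_timeout_s=2.0):
    """Connection options that bound how long a single command may block."""
    return {
        # Max seconds to wait on a socket read/write for one command.
        "socket_timeout": command_timeout_s,
        # Max seconds to wait while establishing the TCP connection.
        "socket_connect_timeout": connect_timeout_s,
    }


def make_client(url):
    """Build a redis-py client with the timeouts above (assumes redis-py)."""
    import redis  # third-party; imported lazily so the helper above stays standalone

    return redis.Redis.from_url(url, **redis_timeout_kwargs())
```

With these set, a stalled command should raise a timeout error after a few seconds instead of holding the request open, so the app can retry or return an error quickly.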

second, check if socket keepalive is enabled on your redis client. without it, idle tcp connections can silently die at the network layer and your client won't know until it tries to use them, causing exactly this kind of stall.
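
For reference, this is roughly what enabling keepalive means at the socket level, using only the Python standard library. The helper name and the 60s/10s/3 values are assumptions for illustration; the fine-grained options are platform-specific, so they are applied only where the constants exist.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=3):
    """Turn on TCP keepalive so dead idle connections get detected."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Tuning knobs below are not available on every platform.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes before the connection is considered dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock
```

Most client libraries expose this as a config flag rather than raw socket calls; in redis-py, for example, the corresponding options are `socket_keepalive` and `socket_keepalive_options`.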

third, run "slowlog get 25" on your redis instance and check if any commands are taking a long time. redis is single-threaded so one slow command blocks everything behind it.
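
A small sketch for triaging the slowlog output programmatically. It assumes entries shaped like the dicts redis-py's `slowlog_get()` returns, with `duration` in microseconds and a `command` field; the helper name, threshold, and sample entries are made up for illustration.

```python
def slow_entries(entries, threshold_ms=100.0):
    """Return entries at or above threshold_ms, slowest first.

    Each entry is assumed to be a dict with 'duration' in microseconds.
    """
    slow = [e for e in entries if e["duration"] / 1000 >= threshold_ms]
    return sorted(slow, key=lambda e: e["duration"], reverse=True)


# Hypothetical entries, not real output from the Redis service in this thread.
sample = [
    {"id": 1, "duration": 1_200, "command": b"GET session:1"},
    {"id": 2, "duration": 850_000, "command": b"KEYS *"},
    {"id": 3, "duration": 120_000, "command": b"SMEMBERS big:set"},
]

for entry in slow_entries(sample):
    print(entry["id"], entry["duration"] / 1000, "ms", entry["command"])
```

If nothing in the slowlog gets anywhere near the stall duration, slow commands are unlikely to be the cause and the timeout and keepalive settings become the more likely suspects.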

so that's what the symptom profile you described maps to, based on how redis clients behave.

Hope this helps.



domehane
FREE

2 months ago

also, i have a question: what redis client library are you using? that'll confirm the exact config keys to look at.

