3 months ago
Hi Railway team, we’re investigating severe latency/stalls that appear correlated with private networking tonight. This has a direct impact on customers and is severely damaging our brand.
Project / environment:
Project: aee72b99-ade1-4bd5-97b3-a391923eec04
Environment: fc002695-6c6e-4890-bd9c-2577b2407b2c
Server service: d0f41034-db0f-43ad-b897-41cc4c98795c
Chatbot service: 5ceef4ce-6f9e-49b5-9f8d-05f76e5b0980
Redis service: 5ea4a93f-98fa-4c83-9eac-920ef21140ab
Both app services connect to this Redis over Railway private networking.
What we observed:
Repeated high p99 response-time spikes (many requests ~25–30s), while p50 remains low.
In isolated traces, some requests remain open for very long durations before eventually returning success (not failing fast).
CPU and memory on services stay mostly flat during incidents (suggesting I/O wait/stall, not app saturation).
Pattern persisted for a large part of the night.
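For anyone reading along, here is a minimal sketch of why a low p50 next to a very high p99 points to a small fraction of stalled requests rather than general slowness. The durations below are made-up illustrative values, not our real measurements, and `percentile` is a hypothetical helper using the nearest-rank method.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Nearest-rank: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical durations in seconds: 97 fast requests and 3 stalled ones.
durations = [0.05] * 97 + [27.0] * 3

p50 = percentile(durations, 50)  # dominated by the fast majority
p99 = percentile(durations, 99)  # dominated by the few stalled requests
print(f"p50={p50}s p99={p99}s")
```

With only 3% of requests stalling, the median stays at the fast value while p99 lands on the stalled value, which matches the shape we are seeing.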
Why we suspect the platform/network path:
The issue appears simultaneously across multiple services sharing the same private network path / Redis dependency.
No corresponding CPU/RAM pressure or app-level crash loops on our side.
Could you please check:
private networking latency/packet loss/jitter for this project+env during the impacted window,
any Redis internal networking issues (connection stalls, routing instability, degraded node),
whether there were incidents/deployments affecting private network performance in this region.
If useful, we can provide exact UTC timestamps and trace IDs for the worst spikes.
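We can also run a connect-time probe from inside each app service to help separate network stalls from app behaviour. Below is a rough sketch of what that could look like, not something we have deployed. It only times the TCP handshake to the Redis port (it does not issue Redis commands), and the function names `connect_latency` and `probe` are ours for illustration.

```python
import socket
import time

def connect_latency(host, port, timeout=5.0):
    """Return seconds taken to open a TCP connection, or None on failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return None

def probe(host, port, count=10, interval=1.0):
    """Collect `count` connect timings, sleeping `interval` seconds between them."""
    results = []
    for i in range(count):
        results.append(connect_latency(host, port))
        if i < count - 1:
            time.sleep(interval)
    return results
```

Usage would be something like `probe("redis.railway.internal", 6379)`, where the hostname is an assumption about our private DNS name and 6379 is the default Redis port. A `None` entry or a multi-second value in the results would line up with the stalls described above.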
Thanks.