6 days ago
Dear Railway Support Team,
I am writing to formally raise a critical concern regarding the reliability of Railway's infrastructure, which is directly and repeatedly impacting our production application and our end customers.
We operate a real-time tour guide platform that relies heavily on persistent WebSocket connections for live audio streaming, transcription, and translation between tour guides and tourist groups. The nature of our service means that any interruption, however brief, is immediately visible and disruptive to paying customers in the middle of a live experience.
Over recent weeks, we have been experiencing the following issues on a frequent and recurring basis:
- Unexpected service downtime. Our application becomes completely unreachable for periods of time, with no prior warning or communication from Railway. These outages force our customers to restart their sessions mid-tour, which is unacceptable in a live, time-sensitive environment.
- WebSocket connection drops during active sessions. We are observing repeated, unprovoked disconnections on active WebSocket connections. These drops occur during peak usage hours and cannot be attributed to client-side network issues, as they affect multiple concurrent users simultaneously and resolve themselves once Railway's network stabilises.
- Significant message delivery latency spikes. Even when connections remain technically open, we periodically experience severe latency spikes that delay the delivery of real-time audio and text data. In a live translation scenario, a delay of even a few seconds renders the service unusable.
The cumulative effect of these issues is serious. We have had multiple instances where tour groups — composed of paying customers who have purchased tickets specifically for our guided experience — have been left with a broken, degraded, or completely non-functional product mid-session. This has resulted in customer complaints, refund requests, and lasting damage to our reputation.
We want to be transparent: we have invested significant engineering effort into building resilience on our side, including Redis-backed Pub/Sub for multi-instance state sharing, reconnection logic, and health monitoring. Despite these measures, the root cause of the disruptions consistently traces back to Railway's network layer, not our application code.
We are asking Railway to:
- Investigate the network stability of our deployment region and identify any infrastructure-level issues affecting WebSocket keep-alive and long-lived connections.
- Provide transparency around any ongoing or historical incidents that may have contributed to the instability we are experiencing.
- Clarify what SLA guarantees, if any, apply to our current plan, and what options exist for a higher-reliability tier that can meet the demands of a customer-facing, real-time production service.
We value Railway as a platform and would prefer to resolve this collaboratively. However, if the current level of reliability cannot be improved, we will be forced to evaluate alternative infrastructure providers to protect our business and our customers.
We look forward to your prompt response.
6 Replies
2 days ago
To look into this further, we will need exact full UTC timestamps from your end.
2 days ago
On June 6, 2026, between 08:00 and 12:00 Athens time (05:00 to 09:00 UTC), we observed WebSocket disconnections at several points. In addition, between about 11:00 and 12:00 Athens time (08:00 to 09:00 UTC), the page experienced significant delays and loaded very slowly. Please note that this issue has occurred on other days as well, not only on this date.
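For anyone else who needs to report a window in UTC, the conversion above can be checked with a short script. This is only a sketch: the helper name athens_to_utc is made up for illustration, and it assumes Python 3.9+ with time zone data available on the machine.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Local zone of the reporter (Athens is UTC+3 in June, during summer time).
ATHENS = ZoneInfo("Europe/Athens")


def athens_to_utc(year, month, day, hour, minute=0):
    """Return the ISO 8601 UTC timestamp for an Athens wall-clock time."""
    local = datetime(year, month, day, hour, minute, tzinfo=ATHENS)
    return local.astimezone(timezone.utc).isoformat()


# The window reported in this thread: 08:00 to 12:00 Athens time on June 6, 2026.
window_start = athens_to_utc(2026, 6, 6, 8)
window_end = athens_to_utc(2026, 6, 6, 12)
print(window_start, window_end)
```

Logging disconnect events in UTC in the first place avoids the conversion step and makes it easier to line them up with the platform's telemetry.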
Status changed to Awaiting Railway Response Railway • 2 days ago
a day ago
The disconnections you are seeing come from our edge layer recycling long-lived connections, not from your application. WebSocket connections are not guaranteed to stay open indefinitely, and our edge may close and recycle them. When that happens, all affected clients are disconnected at the same time, which is why the drops hit multiple users at once and resolve on their own once connections re-establish. The most reliable approach is to keep your automatic reconnection logic, which you already have, and add redundancy by multiplexing data across multiple connections or by fronting your service with a Cloudflare tunnel so connections terminate there rather than at our edge.
On reliability tiers, the Pro plan does not include a Service Level Agreement, and service credits for downtime are only available on Enterprise plans with a contractual SLA.
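As a rough illustration of the reconnection logic mentioned above, here is a minimal sketch using exponential backoff with jitter. It is not the customer's actual code or anything Railway ships. The names backoff_delay and run_with_reconnect, the session callable, and all default values are assumptions. Jitter matters in this scenario because a recycle drops many clients at the same moment, and randomized delays keep them from all reconnecting in the same instant.

```python
import asyncio
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


async def run_with_reconnect(session, max_attempts=None, stable_after=60.0,
                             sleep=asyncio.sleep):
    """Run `session()` and reconnect whenever it raises ConnectionError.

    `session` is a hypothetical coroutine function that opens the WebSocket,
    streams until the tour ends (returning normally), and raises
    ConnectionError when the connection is dropped.
    Returns True if a session finished normally, False if attempts ran out.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        started = time.monotonic()
        try:
            await session()
            return True
        except ConnectionError:
            # A connection that stayed up for a while was healthy, so the
            # next retry starts again from the shortest delay.
            if time.monotonic() - started >= stable_after:
                attempt = 0
            await sleep(backoff_delay(attempt))
            attempt += 1
    return False
```

A caller would run something like asyncio.run(run_with_reconnect(stream_tour)), where stream_tour stands in for whatever coroutine owns the WebSocket session.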
Status changed to Awaiting User Response Railway • 1 day ago
17 hours ago
Hi Noahd, thank you for the transparent explanation regarding the edge layer connection recycling. Regarding the Cloudflare tunnel workaround, are there any specific guides or best practices you recommend for setting this up with Railway to ensure the WebSocket termination works smoothly for a high-traffic real-time application? Additionally, while this explains the simultaneous WebSocket drops, it doesn't fully address the overall service downtime and severe latency spikes (slow page loading) we experienced during the same window (e.g., between 08:00 and 09:00 UTC on June 6). Could you clarify if the edge layer recycling is also responsible for this broader performance degradation, or if there is another underlying infrastructure issue we should be aware of?
Status changed to Awaiting Railway Response Railway • about 17 hours ago
13 hours ago
On the tunnel question, terminating WebSocket connections at Cloudflare moves the keep-alive boundary to their network and takes our edge connection recycling out of that path, which is the mechanism behind the simultaneous drops you saw. The tunnel configuration itself, including timeouts and connection tuning for high-traffic real-time traffic, lives on the Cloudflare side, so their cloudflared and Tunnel documentation is the right reference for the specific settings.
On the June 6 window, we reviewed telemetry for your region between 08:00 and 09:00 UTC, and for the wider morning period. Synthetic uptime checks and edge error levels for your region held steady through that window, with no spike and no values above the surrounding hours, and there was no incident affecting your region on that date. We did not find a separate underlying issue during that time. The slow loading and the connection drops trace to the same connection recycling behavior already described, so the reconnection and redundancy measures, or the Cloudflare tunnel, are the path to a smoother experience.
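One possible reading of the redundancy measure, sketched under assumptions: the sender tags each message with an increasing sequence number and sends it over two or more connections, and the receiver keeps only the first copy. The class name DuplicateFilter and the window size are hypothetical, and the thread does not say how the multiplexing should actually be done.

```python
class DuplicateFilter:
    """Accept each sequence number once, whichever connection delivered it."""

    def __init__(self, window=1024):
        self.window = window   # how far behind the newest message we still track
        self.seen = set()      # sequence numbers already delivered
        self.highest = -1      # newest sequence number accepted so far

    def accept(self, seq):
        """Return True if `seq` is new and should be passed to the application."""
        if seq in self.seen or seq <= self.highest - self.window:
            return False  # duplicate, or too old to track reliably
        self.seen.add(seq)
        self.highest = max(self.highest, seq)
        # Prune occasionally so memory stays bounded on long sessions.
        if len(self.seen) > 2 * self.window:
            floor = self.highest - self.window
            self.seen = {s for s in self.seen if s > floor}
        return True


# The same messages arriving over two connections, slightly out of step.
dedup = DuplicateFilter(window=4)
arrivals = [0, 0, 1, 2, 1, 3]
delivered = [seq for seq in arrivals if dedup.accept(seq)]
print(delivered)  # [0, 1, 2, 3]
```

With this in place, a recycle that closes one connection leaves the stream intact as long as another connection is still delivering, at the cost of sending every message more than once.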
Status changed to Awaiting User Response Railway • about 13 hours ago
12 hours ago
Hi Noah, Thank you for the detailed follow-up and for taking the time to review the telemetry for the June 6 window. It is reassuring to know there wasn’t a wider infrastructure incident and that both the latency spikes and the simultaneous drops trace back to the same edge recycling behavior. This gives us a very clear understanding of the situation and a definitive path forward. We will dive into the Cloudflare documentation to set up the tunnel and move the keep-alive boundary to their network as suggested. We really appreciate your transparency and support in helping us troubleshoot this. Have a great day!
Status changed to Awaiting Railway Response Railway • about 12 hours ago
Status changed to Solved Railway • about 12 hours ago