3 months ago
Hello,
We run a critical production service on Railway and we’re seeing a worrying pattern recently.
Today we had another incident: our service was unreachable for over an hour, and all public URLs returned 502. It started suddenly, with no deploy or config change on our side beforehand.
A redeploy immediately fixed it, and the service is stable again for now.
What’s concerning is that similar outages have happened in the recent past, and each time the business impact is significant.
Has anyone else experienced the same kind of 502 outage recently?
Railway team: can you confirm whether there were any platform incidents that could explain this, and what the most likely root causes are when a service only recovers after a redeploy? We’d also appreciate concrete mitigation steps to prevent recurrence.
If we can’t get a clear RCA and a reliable prevention plan, we’ll have to seriously consider moving to another provider.
Thanks
2 Replies
3 months ago
Hi,
First, we just wanted to let you know we take reliability very seriously. We've made a lot of changes over the past month to ensure our platform is ahead of all the changes that are happening industry-wide. We know it resulted in disruptions over the past month and for that we are sorry.
As to your specific incident here, we looked into the HTTP logs for this incident. Before the full outage, one of your two instances was already unreachable around 15:45 UTC, and response times on the remaining instance were elevated (avg 3.3s, peaks at 8-10s).
By 15:50 UTC both instances stopped accepting connections, resulting in 502s until your redeploy at 17:25 UTC. You can verify this by checking your CPU metrics by replica - you'll see only one replica was doing work during that window. There was zero proxy overhead on our side throughout - the timeouts were entirely from your instances not accepting TCP connections.
Our network flow logs show your service connects to Supabase via port 6543 (Supavisor pooler). The pattern of instances dying sequentially under elevated response times is consistent with the app becoming blocked on database connections. Your deployment logs also show increasing latency on Supabase Auth requests in the minutes before the outage, with gaps between auth processing calls growing from ~37ms up to ~4s. You can check these in your deploy logs for that window.
We'd recommend checking your Supabase dashboard for connection pool exhaustion or latency around 15:45-16:05 UTC on March 5, and adding connection timeouts on your DB client so it fails fast instead of hanging the entire process.
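To illustrate the fail-fast idea, here is a minimal sketch in Python, assuming an asyncio-based app. The names `with_timeout`, `DatabaseTimeout`, and `DB_TIMEOUT_SECONDS`, the 5-second budget, and the two stand-in queries are all hypothetical and not part of any Supabase or Railway API. Your own stack and DB client may expose built-in timeout settings that are a better fit.

```python
import asyncio

# Assumed time budget for a single database call; tune to your workload.
DB_TIMEOUT_SECONDS = 5.0


class DatabaseTimeout(Exception):
    """Raised when a database call exceeds its time budget."""


async def with_timeout(operation, timeout=DB_TIMEOUT_SECONDS):
    """Await a zero-argument coroutine function, failing fast if it hangs.

    `operation` stands in for whatever your DB client exposes,
    e.g. a function that acquires a pooled connection and runs a query.
    """
    try:
        return await asyncio.wait_for(operation(), timeout=timeout)
    except asyncio.TimeoutError:
        # Surface a clear error instead of blocking the request handler.
        raise DatabaseTimeout(f"database call exceeded {timeout}s") from None


async def fast_query():
    # Hypothetical stand-in for a healthy query.
    return "rows"


async def slow_query():
    # Hypothetical stand-in for a query stuck waiting on an exhausted pool.
    await asyncio.sleep(10)
    return "rows"
```

With a wrapper like this, a stuck call raises `DatabaseTimeout` after the budget elapses, so the handler can return an error and the process keeps accepting connections rather than hanging.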
Let us know if you need further help, and again, thanks for choosing Railway.
Sam
Status changed to Awaiting User Response Railway • 3 months ago
3 months ago
Hello Sam,
Thanks for the quick response, appreciate it.
This service is critical for us, and overall platform stability matters a lot. The recent disruptions have been pretty painful on our side, so we’re really counting on Railway’s reliability improving and staying consistently strong going forward.
Thanks again.
Status changed to Awaiting Railway Response Railway • 3 months ago
Status changed to Solved sam-a • 3 months ago