Websockets repeatedly disconnected 10-11pm UTC

6 months ago

Every day from 10:03pm to 10:45pm UTC, all websockets connected to my project disconnect and reconnect. This occurs every 5 minutes during that window, then heals by itself. It has happened every day for the past 5 days, ever since I launched the project.

The application is an Elixir Phoenix server. It does not crash or restart during this time. It talks to a Postgres database, which sees increased load during the reconnects, but nothing that makes me think it is the cause. To my knowledge, I have no cron jobs that would cause this behavior. I can't find any logs explaining why this would occur. I previously used Cloudflare to proxy these websockets to a different Rust-based server running on a different hosting platform with no issue, so I'm confident Cloudflare is not at fault.

Does Railway perform any network maintenance at this time? Or is there anything you can see in your logs as to why this is occurring? I've spent days trying to find an issue in my application code and would like to eliminate Railway's networking as a potential cause before continuing.

Project ID: 2a582b81-1f25-4953-91bc-bcae9e4e8496

Solved

14 Replies

6 months ago

We do cycle out our HTTP proxies every 24 hours; this looks to be the cause of what you are seeing.


6 months ago

There are only 15,000-20,000 websocket connections, so based on this graph it looks like Railway is:

  1. Putting all those connections on a single proxy (or restarting all proxies at the same time)

  2. On reconnect, the connections get routed to another proxy that still needs to be cycled.

  3. This somehow happens 8 times in a row.

Even if Railway cycles HTTP proxies every 24 hours (which is perfectly reasonable), I still find the repeated spikes unusual and concerning. It would also be appreciated if the load were spread so that not all sockets are lost at the same time.


6 months ago

If this could be surfaced to the networking team it'd be greatly appreciated.


6 months ago

We have eight proxies in the US-West2 region, and when we restart them, they are all staggered one after the other.

All the open connections to your application would be spread out among those 8 proxies.

And that is what these spikes are, each individual proxy restart.
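
To make the staggered-restart explanation concrete, here is a minimal simulation sketch. It is not Railway's routing logic: it assumes connections start out spread uniformly at random over the proxies and that a dropped client reconnects to any proxy with equal probability, including proxies that have not been cycled yet. The function name and parameters are hypothetical.

```python
import random


def simulate_rolling_restart(num_connections=16_000, num_proxies=8, seed=0):
    """Simulate restarting proxies one after the other.

    Returns (drops_per_restart, disconnects_per_connection).
    """
    rng = random.Random(seed)
    # Assumption: connections start out spread uniformly at random.
    proxy_of = [rng.randrange(num_proxies) for _ in range(num_connections)]
    disconnects = [0] * num_connections
    drops_per_restart = []

    # Staggered restarts: proxy 0 first, then proxy 1, and so on.
    for restarting in range(num_proxies):
        dropped = [c for c, p in enumerate(proxy_of) if p == restarting]
        drops_per_restart.append(len(dropped))
        for c in dropped:
            disconnects[c] += 1
            # Assumption: the reconnect lands on any proxy with equal
            # probability, so it may hit one that is still due a restart.
            proxy_of[c] = rng.randrange(num_proxies)
    return drops_per_restart, disconnects


drops, per_conn = simulate_rolling_restart()
print("connections dropped at each restart:", drops)
print("max disconnects for a single connection:", max(per_conn))
```

Under these assumptions there is one spike per proxy restart, and later spikes tend to be larger than earlier ones because reconnecting clients keep landing on proxies that are still waiting to be cycled.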


Railway
BOT

6 months ago

Hello!

We've escalated your issue to our engineering team.

We aim to provide an update within 1 business day.

Please reply to this thread if you have any questions!

Status changed to Awaiting User Response Railway 6 months ago


6 months ago

I've raised this to Infra for additional comments.


6 months ago

Heard back from the networking team:

"On reconnect, the connections get routed to another proxy that still needs to be cycled."

This is one of the main issues. We also aren't draining the connections as gracefully as we should be. When draining, we drop all connections simultaneously, instead of allowing gradual connection closures with H2 GOAWAY.

So it turns out that our networking team was already painfully aware of the issue, but they are all tied up in projects they had already started and committed to, and those need to be completed before tackling this one. For right now, there isn't anything we can do in the short term, but this will get improvements down the line.
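
The difference between dropping every connection at once and draining gradually can be sketched as a simple close schedule. This is only an illustration of the idea, not how Railway's proxies or HTTP/2 GOAWAY actually work; `drain_schedule`, its parameters, and the batch sizes are hypothetical.

```python
def drain_schedule(connection_ids, window_seconds, batch_size):
    """Spread connection closes evenly over a drain window.

    Returns a list of (offset_seconds, batch_of_ids) pairs, where the
    offset is measured from the start of the drain.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if window_seconds < 0:
        raise ValueError("window_seconds must not be negative")

    # Split the connections into fixed-size batches.
    batches = [
        connection_ids[i:i + batch_size]
        for i in range(0, len(connection_ids), batch_size)
    ]
    if not batches:
        return []

    # Close one batch per step instead of everything at offset 0.
    step = window_seconds / len(batches)
    return [(i * step, batch) for i, batch in enumerate(batches)]


for offset, batch in drain_schedule(list(range(10)), window_seconds=60, batch_size=4):
    print(f"t+{offset:.0f}s: close {batch}")
```

With a schedule like this, clients reconnect in small groups across the window rather than all at the same moment.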


6 months ago

Thanks, I really appreciate the update. It's very unfortunate behavior but I'm glad it's on the radar, and now that I know the cause I can look into how to mitigate it at the app level too :)
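
One common app-level mitigation for this kind of mass disconnect is to have clients wait a randomized, exponentially growing delay before reconnecting, so that thousands of sockets do not hit the server and database in the same instant. The sketch below is a generic "full jitter" backoff, not something taken from this thread; `reconnect_delay` and its defaults are assumptions, and a real client would plug the value into whatever reconnect hook its websocket library provides.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=30.0, rng=random):
    """Return a randomized delay in seconds before reconnect attempt N.

    attempt is 0 for the first retry. The upper bound doubles with each
    attempt up to `cap`, and the actual delay is drawn uniformly from
    [0, bound] so that clients spread out instead of retrying in lockstep.
    """
    bound = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, bound)


rng = random.Random(42)  # seeded only to make the demo repeatable
for attempt in range(6):
    print(f"attempt {attempt}: wait {reconnect_delay(attempt, rng=rng):.2f}s")
```

The trade-off is a slightly longer reconnect time for each client in exchange for a flatter load spike when a proxy is cycled.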


6 months ago

You aren't the only one reporting WebSocket connection closures either, so this will likely be next up after we bring IPv4 support to the private network.


Railway
BOT

6 months ago

✅ The ticket Websocket traffic disruptions has been marked as completed.


6 months ago

^ Ticket for the escalation to infra, not the ticket for the underlying issue.


Railway
BOT

6 months ago

🛠️ The ticket Websocket traffic disruptions has been marked as backlog.


Railway
BOT

5 months ago

This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!

Status changed to Solved Railway 5 months ago


Railway
BOT

3 months ago

🛠️ The ticket Websocket traffic disruptions has been marked as triage.


Railway
BOT

2 months ago

🛠️ The ticket Websocket traffic disruptions has been marked as backlog.


Railway
BOT

a month ago

❌ The ticket Websocket traffic disruptions has been marked as canceled.

