Outbound networking failure - all external connections lost (March 15, 07:50-08:27 UTC)
Anonymous
PRO · OP

2 months ago

We experienced a complete outbound networking failure on our Railway services on March 15, 2026, from approximately 07:50 to 08:27 UTC. Our Primary service (1 replica) and Worker service (20 replicas) simultaneously lost the ability to reach two independent external services in two different AWS regions:

- Supabase PostgreSQL (eu-central-1, Frankfurt) via connection pooler on port 6543

- Redis Cloud (eu-west-1, Ireland) on port 13326

Connections to both endpoints failed with read ETIMEDOUT and connect ETIMEDOUT errors. Both external services were confirmed healthy during this period via their respective dashboards and logs: Supabase showed normal CPU (25%) and RAM (35%), and Postgres was accepting internal connections throughout. Redis Cloud reported no incidents.

After our workers crash-looped for ~37 minutes due to the connection failures, Railway's automatic restart mechanism stopped retrying (crash limit exceeded). Our services remained in a failed state from 08:27 until we manually restarted them at 13:58 UTC, for a total of ~6 hours of downtime.

Upon manual restart at 13:58, all connections succeeded immediately, confirming the networking issue had resolved itself hours earlier.

Questions:

1. Was there a networking incident affecting outbound connections from Railway around 07:50-08:27 UTC on March 15?

2. Were other customers affected?

3. What is Railway's crash loop restart limit, and is there a way to configure it to keep retrying indefinitely or for a longer period?

4. Is there a way to receive alerts when Railway stops restarting a service due to crash loop detection?

Project: n8n production deployment

Services affected: Primary (1 replica), Worker (20 replicas)

Solved

1 Reply

2 months ago

Your services were impacted by a known outbound TCP egress issue that affected multiple customers during that timeframe. Simultaneous ETIMEDOUT failures to independent external endpoints with no application changes on your side is consistent with the pattern we've been tracking. We sincerely apologize for the disruption, especially the extended downtime caused by the crash loop limit being reached. We've applied $100 in credits to your account for the impact.

Regarding restarts: since you're on the Pro plan, you can adjust your restart policy in Service Settings > Deploy. You can set the policy to "Always" and raise the max restart count well beyond the default of 10, which would help your services ride out transient networking issues like this one. More details here: https://docs.railway.com/deployments/restart-policy
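
As a complement to a more generous restart policy, the application can retry the connection itself instead of exiting on the first ETIMEDOUT. The following is a minimal TypeScript sketch, not anything Railway or n8n provides: the function names (isTransientNetworkError, backoffDelayMs, retryWithBackoff), the list of error codes, and the default delays and attempt count are all assumptions chosen for illustration.

```typescript
// Error codes treated as transient. This list is an assumption; adjust it
// to whatever your database and Redis clients actually surface.
const TRANSIENT_CODES = new Set(["ETIMEDOUT", "ECONNRESET", "ECONNREFUSED", "EAI_AGAIN"]);

function isTransientNetworkError(err: unknown): boolean {
  const code = (err as { code?: unknown } | null)?.code;
  return typeof code === "string" && TRANSIENT_CODES.has(code);
}

// Exponential backoff capped at maxMs: base, 2*base, 4*base, ...
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 60000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls connect() until it succeeds, a non-transient error occurs,
// or maxAttempts is reached. The last error is rethrown in the latter cases.
async function retryWithBackoff<T>(
  connect: () => Promise<T>,
  maxAttempts = 50,
  baseMs = 1000,
  maxMs = 60000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await connect();
    } catch (err) {
      if (!isTransientNetworkError(err) || attempt + 1 >= maxAttempts) throw err;
      // Full jitter: a random delay up to the backoff ceiling, so that many
      // replicas do not all reconnect at the same instant.
      await sleep(Math.random() * backoffDelayMs(attempt, baseMs, maxMs));
    }
  }
}
```

A caller would wrap its own connection step, for example retryWithBackoff(() => connectToDatabase()), where connectToDatabase is a hypothetical function standing in for whatever client setup the service already does. The jitter matters with 20 worker replicas: without it, every replica would retry on the same schedule.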

We don't currently have a built-in notification for when the crash loop limit is hit. For continuous monitoring, you can deploy something like Uptime Kuma to alert you when services become unresponsive.
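
If you want a signal that separates "one endpoint is down" from "outbound connectivity is broken", a small probe can open plain TCP connections to each external endpoint and compare the results. This is a rough TypeScript sketch under stated assumptions: probeTcp, classify, and checkOnce are hypothetical names, the hostnames are placeholders (only the ports 6543 and 13326 come from this thread), and how you deliver the alert is left out entirely.

```typescript
import * as net from "node:net";

type ProbeResult = { name: string; ok: boolean; error?: string };
type Verdict = "healthy" | "partial" | "likely-egress";

// Attempts a TCP connection and reports whether it completed within timeoutMs.
function probeTcp(name: string, host: string, port: number, timeoutMs = 5000): Promise<ProbeResult> {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });
    const finish = (result: ProbeResult) => {
      socket.destroy();
      resolve(result);
    };
    socket.setTimeout(timeoutMs);
    socket.once("connect", () => finish({ name, ok: true }));
    socket.once("timeout", () => finish({ name, ok: false, error: "ETIMEDOUT" }));
    socket.once("error", (err: Error & { code?: string }) =>
      finish({ name, ok: false, error: err.code ?? err.message }),
    );
  });
}

// All independent targets failing at once points at the caller's network path
// rather than at any single provider.
function classify(results: ProbeResult[]): Verdict {
  const failed = results.filter((r) => !r.ok);
  if (failed.length === 0) return "healthy";
  if (results.length > 1 && failed.length === results.length) return "likely-egress";
  return "partial";
}

// Placeholder hostnames; substitute your own pooler and Redis endpoints.
async function checkOnce(): Promise<Verdict> {
  const results = await Promise.all([
    probeTcp("postgres-pooler", "db.example.com", 6543),
    probeTcp("redis", "redis.example.com", 13326),
  ]);
  return classify(results);
}
```

Running checkOnce on a schedule from a separate small service and alerting on anything other than "healthy" would cover the gap described above. Note that a probe hosted on the same platform can lose its own outbound path during an egress incident, so the alert channel should not depend on it alone.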


Status changed to Awaiting User Response Railway 2 months ago


Railway
BOT

2 months ago

This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!

Status changed to Solved Railway about 2 months ago

