Edge dropping TCP connections — container healthy, dashboard says online (2+ hours)

jhamiora

HOBBYOP

2 months ago

TL;DR: Container is healthy, dashboard says "1/1 service online", port 8080 correctly mapped, but the edge is dropping TCP connections to my public URL. Down for 2+ hours. Not the build incident.

FACTS

Project: TitusPMApp (giving-no-fucks) / production

URL: https://titusmapp-production-3058.up.railway.app

Outage: ~19:15 UTC May 4, ongoing as of 21:10 UTC

Mapped port: 8080 (matches code + env var + container start log)

SYMPTOM

Incognito (no cache, no SW):

ERR_CONNECTION_CLOSED at TCP layer.

No HTTP response. Edge accepts TLS, then closes the connection.

Normal browser:

Stale service worker serves cached HTML.

Every live request 503s (API, static assets, all routes).
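
To check the symptom outside a browser, a staged probe can separate the TCP connect, the TLS handshake, and the first HTTP bytes. The following is a minimal sketch using only the Python standard library. The names `probe` and `classify` are illustrative and not part of any Railway tooling, and the 10-second timeout is an arbitrary assumption.

```python
import socket
import ssl


def classify(tcp_ok: bool, tls_ok: bool, http_bytes: bytes) -> str:
    """Turn the outcome of the three probe stages into a one-line diagnosis."""
    if not tcp_ok:
        return "tcp connect failed"
    if not tls_ok:
        return "tls handshake failed"
    if not http_bytes:
        return "connection closed after TLS, no HTTP response"
    # Only the status line is reported, e.g. "HTTP/1.1 503 Service Unavailable".
    status_line = http_bytes.split(b"\r\n", 1)[0].decode("latin-1")
    return f"http response: {status_line}"


def probe(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect, complete TLS, send one GET, and report how far the request got."""
    try:
        raw = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return classify(False, False, b"")

    tls_ok = False
    data = b""
    try:
        context = ssl.create_default_context()
        with context.wrap_socket(raw, server_hostname=host) as tls:
            tls_ok = True  # handshake finished without raising
            request = f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
            tls.sendall(request.encode("ascii"))
            data = tls.recv(4096)  # empty bytes means the peer closed the connection
    except OSError:
        # Covers resets, timeouts, and ssl.SSLError (a subclass of OSError).
        pass
    finally:
        raw.close()
    return classify(True, tls_ok, data)


# Example (performs a real network request):
# print(probe("titusmapp-production-3058.up.railway.app"))
```

A result of "connection closed after TLS, no HTTP response" would match the incognito behaviour described above, while a 503 would show up as an HTTP status line instead.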

RAILWAY SAYS HEALTHY

✓ Deployment ACTIVE / "Deployment successful"

✓ Dashboard: "1/1 service online"

✓ Container streaming logs in real time

✓ Networking: domain mapped to Port 8080 (correct)

TRIED, NO EFFECT

✗ Hard refresh / incognito

✗ Manual redeploy

✗ Port flip workaround (8080, 8081, back to 8080)

✗ PORT env var verified

WHY THIS IS NOT MY CODE

1. Connection closed at TCP, before any HTTP transaction. App code never sees the attempt.

2. Your own monitoring reports the container as healthy.

3. Same shape as community thread from 23 days ago: "Edge routing broken, container healthy but zero inbound traffic". Recurring bug class.

WHY THIS IS NOT THE BUILD INCIDENT

My build succeeded. Container is running. Status page incident is about build queue latency; this is live traffic to a running container being dropped.

ASK

Re-establish edge routing for this service. DO NOT regenerate the public domain. The URL is referenced externally and cannot change.

Real users (subcontractors, PMs) blocked from work. Happy to provide deployment ID, logs, network flow logs, just say what you need.

Josh

$10 Bounty

3 Replies

Status changed to Awaiting Railway Response Railway • 2 months ago

Status changed to Open Railway • 2 months ago

gyanavkhandelwal6396-cmyk

FREE

2 months ago

Because TLS succeeds and connections are then closed with no HTTP transaction, Railway likely needs to reattach or refresh the edge routing for titusmapp-production-3058.up.railway.app without changing the public domain, similar to prior incidents involving stale or broken edge→container routing state.

bts-style

FREE

2 months ago

Hi Josh, great diagnosis. Since the official support response is delayed and your internal container state is out of sync with the edge proxy routing table, you can try to force Railway's internal mesh network to rebuild the service endpoints without changing your public URL.

Here are two production-tested workarounds to trigger an edge cache/routing flush:

Method 1: The "Service Replication" Trick

1. Go to your Service Settings in the Railway dashboard.
2. Under the "Settings" or "Scaling" tab, change the Service Replication (number of instances) from 1 to 2.
3. Wait for the second instance to deploy and show as healthy. This forces the internal service mesh (Envoy/Proxy) to update its upstream endpoints and routing tables to include the new IP.
4. Once the edge starts routing traffic properly again, scale the replication back down to 1.

Method 2: Triggering Endpoint Recreation via Healthchecks

1. If you don't have an active Healthcheck path configured in your Service Settings, add one temporarily (e.g., pointing to your / root or a specific health endpoint).
2. If you already have one, temporarily change the path to a non-existent one to force a "Failing" state on the edge router, wait 2 minutes, and then change it back to the correct path.
3. This lifecycle state change (Healthy -> Unhealthy -> Healthy) often forces the Edge load balancer to purge the stale internal IP address and establish a fresh TCP upstream connection to your container.
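
If the app has no dedicated health endpoint for Method 2 to point at, one could look roughly like the sketch below. It is a standalone illustration, not Josh's application: the `/healthz` path and the names `route`, `Handler`, and `serve` are hypothetical, and the only details taken from the thread are the `PORT` env var and port 8080.

```python
import os
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

HEALTH_PATH = "/healthz"  # hypothetical path; use whatever the Healthcheck setting points at


def route(path: str) -> tuple[int, bytes]:
    """Return (status, body) for a request path, ignoring any query string."""
    if path.split("?", 1)[0] == HEALTH_PATH:
        return 200, b"ok\n"
    return 404, b"not found\n"


class Handler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status, body = route(self.path)
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve() -> None:
    """Listen on the port from the PORT env var (8080 if unset), on all interfaces."""
    port = int(os.environ.get("PORT", "8080"))
    ThreadingHTTPServer(("0.0.0.0", port), Handler).serve_forever()


# Call serve() from the container's start command to run it.
```

Pointing the Healthcheck at a path that returns 404 here (anything other than `/healthz`) and then back again would produce the Healthy -> Unhealthy -> Healthy cycle described in step 3. Whether that actually refreshes the edge state is the claim above, not something this sketch verifies.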

Let me know if the replication trick flushes the stale edge state for you!

bts-style

FREE

2 months ago

Case 1: Custom Domain 404 Error

Technical Analysis (Why the 404 error occurs)

The user is attempting to access a web application hosted on Railway. The application works fine via the default provider sub-domain (e.g., *.up.railway.app), but fails with a 404 error when accessed through a custom domain.

In this architecture, a 404 error typically occurs due to three main reasons:

- Incorrect DNS Records (CNAME/A): The custom domain is not properly pointed to Railway’s edge servers at the domain registrar level.
- Missing Configuration in Railway: While the DNS might point to Railway, the domain hasn't been explicitly added to the Networking / Custom Domains section in the Railway service dashboard, leaving the edge proxy unable to route the request to the correct project.
- DNS & SSL Propagation Delay: If the domain or region was changed recently, DNS changes and the generation of the automated SSL certificate (Let's Encrypt) can take up to 24–48 hours to fully propagate worldwide.

Proposed Response for the User (English)

The 404 error on your custom domain usually means that while Railway's infrastructure is ready, the routing between your domain registrar and Railway hasn't been fully established or recognized yet. 

Here is a step-by-step guide to fix this:

1. Check Railway Dashboard Configuration:
Go to your Service settings -> Networking -> Custom Domains. Make sure your domain is spelled correctly and exactly matches the one you are trying to access.

2. Verify DNS Records at Your Registrar:
Ensure that your DNS settings are properly configured. 
- For a subdomain (e.g., www.yourdomain.com), point a CNAME record to the target URL provided by Railway (usually something like proxy.up.railway.app).
- For a root domain (e.g., yourdomain.com), use an ALIAS or ANAME record pointing to the Railway proxy, or configure the correct A records if required.

3. SSL/TLS Propagation:
Since you mentioned changing regions or setups recently, DNS propagation and automated SSL generation (Let's Encrypt) can take anywhere from a few minutes up to 24 hours. If the DNS is correct, giving it a bit of time often resolves the 404.

4. Clear Browser/DNS Cache:
Your browser or local network might be caching the old 404 state. Try accessing the custom domain via an Incognito window, or clear your local DNS cache (on Windows, ipconfig /flushdns).
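
The DNS check in step 2 can also be done from a script. The sketch below uses only the Python standard library and assumes the Railway target name ends in `.up.railway.app`, as in the example above; the names `resolves_via` and `check_domain` are illustrative. The resolver may not expose the CNAME chain on every system, so treat a negative result as a hint rather than proof.

```python
import socket

RAILWAY_SUFFIX = ".up.railway.app"  # assumption based on the example target above


def resolves_via(names: list[str], suffix: str = RAILWAY_SUFFIX) -> bool:
    """True if any name in the resolution chain ends with the given suffix."""
    return any(name.rstrip(".").lower().endswith(suffix) for name in names)


def check_domain(domain: str) -> str:
    """Resolve a domain and report whether a Railway name appears in the chain."""
    try:
        canonical, aliases, addresses = socket.gethostbyname_ex(domain)
    except OSError as exc:  # gaierror and herror are both OSError subclasses
        return f"{domain}: does not resolve ({exc})"
    if resolves_via([canonical, *aliases]):
        return f"{domain}: resolves via {canonical} to {addresses}"
    return f"{domain}: resolves to {addresses}, but no {RAILWAY_SUFFIX} name in the chain"


# Example (performs a real DNS lookup):
# print(check_domain("www.yourdomain.com"))
```

A root domain using ALIAS/ANAME flattening, or a domain behind another proxy, will normally resolve straight to IP addresses, so the second message is expected there even when the setup is correct.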

Case 2: Stale Edge Routing (TCP Connection Closed)

Technical Analysis (The "Stale State" Bug)

The user (Josh) provides excellent diagnostics: Successful TLS -> ERR_CONNECTION_CLOSED -> No HTTP transaction logged by the app. This indicates the following network flow:

1. The request successfully hits Railway’s Edge Proxy (where the TLS handshake is completed).
2. The Edge Proxy attempts to forward the traffic to the internal container IP inside the service mesh.
3. The Core Issue: The internal routing table on the Edge Proxy is stale. It is trying to forward traffic to an old, non-existent internal IP address, or the internal network overlay is dropping the packets. The Edge proxy therefore abruptly closes the TCP connection before any HTTP data can be transmitted.

