code=000 + 20s timeout (connect/TLS ok, no HTTP response body)

alpacco

HOBBY · OP

2 months ago

Hi Railway,

We’re seeing intermittent availability issues on a Railway-hosted service.

DNS resolves normally
TCP/TLS handshake succeeds
But HTTP response body is not returned
Requests end with timeout (code=000) after ~20s

Observed on multiple endpoints (including / and API routes), from different clients/networks, including an external environment.

Example symptom:

curl --http1.1 --max-time 20 ...
timeout after ~20s, no response bytes

Could you check for any current edge/routing/runtime incident that could cause stalled upstream responses after successful TLS/connect?

We can share exact UTC timestamps and host details if needed.

Thanks.

$10 Bounty

7 Replies

Status changed to Open Railway • 2 months ago

alpacco

HOBBY · OP

2 months ago

Hi team, we saw your status update that this incident was resolved, but the issue is still present for us.

We re-tested just now (HTTP/1.1, max-time 15s) and requests still time out with no response body:

/api/config → code=000, timeout after ~15s, 0 bytes received
/api/auth/avatar-presets → code=000, timeout after ~15s, 0 bytes received
/api/auth/car-presets → code=000, timeout after ~15s, 0 bytes received

So even after the “resolved” update, upstream responses are still not being returned on our side.

alpacco

HOBBY · OP

2 months ago

Same issue is happening again now.

Pattern is unchanged: DNS OK, TCP/TLS OK, but no HTTP response (timeouts with code=000 after ~20–35s) on / and API endpoints.

This is a recurrence of the same outage pattern from yesterday, not a one-off client/network issue.

Please check for recurring platform-side instability (edge→service routing / stalled upstream / runtime networking) and correlate with current timestamps.

alpacco

HOBBY · OP

2 months ago

Update: recurring outage for 3 days in a row

Hi Railway team,

This is now happening for the 3rd day in a row.

Same pattern every day:

DNS resolves correctly
TCP to 443 succeeds
but HTTP requests stall and time out (HTTP 000, no first byte)

Just reproduced again on multiple endpoints:

/
/api/maintenance/status
/api/auth/car-presets

All timed out (~15s) with no response body, while DNS/TCP were still healthy.

From app/runtime logs we also see client-side timeouts and disconnects during these windows, but this appears to be a platform/runtime availability issue, not a single bad request.

Can you please investigate this as a recurring stability incident (not a one-off), and check:

edge -> service routing during incident windows,
runtime/container health and restarts,
internal networking / upstream stalls,
any known regional degradation.

Please advise what concrete mitigation we can apply to avoid daily downtime.

alpacco

HOBBY · OP

2 months ago

Again?

gyanavkhandelwal6396-cmyk

FREE

2 months ago

The repeated pattern — successful DNS resolution and TLS handshake followed by stalled requests with HTTP 000 and no first byte — strongly suggests an intermittent Railway edge→container routing or upstream networking issue rather than an application-level failure.

Because this has recurred across multiple days, endpoints, and networks with identical symptoms, Railway should investigate persistent runtime/edge instability or stalled upstream connections during the incident windows. Recommended mitigations on your side include multiple replicas, shorter keep-alive/request timeouts, proactive container restarts, and regional failover if available.

alpacco

HOBBY · OP

2 months ago

again!

gyanavkhandelwal6396-cmyk

FREE

2 months ago

Looking at your thread, this is a classic TCP-connected-but-HTTP-stalled pattern. Here's a breakdown of likely causes and concrete mitigations you can apply on your end, since Railway's "resolved" status clearly isn't matching your reality.

What's Actually Happening

The symptom (TLS OK → no first byte → timeout) rules out the client→edge hop and points at what happens behind the edge:

Client → Railway Edge (Proxy) → Your Container
       ✅ TLS handshake        ❓ upstream stall here

Railway's edge accepts your connection, but the request never gets a response from your container — either because the container is silently stuck, or the edge→service routing is dropping/stalling the forward.

Diagnostics to Run Right Now

Check container restart frequency

In Railway dashboard → your service → Deployments tab. Look for:

Frequent restarts (OOM kills, crash loops)

Gaps in uptime that align with your timeout windows

Add a dead-simple health probe to your app

If you don't have one, add a /health route that returns 200 OK with zero logic. Then use Railway's built-in healthcheck config:

```toml
# railway.toml
[deploy]
healthcheckPath = "/health"
healthcheckTimeout = 10
restartPolicyType = "ON_FAILURE"
restartPolicyMaxRetries = 5
```

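One caveat: according to Railway's documentation, the healthcheck is only evaluated while a new deployment is starting; it is not a continuous liveness probe. The restart policy acts when the process exits with an error. So this setup keeps a broken deploy from going live and restarts a crashed container, but it will not by itself restart a container that is still running, accepting TCP, and no longer processing HTTP.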
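For reference, a zero-logic /health route can be as small as the sketch below. It uses only Node's built-in `http` module; the names `handleRequest` and `startServer` and the fallback port 3000 are illustrative assumptions, not anything taken from your app or from Railway.

```javascript
const http = require("node:http");

// Hypothetical handler: answers GET /health with a constant body and
// touches no database, cache, or other dependency.
function handleRequest(req, res) {
  if (req.method === "GET" && req.url === "/health") {
    res.writeHead(200, { "Content-Type": "text/plain" });
    res.end("ok");
    return;
  }
  res.writeHead(404, { "Content-Type": "text/plain" });
  res.end("not found");
}

// Starts the server on the port the platform provides via PORT
// (assumed fallback: 3000 for local runs).
function startServer(port = Number(process.env.PORT) || 3000) {
  return http.createServer(handleRequest).listen(port);
}
```

Call `startServer()` from your entry file. If this route keeps answering while your real routes stall, the problem is more likely in the app (blocked event loop, exhausted connection pool) than in the edge.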

Capture exact UTC timestamps during next incident

```bash
# Run this during an outage window
while true; do
  RESULT=$(curl -o /dev/null -s -w "%{http_code} | time_connect:%{time_connect} | time_starttransfer:%{time_starttransfer} | total:%{time_total}\n" \
    --http1.1 --max-time 20 https://YOUR_DOMAIN/health)
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) | $RESULT"
  sleep 5
done
```

The time_starttransfer field is key — if it's 0.000 at timeout, the stall is pre-first-byte (proxy or container), not slow response.
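If you collect many of these lines, a small script can sort them for you. The sketch below assumes the exact line format printed by the curl loop; `parseProbeLine`, `classifyProbe`, and the category labels are hypothetical names chosen for illustration.

```javascript
// Parses one log line such as:
// "2025-01-01T00:00:00Z | 000 | time_connect:0.045 | time_starttransfer:0.000 | total:20.001"
// Returns null when the line does not match the expected format.
function parseProbeLine(line) {
  const match = line.match(
    /(\d{3}) \| time_connect:([\d.]+) \| time_starttransfer:([\d.]+) \| total:([\d.]+)/
  );
  if (!match) return null;
  return {
    httpCode: match[1],
    timeConnect: Number(match[2]),
    timeStartTransfer: Number(match[3]),
    total: Number(match[4]),
  };
}

// Applies the rule of thumb from the text: code 000 with a completed
// connect but no first byte means the stall is before the response starts.
function classifyProbe(probe) {
  if (probe.httpCode !== "000") return "responded";
  if (probe.timeConnect === 0) return "connect-failed";
  if (probe.timeStartTransfer === 0) return "stalled-before-first-byte";
  return "other";
}
```

Lines classified as `stalled-before-first-byte` are the ones worth sending to Railway along with their UTC timestamps.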

Mitigations You Can Apply Unilaterally

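Fix 1: Set explicit keep-alive timeouts on your server

If your app uses Node/Express, the server closes idle keep-alive connections after 5 seconds by default. If that is shorter than the proxy's idle timeout, the proxy can try to reuse a connection the server has already closed, and the request stalls or fails:

```js
// Express / Node HTTP server
const express = require("express");
const app = express();
const server = app.listen(process.env.PORT || 3000);
server.keepAliveTimeout = 65000; // should exceed the proxy's idle timeout (assumed 60s here; verify for your setup)
server.headersTimeout = 66000; // must be > keepAliveTimeout
```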

For other runtimes:

| Runtime | Setting |
| --- | --- |
| Python/Gunicorn | --timeout 30 --keep-alive 65 |
| Go | Server.ReadTimeout, WriteTimeout = 30s |
| Java/Spring | server.tomcat.connection-timeout=30s |

Fix 2 — Add a Railway restart trigger

Railway can run a service on a cron schedule. If you're hitting a daily stall pattern, a scheduled restart shortly before your typical incident window is a blunt but workable stopgap: add a small cron service that triggers a restart or redeploy of the affected service (for example through Railway's public API), and keep it only while you investigate the root cause.

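Fix 3: Set an explicit memory limit and watch for memory pressure

Memory exhaustion can produce your symptoms: a process close to its limit may keep accepting TCP connections while no longer serving requests, and an OOM kill followed by a slow restart leaves a window with no responses. In the Railway dashboard:

Set an explicit memory limit (e.g., 512 MB), if your plan exposes that setting, so an out-of-memory condition kills the process cleanly and triggers a restart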

Watch Metrics tab during next incident — CPU/memory spike = container-side cause; flat = edge/routing cause
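To line memory usage up against the outage timestamps without relying only on the dashboard, the app can log its own memory at a fixed interval. This is a minimal sketch for a Node process; `formatMemorySample`, `startMemoryLogging`, and the 30-second interval are assumptions for illustration.

```javascript
// Hypothetical helper: formats one memory sample as a UTC-stamped log line.
function formatMemorySample(usage, now) {
  const toMB = (bytes) => (bytes / (1024 * 1024)).toFixed(1);
  return `${now.toISOString()} rss=${toMB(usage.rss)}MB heapUsed=${toMB(usage.heapUsed)}MB`;
}

// Logs a sample every intervalMs milliseconds.
// unref() keeps the timer from holding the process open on shutdown.
function startMemoryLogging(intervalMs = 30000) {
  const timer = setInterval(() => {
    console.log(formatMemorySample(process.memoryUsage(), new Date()));
  }, intervalMs);
  timer.unref();
  return timer;
}
```

If the log lines stop entirely during an incident window, that in itself suggests the process was stuck or dead rather than the edge dropping traffic.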

Fix 4: Use Railway's TCP proxy for non-HTTP services

If any of your services speak a non-HTTP protocol (raw TCP, such as a database or a custom binary protocol), expose them through Railway's TCP proxy rather than an HTTP domain, because the HTTP proxy expects HTTP traffic. WebSockets run over HTTP and do not need this.

Escalating Effectively to Railway

Since their support thread isn't getting traction, file a follow-up in this format. It pushes them to look at infrastructure logs rather than dismiss it as a client issue:

Subject: Recurring P1 — upstream stall post-TLS, 3+ days, not resolved by incident closure

Service: [your Railway service ID]

Project: [project ID]

Region: [us-west1 / eu-west etc.]

Incident windows (UTC):

Day 1: YYYY-MM-DD HH:MM – HH:MM UTC

Day 2: YYYY-MM-DD HH:MM – HH:MM UTC

Day 3: YYYY-MM-DD HH:MM – HH:MM UTC

Request: Correlate edge proxy logs for our service during these windows.

Specifically: was the upstream (container) being marked healthy? Were requests being forwarded?

curl -v --http1.1 --max-time 20 https://[domain]/health → HTTP 000, time_starttransfer=0

The service ID and exact UTC windows are what Railway's infra team needs to pull edge logs. Without those, support will keep closing tickets prematurely.

This is what my consultant AI says.
