16 days ago
Hi Railway,
We’re seeing intermittent availability issues on a Railway-hosted service.
- DNS resolves normally
- TCP/TLS handshake succeeds
- But HTTP response body is not returned
- Requests end with timeout (code=000) after ~20s
Observed on multiple endpoints (including / and API routes), from different clients/networks, including an external environment.
Example symptom:
curl --http1.1 --max-time 20 ... → timeout after ~20s, no response bytes
Could you check for any current edge/routing/runtime incident that could cause stalled upstream responses after successful TLS/connect?
We can share exact UTC timestamps and host details if needed.
Thanks.
7 Replies
Status changed to Open Railway • 16 days ago
16 days ago
Hi team, we saw your status update that this incident was resolved, but the issue is still present for us.
We re-tested just now (HTTP/1.1, max-time 15s) and requests still time out with no response body:
- /api/config → code=000, timeout after ~15s, 0 bytes received
- /api/auth/avatar-presets → code=000, timeout after ~15s, 0 bytes received
- /api/auth/car-presets → code=000, timeout after ~15s, 0 bytes received
So even after the “resolved” update, upstream responses are still not being returned on our side.
15 days ago
Same issue is happening again now.
Pattern is unchanged: DNS OK, TCP/TLS OK, but no HTTP response (timeouts with code=000 after ~20–35s) on / and API endpoints.
This is a recurrence of the same outage pattern from yesterday, not a one-off client/network issue.
Please check for recurring platform-side instability (edge→service routing / stalled upstream / runtime networking) and correlate with current timestamps.
14 days ago
Update: recurring outage for 3 days in a row
Hi Railway team,
this is now happening for the 3rd day in a row.
Same pattern every day:
- DNS resolves correctly
- TCP to 443 succeeds
- but HTTP requests stall and time out (HTTP 000, no first byte)
Just reproduced again on multiple endpoints:
- /
- /api/maintenance/status
- /api/auth/car-presets
All timed out (~15s) with no response body, while DNS/TCP were still healthy.
From app/runtime logs we also see client-side timeouts and disconnects during these windows, but this appears to be a platform/runtime availability issue, not a single bad request.
Can you please investigate this as a recurring stability incident (not a one-off), and check:
- edge -> service routing during incident windows,
- runtime/container health and restarts,
- internal networking / upstream stalls,
- any known regional degradation.
Please advise what concrete mitigation we can apply to avoid daily downtime.
10 days ago
again ?
4 days ago
The repeated pattern — successful DNS resolution and TLS handshake followed by stalled requests with HTTP 000 and no first byte — strongly suggests an intermittent Railway edge→container routing or upstream networking issue rather than an application-level failure.
Because this has recurred across multiple days, endpoints, and networks with identical symptoms, Railway should investigate persistent runtime/edge instability or stalled upstream connections during the incident windows; recommended mitigations include multiple replicas, shorter keep-alive/request timeouts, proactive container restarts, and regional failover if available.
4 days ago
again!
4 days ago
Looking at your thread, this is a classic TCP-connected-but-HTTP-stalled pattern. Here's a breakdown of likely causes and concrete mitigations you can apply on your end, since Railway's "resolved" status clearly isn't matching your reality.
What's Actually Happening
The symptom (TLS OK → no first byte → timeout) narrows it to one of three layers:
Client → Railway Edge (Proxy) → Your Container
         ✅ TLS handshake        ❓ upstream stall here
Railway's edge accepts your connection, but the request never gets a response from your container, either because the container is silently stuck or because the edge→service routing is dropping or stalling the forward.
Diagnostics to Run Right Now
- Check container restart frequency
In the Railway dashboard → your service → Deployments tab, look for:
  - Frequent restarts (OOM kills, crash loops)
  - Gaps in uptime that align with your timeout windows
- Add a dead-simple health probe to your app
If you don't have one, add a /health route that returns 200 OK with zero logic. Then use Railway's built-in healthcheck config:
# railway.toml
[deploy]
healthcheckPath = "/health"
healthcheckTimeout = 10
restartPolicyType = "ON_FAILURE"
restartPolicyMaxRetries = 5
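Railway uses this healthcheck at deploy time: a new deployment only receives traffic once /health returns 200, and the restart policy restarts the container when its process exits with an error. As far as I know, Railway does not keep polling the healthcheck after the deployment is live, so a container stuck in a zombie state (accepting TCP but not processing HTTP) will not be restarted by this config alone. Pair it with an external monitor that hits /health.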
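A zero-logic /health route could look roughly like the sketch below. It uses only Node's built-in http module; the names healthResponse and createHealthServer are made up for illustration, and in a real app the route would sit alongside your existing routes rather than replace them.

```javascript
// Minimal liveness endpoint sketch using only Node's built-in http module.
// healthResponse and createHealthServer are hypothetical names.
const http = require('http');

// Pure routing decision: no database, cache, or upstream calls,
// so a 200 here only proves the process can still serve HTTP.
function healthResponse(method, url) {
  const path = url.split('?')[0]; // ignore any query string
  if (method === 'GET' && path === '/health') {
    return { status: 200, body: 'OK' };
  }
  return { status: 404, body: 'Not Found' };
}

// Builds the server but does not start listening.
function createHealthServer() {
  return http.createServer((req, res) => {
    const { status, body } = healthResponse(req.method, req.url);
    res.writeHead(status, { 'Content-Type': 'text/plain' });
    res.end(body);
  });
}

// Usage (assumes the platform provides the port in the PORT env var):
// createHealthServer().listen(process.env.PORT || 3000);
```

Keeping the handler free of dependencies matters here: if /health stalls while DNS and TLS are fine, the problem is in the process or the path to it, not in a database or downstream API.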
- Capture exact UTC timestamps during next incident
# Run this during an outage window
while true; do
  RESULT=$(curl -o /dev/null -s -w "%{http_code} | time_connect:%{time_connect} | time_starttransfer:%{time_starttransfer} | total:%{time_total}\n" \
    --http1.1 --max-time 20 https://YOUR_DOMAIN/health)
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) | $RESULT"
  sleep 5
done
The time_starttransfer field is key — if it's 0.000 at timeout, the stall is pre-first-byte (proxy or container), not slow response.
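If you save the loop's output to a file, the lines can be sorted into categories automatically. The sketch below assumes the exact line layout printed by the loop above (timestamp, HTTP code, then the three timing fields separated by " | "); parseProbeLine and classifyProbe are hypothetical helper names, and the category labels are my own.

```javascript
// Parses one line of the probe loop's output, e.g. (made-up sample values):
// 2024-05-01T12:00:00Z | 000 | time_connect:0.045 | time_starttransfer:0.000 | total:20.001
function parseProbeLine(line) {
  const parts = line.split('|').map((s) => s.trim());
  if (parts.length !== 5) return null; // not a probe line
  const [timestamp, httpCode, connect, startTransfer, total] = parts;
  // Each timing field looks like "name:seconds"; keep the number after the colon.
  const num = (field) => Number(field.slice(field.indexOf(':') + 1));
  return {
    timestamp,
    httpCode,
    timeConnect: num(connect),
    timeStartTransfer: num(startTransfer),
    timeTotal: num(total),
  };
}

// curl reports 0 for any stage that was never reached.
function classifyProbe(probe) {
  if (probe.timeConnect === 0) return 'connect-failed';
  if (probe.timeStartTransfer === 0) return 'stalled-before-first-byte';
  return 'first-byte-received';
}
```

A run of 'stalled-before-first-byte' results with a non-zero connect time is the pattern described in this thread: the connection is accepted, but nothing ever comes back.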
Mitigations You Can Apply Unilaterally
Fix 1 — Force HTTP/1.1 + shorter keepalive on your server
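If your app uses Node/Express, keep-alive connections reused by Railway's proxy can go stale when the server closes them first (Node's default keepAliveTimeout is 5 seconds). The usual fix is to set the server's keep-alive timeout above the proxy's idle timeout: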
// Express / Node HTTP server
server.keepAliveTimeout = 65000; // must be > Railway proxy's 60s
server.headersTimeout = 66000; // must be > keepAliveTimeout
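To keep the two values consistent, they can be derived from a single number. This is only a sketch: applyKeepAliveTimeouts is a hypothetical helper, and the 60-second proxy idle timeout is the figure quoted above, not something I have verified against Railway's current documentation.

```javascript
// Derives both Node timeouts from the proxy's idle timeout so they stay in order:
// proxy idle < keepAliveTimeout < headersTimeout.
const http = require('http');

function applyKeepAliveTimeouts(server, proxyIdleMs) {
  server.keepAliveTimeout = proxyIdleMs + 5000; // outlive the proxy's idle timer
  server.headersTimeout = server.keepAliveTimeout + 1000; // must exceed keepAliveTimeout
  return server;
}

// 60000 is an assumption taken from the comment above; adjust if the proxy differs.
const server = applyKeepAliveTimeouts(http.createServer(), 60000);

// With Express, pass the object returned by app.listen() instead:
// applyKeepAliveTimeouts(app.listen(process.env.PORT), 60000);
```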
For other runtimes:
Runtime | Setting
Python/Gunicorn | --timeout 30 --keep-alive 65
Go | Server.ReadTimeout, WriteTimeout = 30s
Java/Spring | server.tomcat.connection-timeout=30s
Fix 2 — Add a Railway restart trigger
Railway supports cron-based restarts. If you're hitting a daily zombie pattern, a scheduled restart before your typical incident window is a blunt but effective workaround:
# railway.toml: add a cron service that hits a restart webhook
Or use Railway's "Redeploy" button via their API on a schedule while you investigate root cause.
Fix 3 — Reduce max container memory + add OOM guard
Silent OOM kills (where the container is still "running" but not scheduling work) match your symptoms exactly. In Railway dashboard:
- Set an explicit Memory limit (e.g., 512MB) so OOM kills are clean and trigger a restart
- Watch the Metrics tab during the next incident: a CPU/memory spike points to a container-side cause; flat metrics point to an edge/routing cause
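If the dashboard metrics are too coarse to line up with a short stall, the app can log its own memory use. A minimal sketch using Node's process.memoryUsage(); memorySnapshot, formatSnapshot, and the 30-second interval are illustrative choices, not anything Railway requires.

```javascript
// Takes a timestamped reading of the process's memory, in megabytes.
function memorySnapshot() {
  const { rss, heapUsed } = process.memoryUsage();
  const toMb = (bytes) => Math.round((bytes / 1024 / 1024) * 10) / 10; // one decimal
  return {
    time: new Date().toISOString(), // UTC, matches the probe timestamps
    rssMb: toMb(rss),
    heapUsedMb: toMb(heapUsed),
  };
}

// One log line per snapshot, easy to grep next to the probe output.
function formatSnapshot(s) {
  return `${s.time} rss=${s.rssMb}MB heap=${s.heapUsedMb}MB`;
}

// Usage: log every 30 seconds without keeping the process alive on its own.
// setInterval(() => console.log(formatSnapshot(memorySnapshot())), 30000).unref();
```

If these log lines stop during a stall window, the process itself was frozen or killed; if they continue with flat numbers, that points away from memory pressure.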
Fix 4 — Enable Railway's TCP proxy mode (for non-HTTP services)
If any of your services are non-HTTP (websockets, gRPC, etc.), make sure Railway is configured for the right proxy mode — HTTP proxy mode will silently mangle non-HTTP traffic at the TLS termination layer.
Escalating Effectively to Railway
Since their support thread isn't getting traction, file a follow-up in this format. It forces them to look at infrastructure logs rather than dismiss it as a client issue:
Subject: Recurring P1 — upstream stall post-TLS, 3+ days, not resolved by incident closure
Service: [your Railway service ID]
Project: [project ID]
Region: [us-west1 / eu-west etc.]
Incident windows (UTC):
Day 1: YYYY-MM-DD HH:MM – HH:MM UTC
Day 2: YYYY-MM-DD HH:MM – HH:MM UTC
Day 3: YYYY-MM-DD HH:MM – HH:MM UTC
Request: Correlate edge proxy logs for our service during these windows.
Specifically: was the upstream (container) being marked healthy? Were requests being forwarded?
curl -v --http1.1 --max-time 20 https://[domain]/health → HTTP 000, time_starttransfer=0
The service ID and exact UTC windows are what Railway's infra team needs to pull edge logs. Without those, support will keep closing tickets prematurely.
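The incident windows for that template can be derived from probe results instead of being reconstructed by hand. A sketch, assuming you already have time-ordered samples shaped like { timestamp, ok }, where ok is false for a timed-out probe; incidentWindows and formatWindow are hypothetical names.

```javascript
// Groups consecutive failed probes into { start, end } windows.
// Expects samples sorted by time, each with an ISO-8601 UTC timestamp.
function incidentWindows(samples) {
  const windows = [];
  let current = null;
  for (const { timestamp, ok } of samples) {
    if (!ok) {
      if (current) {
        current.end = timestamp; // extend the open window
      } else {
        current = { start: timestamp, end: timestamp }; // open a new window
      }
    } else if (current) {
      windows.push(current); // first success closes the window
      current = null;
    }
  }
  if (current) windows.push(current); // outage still ongoing at end of data
  return windows;
}

// Renders a window as one line for the ticket.
function formatWindow(w) {
  return `${w.start} to ${w.end} (UTC)`;
}
```

The end of each window is the last failed probe, so the real outage may have lasted up to one probe interval longer than reported.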
This is what my consultant AI says.
