22 days ago
We've hit this several times now: our domain returns the "Application failed to respond" page for several minutes at a time. Clicking "Restart" in the Railway dashboard resolves it immediately, every time.
The replicas themselves are healthy during the outage:
- QStash/internal traffic to /process-webhook-comment/* keeps flowing across all 4 replicas the entire time (~5 req/s, p99 < 1s)
- No app-side errors, no OOM, no crash, no SIGTERM
- Only public-edge traffic to our domain is affected
Most recent occurrence (2026-04-30 UTC):
- 13:31:25 GET / → 200 1ms (last successful public request)
- 13:31:25 → 13:40:26 no successful public responses (~9 min)
- 13:39:26 GET / received but never responded to (the event loop appears fine: internal /process-webhook-comment/* requests in the same window resolved in <1s on the same replicas)
- 13:40:26 GET / → 200 0ms (after manual restart)
- Request ID from the error page: jKHCszjVTHyj6bEJGbGh5g
This pattern (public edge dead, private/internal traffic fine, restart instantly fixes it) suggests the issue is in the edge proxy / routing layer rather than in our app. Can you take a look?
1 Replies
Status changed to Awaiting Railway Response Railway • 22 days ago
22 days ago
This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.
Status changed to Open Railway • 22 days ago
11 days ago
Given internal traffic keeps working while only the public domain hangs, this looks less like an app crash and more like stale public edge/routing state.
I would still include one app-side sanity check: confirm that the process is bound to 0.0.0.0:$PORT.
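For reference, here is a minimal sketch of what that binding looks like. I don't know what stack the app runs on, so this uses Python's standard library purely as an illustration. The names `bind_address`, `HealthHandler`, `serve` and the fallback port 8080 are hypothetical; the only parts taken from the thread are the 0.0.0.0 host and the PORT environment variable.

```python
import os
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Tuple


def bind_address(default_port: int = 8080) -> Tuple[str, int]:
    """Return the (host, port) pair the server should listen on."""
    # 0.0.0.0 listens on every interface. Binding to 127.0.0.1 instead
    # would only accept connections that originate inside the container.
    port = int(os.environ.get("PORT", default_port))
    return ("0.0.0.0", port)


class HealthHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a plain 200."""

    def do_GET(self) -> None:
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve() -> None:
    """Start the server; call this from the app's entry point."""
    host, port = bind_address()
    server = ThreadingHTTPServer((host, port), HealthHandler)
    print(f"listening on {host}:{port}")
    server.serve_forever()
```

If the startup log shows 127.0.0.1 or a hard-coded port rather than 0.0.0.0 and the value of PORT, fix that first. In this thread the public GET / did return 200 before and after the outage, so the binding is probably fine, but it is cheap to rule out.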
Then give Railway a tight trace packet:
1. affected domain
2. UTC outage window
3. error-page request ID
4. deployment/replica IDs
5. one internal request that succeeded in the same window
That is the useful evidence for them to trace the edge route without turning this into a generic deploy-debug thread.
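If this keeps recurring, it may help to assemble the packet the same way each time. Below is a rough sketch, not a Railway tool: `TracePacket` and its field names are made up for illustration. The outage window and request ID are the ones from the original post; the domain, deployment/replica IDs and the internal request sample are placeholders because they were not posted in the thread.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class TracePacket:
    """The five items from the list above, in the same order."""
    domain: str
    window_start: datetime
    window_end: datetime
    request_id: str
    deployment_ids: List[str] = field(default_factory=list)
    internal_sample: str = ""

    def outage_seconds(self) -> int:
        """Length of the outage window in whole seconds."""
        return int((self.window_end - self.window_start).total_seconds())

    def render(self) -> str:
        """Format the packet as a numbered plain-text list."""
        fmt = "%Y-%m-%d %H:%M:%S UTC"
        minutes, seconds = divmod(self.outage_seconds(), 60)
        return "\n".join([
            f"1. Domain: {self.domain}",
            f"2. Outage window: {self.window_start.strftime(fmt)} -> "
            f"{self.window_end.strftime(fmt)} ({minutes}m {seconds}s)",
            f"3. Error-page request ID: {self.request_id}",
            f"4. Deployment/replica IDs: {', '.join(self.deployment_ids) or 'TODO'}",
            f"5. Internal request that succeeded: {self.internal_sample or 'TODO'}",
        ])


# Values from the original post; "<...>" entries are placeholders to fill in.
packet = TracePacket(
    domain="<your-domain>",
    window_start=datetime(2026, 4, 30, 13, 31, 25, tzinfo=timezone.utc),
    window_end=datetime(2026, 4, 30, 13, 40, 26, tzinfo=timezone.utc),
    request_id="jKHCszjVTHyj6bEJGbGh5g",
    internal_sample="<one /process-webhook-comment/* log line from the window>",
)
print(packet.render())
```

Keeping the timestamps timezone-aware avoids any ambiguity about whether the window is in UTC or local time when Railway compares it against their edge logs.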