13 days ago
## Summary
Custom domain `engine.govnu.dev` returns intermittent HTTP 502 with `X-Railway-Fallback: true` ("Application failed to respond") depending on which Railway edge node serves the request. Some edges (e.g., `railway/europe-west4-drams3a`) have the upstream mapping and route correctly; other edges (`railway/us-west2`, `railway/us-east4`) do not. Same container, same deploy, same time window. The auto-generated `*.up.railway.app` hostname routes correctly from ALL edges.
## Project + service IDs
- Project ID: `ad269685-f032-4565-b52f-7c027ea255fa`
- Environment ID: `76596381-b6f3-4217-a75d-ceba260b2a18`
- Service ID: `4709dc28-4241-4317-97fb-5add938d9e2a` (`@govnu/engine`)
- Custom domain: `engine.govnu.dev`
- CNAME target: `eg3schz6.up.railway.app`
- Custom domain target port: `8080`
- `PORT` env variable: `8080` (explicitly set in service variables)
- Engine listening port: `8080` (verified in deploy logs: `HTTP server running ... port=8080`)
- Healthcheck path: `/api/ping` — passes on every deploy
## Symptom
### 502 fallback responses (from US Comcast routes)
HTTP/1.1 502 Bad Gateway
Server: railway-edge
X-Railway-Edge: railway/us-west2
X-Railway-Fallback: true
X-Railway-Request-Id: vN53_-L8Q3aGAq1flt7tkg
{"status":"error","code":502,"message":"Application failed to respond","request_id":"vN53_-L8Q3aGAq1flt7tkg"}
Other failing request IDs from the same time window: `_obQLUh0SrSKt8Gj2prcFg`, `hPKoSQHJTNqUMK0f-_9nXA`.
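The fallback response above can be told apart from an application response by its headers alone. The following is a minimal sketch, assuming that a missing `X-Railway-Fallback: true` header means the upstream container answered; that is an inference from the responses quoted in this ticket, not documented Railway behavior. `classifyResponse` is a hypothetical helper name.

```javascript
// Classify a response as a Railway edge fallback or an application response.
// Header names are the ones visible in the responses quoted in this ticket.
function classifyResponse(status, headers) {
  // Normalize header names to lower case so lookups are case-insensitive.
  const h = {};
  for (const [name, value] of Object.entries(headers)) {
    h[name.toLowerCase()] = String(value);
  }
  const fallback = (h["x-railway-fallback"] || "").toLowerCase() === "true";
  return {
    status,
    edge: h["x-railway-edge"] || null,
    requestId: h["x-railway-request-id"] || null,
    // Assumption: no fallback header means the response came from the app.
    source: fallback ? "edge-fallback" : "app",
  };
}

// Example using the failing response from this ticket.
console.log(
  classifyResponse(502, {
    "X-Railway-Edge": "railway/us-west2",
    "X-Railway-Fallback": "true",
    "X-Railway-Request-Id": "vN53_-L8Q3aGAq1flt7tkg",
  })
);
```

Logging the `edge` and `requestId` fields for every failed request makes it easier to hand Railway a complete list of affected edges.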
### Successful request (Europe edge — engine container received it)
From engine deploy log of `06573520-ee77-4ec1-8c85-16e437eb2b92` at `2026-05-10T05:16:53.809015418Z`:
```json
{
  "method": "GET",
  "url": "/",
  "headers": {
    "host": "engine.govnu.dev",
    "x-forwarded-host": "engine.govnu.dev",
    "x-railway-edge": "railway/europe-west4-drams3a",
    "x-railway-request-id": "FUWO9NQ4QsuDoXJVnbOCzg"
  },
  "responseStatus": 404
}
```

The 404 is correct (`GET /` has no route; only `/api/ping` is implemented). The point is that the request reached the engine container via `railway/europe-west4-drams3a` while parallel requests from US edges to the same domain returned 502 fallbacks.
### Working request via `*.up.railway.app` (control)
HTTP/1.1 200 OK
X-Railway-Edge: railway/us-east4-eqdc4a
{"status":"ok","service":"govnu-engine","timestamp":"..."}

Same engine container, working from the same edge that returns 502 for the custom domain.
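The custom-domain versus generated-domain comparison can be repeated as a small probe. This is a sketch under stated assumptions: it needs Node 18+ for the global `fetch`, it uses the `/api/ping` path and hostnames mentioned in this ticket, and `probe`, `compare`, and `summarizeByEdge` are hypothetical helper names. A client cannot choose which edge serves it, so the probe would have to run from several network locations to cover several edges.

```javascript
// Count app responses vs. edge fallbacks per (hostname, edge) pair.
function summarizeByEdge(samples) {
  const summary = {};
  for (const s of samples) {
    const key = `${s.host} via ${s.edge || "unknown"}`;
    summary[key] = summary[key] || { app: 0, fallback: 0 };
    if (s.fallback) summary[key].fallback += 1;
    else summary[key].app += 1;
  }
  return summary;
}

// Send one request and record which edge answered and whether it fell back.
async function probe(host, path = "/api/ping") {
  const res = await fetch(`https://${host}${path}`);
  return {
    host,
    status: res.status,
    edge: res.headers.get("x-railway-edge"),
    fallback: res.headers.get("x-railway-fallback") === "true",
  };
}

// Probe every host `attempts` times, alternating hosts so both see the same window.
async function compare(hosts, attempts = 10) {
  const samples = [];
  for (let i = 0; i < attempts; i++) {
    for (const host of hosts) {
      samples.push(await probe(host));
    }
  }
  return summarizeByEdge(samples);
}
```

A possible invocation is `compare(["engine.govnu.dev", "govnuengine-production.up.railway.app"]).then(console.log)`. If the hypothesis in this ticket holds, fallbacks should appear only under the custom-domain keys.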
## Configuration steps already attempted (none resolved)
- Added the custom domain with target port `4000` (Dockerfile `ENV PORT=4000` default). Worked briefly, broke after first redeploy.
- Updated target port `4000 → 8080` after discovering that the Railway runtime injects `PORT=8080`, overriding the Dockerfile. Worked briefly, broke after next redeploy.
- Removed + re-added the custom domain; Railway issued new CNAME target `g0prprpk → eg3schz6`. Updated Namecheap CNAME, TXT verification re-resolved correctly. Domain showed green/active. Worked briefly, broke after next redeploy.
- Set `PORT=8080` explicitly in service variables per Railway docs ("If your application is listening on an explicitly defined port, you must define a PORT variable"). Triggered fresh deploy `06573520`. Custom domain still shows the same intermittent edge-node behavior.
- DNS verified resolving correctly via Google DNS (`8.8.8.8`) and Cloudflare DNS (`1.1.1.1`). TLS handshake succeeds on all edges (cert is provisioned).
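The DNS check in the last step can be scripted so it is repeatable after each domain re-add. This is a minimal sketch using Node's built-in `dns.promises.Resolver`. The helper names `sameDnsName`, `cnameVia`, and `checkCname` are hypothetical, and the expected target is whatever CNAME Railway currently shows for the domain.

```javascript
const { Resolver } = require("node:dns").promises;

// Compare DNS names ignoring case and a single trailing dot.
function sameDnsName(a, b) {
  const norm = (name) => name.toLowerCase().replace(/\.$/, "");
  return norm(a) === norm(b);
}

// Ask one specific resolver (by IP) for the CNAME records of `host`.
// Rejects if the resolver returns no CNAME for that name.
async function cnameVia(resolverIp, host) {
  const resolver = new Resolver();
  resolver.setServers([resolverIp]);
  return resolver.resolveCname(host);
}

// Report, per resolver, whether the expected CNAME target is present.
async function checkCname(host, expected, resolverIps = ["8.8.8.8", "1.1.1.1"]) {
  const results = {};
  for (const ip of resolverIps) {
    const targets = await cnameVia(ip, host);
    results[ip] = targets.some((t) => sameDnsName(t, expected));
  }
  return results;
}
```

For this ticket the call would be `checkCname("engine.govnu.dev", "eg3schz6.up.railway.app").then(console.log)`. A `true` for both resolvers only confirms public DNS; it says nothing about the edge-side mapping.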
## Hypothesis
Railway's anycast edge mesh does not propagate custom-domain → upstream mappings consistently across all edge nodes. The auto-generated `*.up.railway.app` mapping appears to use a different (more reliable) propagation path than user-added custom domains.
## Request
- Investigate why the `engine.govnu.dev → service 4709dc28-4241-4317-97fb-5add938d9e2a:8080` mapping is inconsistent across edge nodes (specifically `railway/us-west2` and `railway/us-east4`).
- Force-refresh the upstream mapping for our custom domain across all edges.
- Update the documentation if there is a known operator-side configuration to ensure consistent edge propagation.
## Operational impact
The end-user voice transcription feature (Tier 4 managed Deepgram via WebSocket gateway) is non-functional for users routed through US edges. The browser falls back gracefully to Tier 1, but Tier 4 is the intended path. The product is pre-beta, so impact is contained, but this blocks our voice product launch.
## Workaround in place
We have NOT swapped our frontend's WebSocket URL to point to `wss://govnuengine-production.up.railway.app` because we want to confirm the issue is reproducible on your side first. We can swap if needed but prefer to keep the branded `engine.govnu.dev` URL.
1 Replies
12 days ago
After looking through everything, I’m pretty sure this is an edge propagation/cache issue on Railway’s side rather than a problem with the app itself.
The biggest clue is that the generated `*.up.railway.app` URL works perfectly from every tested region and edge node, while only the custom domain fails intermittently. Since it’s the exact same service/container behind both domains, that tells me the app, TLS, and port configuration are all probably fine.
At this point it seems like some Railway edge nodes still have stale routing information for the custom domain after redeploys.
A few things that usually help fix this:
- completely remove the custom domain from Railway
- delete the DNS records from Namecheap
- wait a few minutes before re-adding anything
- re-add the domain using port `8080` manually
- only use the brand new CNAME Railway generates
- avoid redeploying for ~15-30 minutes while propagation finishes
I’d also try removing the manually set `PORT=8080` variable and just letting Railway inject the port automatically. The app should use `const port = process.env.PORT || 8080;`. Also make sure the server is binding to `0.0.0.0` and not `localhost`.
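Put together, the port and bind-address advice could look like the sketch below. It uses only Node's built-in `http` module. The `/api/ping` route mirrors the healthcheck path from the ticket; the response bodies and the helper names `resolvePort`, `handler`, and `startServer` are assumptions for illustration, not the engine's actual code.

```javascript
const http = require("node:http");

// Use the injected PORT when it is a valid port number, otherwise default to 8080.
function resolvePort(env = process.env) {
  const parsed = Number.parseInt(env.PORT, 10);
  return Number.isInteger(parsed) && parsed > 0 && parsed < 65536 ? parsed : 8080;
}

// Minimal request handler: only the healthcheck path is implemented.
function handler(req, res) {
  if (req.method === "GET" && req.url === "/api/ping") {
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ status: "ok" }));
    return;
  }
  res.writeHead(404, { "Content-Type": "application/json" });
  res.end(JSON.stringify({ status: "not found" }));
}

// Bind to 0.0.0.0 so the server accepts connections on all interfaces,
// not only on the loopback interface.
function startServer() {
  const port = resolvePort();
  const server = http.createServer(handler);
  server.listen(port, "0.0.0.0", () => {
    console.log(`HTTP server running host=0.0.0.0 port=${port}`);
  });
  return server;
}
```

Calling `startServer()` at the entry point starts the listener. Logging the resolved host and port at startup makes it easy to confirm in the deploy logs which values the container actually used.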
Another thing that sometimes fixes this completely is cloning the Railway service and attaching the domain to the new service instead. That forces Railway to generate fresh internal routing/upstream mappings.
For now, the safest workaround would probably be temporarily switching the frontend/WebSocket URL to the Railway-generated domain since that’s routing correctly everywhere already.
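If a hard switch feels too blunt, one option is to keep the branded URL as the first choice and fall back to the generated domain only when the handshake fails. This is a rough sketch, not a drop-in implementation: `connectWithFallback` is a hypothetical helper, the WebSocket constructor is injected (in a browser it would be `globalThis.WebSocket`), and the 5-second timeout is an arbitrary assumption.

```javascript
// Try each WebSocket URL in order and resolve with the first socket that opens.
// Rejects only after every URL has failed or timed out.
function connectWithFallback(urls, WebSocketImpl, timeoutMs = 5000) {
  return new Promise((resolve, reject) => {
    let index = 0;

    const tryNext = () => {
      if (index >= urls.length) {
        reject(new Error("all WebSocket endpoints failed"));
        return;
      }
      const ws = new WebSocketImpl(urls[index++]);
      let settled = false;

      // Give up on this URL and move on to the next one.
      const fail = () => {
        if (settled) return;
        settled = true;
        clearTimeout(timer);
        try {
          ws.close();
        } catch (_) {
          // Ignore errors from closing a socket that never opened.
        }
        tryNext();
      };
      const timer = setTimeout(fail, timeoutMs);

      ws.onopen = () => {
        if (settled) return;
        settled = true;
        clearTimeout(timer);
        resolve(ws);
      };
      // A non-101 handshake response (such as a 502 from the edge) surfaces as error/close.
      ws.onerror = fail;
      ws.onclose = fail;
    };

    tryNext();
  });
}
```

In the frontend this might be called as `connectWithFallback(["wss://engine.govnu.dev", "wss://govnuengine-production.up.railway.app"], WebSocket)`. Users on healthy edges keep the branded URL, and the fallback can be removed once the custom domain routes consistently.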