2 months ago
After deploying an update today, I started receiving a lot of 503 errors. I believe the proxy is dropping connections under heavy load, but this wasn't happening before, so I suspect recent changes to the edge proxy are the cause.
8 Replies
2 months ago
We have not made any changes. If our edge is returning a 503, it means your app is unresponsive. Given it's a Node app, and the vCPU usage is hovering around 1 vCPU, that checks out. You will want to horizontally scale out.
The issue started after I pushed a build at 2:54.
2 months ago
I've stated the facts. I'll step out and defer this to the community because our side of things is operating normally.
2 months ago
Your app doesn't respond consistently. We send a 503. I think your focus should be on why your app is consuming 1 vCPU, and if that's normal, then you need to horizontally scale.
Redeploying the exact same build today fixed all the 503 issues: same CPU usage, same number of network requests, no code change.
20 days ago
Since a redeploy of the same build cleared it, I would debug this as per-deployment process state first, not as proof of an edge proxy regression.
Railway's docs describe this class of error as the edge proxy being unable to communicate with the app. The common causes are a wrong host/port binding or a wrong target port; the less common cause is the app being under enough load that it does not respond. Brody already pointed at that last one, because a Node process near 1 vCPU can mean the event loop is saturated.
The useful split is:
- Did every instance of that deployment return 503, or only one replica? If one replica, it points to a bad/stuck process, startup path, or event-loop blocking after boot.
- Did the service have a healthcheck that would have caught the stuck process before traffic moved? If not, add a shallow `/health` and make sure it returns only when the HTTP server is ready.
- Was the public domain target port fixed to the same port the app listens on? If target port or `PORT` was mismatched, redeploys can look random depending on which process actually binds.
- Compare runtime logs for the bad deployment ID vs the good redeploy ID around 2:54. Look for long synchronous work, startup jobs, queue consumers, DB pool exhaustion, or repeated GC/memory pressure.
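To make the healthcheck and port points concrete, here is a minimal sketch of a shallow `/health` route on a plain Node HTTP server. It is not taken from the original poster's app: the `ready` flag, the function names, the fallback port 3000, and binding to `0.0.0.0` are all assumptions for illustration, and the `init` callback stands in for whatever startup work the real service does.

```typescript
import * as http from "node:http";

// Hypothetical readiness flag: flipped only after startup work has finished.
let ready = false;

// Read the listen port from PORT, falling back to an assumed default.
function resolvePort(env: Record<string, string | undefined>, fallback = 3000): number {
  const parsed = Number.parseInt(env.PORT ?? "", 10);
  return Number.isInteger(parsed) && parsed > 0 && parsed < 65536 ? parsed : fallback;
}

// 200 once the process is ready, 503 while it is still starting.
function healthStatus(isReady: boolean): number {
  return isReady ? 200 : 503;
}

function createServer(): http.Server {
  return http.createServer((req, res) => {
    if (req.url === "/health") {
      // Shallow check: no database or downstream calls, only process readiness.
      res.writeHead(healthStatus(ready), { "Content-Type": "text/plain" });
      res.end(ready ? "ok" : "starting");
      return;
    }
    res.writeHead(200, { "Content-Type": "text/plain" });
    res.end("hello");
  });
}

// Listen first, run startup work, and only then report healthy.
function start(init: () => Promise<void> = async () => {}): http.Server {
  const port = resolvePort(process.env);
  const server = createServer();
  // Binding to 0.0.0.0 is an assumption; use whatever host your platform expects.
  server.listen(port, "0.0.0.0", () => {
    console.log(`listening on ${port}`);
    init()
      .then(() => { ready = true; })
      .catch((err) => { console.error("startup failed", err); });
  });
  return server;
}
```

Calling `start()` at the entry point and logging the resolved port makes it easy to compare the port the process actually bound against the target port configured on the public domain.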
If CPU and request volume were truly identical after the redeploy, the next strongest evidence would be per-instance request logs and metrics for the bad deployment, not aggregate service CPU. Aggregate CPU can hide one stuck Node process or one replica that is accepting traffic but not completing requests.
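One way to get that per-instance evidence is to have each process log its own event-loop delay, since a saturated loop shows up there even when aggregate CPU looks normal. The sketch below uses Node's built-in `monitorEventLoopDelay` from `perf_hooks`; the function names, the 10-second interval, and the use of `RAILWAY_REPLICA_ID` / `RAILWAY_DEPLOYMENT_ID` as identifiers are assumptions, so substitute whatever identifiers your environment actually exposes.

```typescript
import { monitorEventLoopDelay } from "node:perf_hooks";
import * as os from "node:os";

// The histogram reports nanoseconds; convert to milliseconds with 2 decimals.
function nsToMs(ns: number): number {
  return Math.round(ns / 1e4) / 100;
}

// Periodically log event-loop delay for this process. Returns a stop function.
function startLoopDelayLogger(
  intervalMs = 10000,
  // Assumed env var names; falls back to the hostname if they are not set.
  instanceId = process.env.RAILWAY_REPLICA_ID ?? os.hostname(),
  deploymentId = process.env.RAILWAY_DEPLOYMENT_ID ?? "unknown",
): () => void {
  const histogram = monitorEventLoopDelay({ resolution: 20 });
  histogram.enable();

  const timer = setInterval(() => {
    console.log(JSON.stringify({
      msg: "event_loop_delay",
      instanceId,
      deploymentId,
      p50Ms: nsToMs(histogram.percentile(50)),
      p99Ms: nsToMs(histogram.percentile(99)),
      maxMs: nsToMs(histogram.max),
    }));
    histogram.reset(); // each log line covers only the latest interval
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for logging

  return () => {
    clearInterval(timer);
    histogram.disable();
  };
}
```

If one replica of the bad deployment shows a p99 delay in the hundreds of milliseconds while the others stay low, that points at a stuck or blocked process rather than at routing.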
Practical mitigation: add/verify a healthcheck, horizontally scale if Node sits near one vCPU under normal traffic, and keep the bad deployment ID plus the good redeploy ID for Railway so they can compare edge-to-instance routing for the same image.