a year ago
Not sure if this is affecting others as well!
I upgraded my (otherwise stable) FastAPI server from 1 instance to 4 with horizontal scaling, and I now get a proliferation of intermittent 502 Bad Gateway errors from the Railway proxy.
About 5-10% of requests fail to execute (even with exponential retry), which is hurting my customers' workflows.
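For readers unfamiliar with the term, here is a minimal sketch of what "exponential retry" on the client side can look like. It is not the code used in this project; `retry_with_backoff`, `call`, and `is_retryable` are hypothetical names, and the delays are illustrative defaults.

```python
import random
import time


def retry_with_backoff(call, is_retryable, max_attempts=5, base_delay=0.5,
                       sleep=time.sleep):
    """Call `call()` until the result is acceptable or attempts run out.

    `call` and `is_retryable` are hypothetical hooks: `call` performs the
    request and returns a result, and `is_retryable(result)` decides whether
    that result (for example a 502 status) should be retried.
    """
    result = None
    for attempt in range(max_attempts):
        result = call()
        if not is_retryable(result):
            return result
        if attempt < max_attempts - 1:
            # The base delay doubles on every attempt; random jitter keeps
            # many clients from retrying at exactly the same moment.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    # All attempts were retryable failures; return the last result as-is.
    return result
```

With an HTTP client, `call` would wrap the request and `is_retryable` would check the status code, for example `lambda status: status == 502`.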
The docs, https://docs.railway.com/reference/errors/application-failed-to-respond, suggest either that the IP/ports are not configured properly or that the server is under heavy strain.
I confirmed the IP and port: 0.0.0.0:8080
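A minimal sketch of how the listen address is often resolved, assuming the platform passes the port through a `PORT` environment variable and falling back to 8080 otherwise. `resolve_bind` is a hypothetical helper, not part of FastAPI or uvicorn.

```python
import os


def resolve_bind(default_port=8080):
    """Return the (host, port) pair the server should listen on."""
    # 0.0.0.0 listens on all interfaces, so a proxy outside the container
    # can reach the process; 127.0.0.1 would only accept local connections.
    host = "0.0.0.0"
    # Assumption: the platform injects the port as the PORT variable.
    port = int(os.environ.get("PORT", default_port))
    return host, port


# Usage with uvicorn, assuming `app` is the FastAPI instance:
#   host, port = resolve_bind()
#   uvicorn.run(app, host=host, port=port)
```

The port returned here has to match whatever port the proxy forwards to, otherwise requests never reach the application.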
I don't think it's an "Application Under Heavy Load" error either, since you'd expect each instance to have LESS load after adding horizontal scaling. With a single instance I am virtually error-free (maybe one 502 per ~1000 requests), but with four instances that shoots up to ~50-100 errors per 1000 requests.
This leads me to believe it may be an issue with Railway's edge routing or load balancing between my instances…
Would love any help from someone who's great at either FastAPI or Railway scaling.
https://station.railway.com/questions/i-get-a-surge-of-502-errors-when-adding-671834e0
Project ID: 7c37f583-3c7b-4fb6-8fc8-9b57b0eb3606
Service ID: 1eee9966-3a1e-4f0b-9313-5737db82166d
10 Replies
a year ago
what is the error you get from the 502's? it's shown in the http logs
@Brody
Here's my error:
hud.server.requests.RequestError: Request failed with status 502 - JSON response: {'status': 'error', 'code': 502, 'message': 'Application failed to respond', 'request_id': 'mn4qUBnPQpalDRtY8sXEeg_1861343781'} | Status: 502 | Response Text: {"status":"error","code":502,"message":"Application failed to respond","request_id":"mn4qUBnPQpalDRtY8sXEeg_1861343781"} | Response JSON: {'status': 'error', 'code': 502, 'message': 'Application failed to respond', 'request_id': 'mn4qUBnPQpalDRtY8sXEeg_1861343781'} | Headers: {'content-length': '120', 'content-type': 'application/json', 'server': 'railway-edge', 'x-railway-edge': 'railway/us-west1', 'x-railway-fallback': 'true', 'x-railway-request-id': 'mn4qUBnPQpalDRtY8sXEeg_1861343781', 'date': 'Sat, 08 Mar 2025 02:54:24 GMT'}
a year ago
please provide the error given by the HTTP logs on railway
requestId:"Vv0RYWvKT8m0nnbYpEsybQ_3167001623"
timestamp:"2025-03-08T08:00:00.088690431Z"
method:"POST"
path:"/hud-gym/api/v1/execute_step/57ea66ed-190a-45e6-879f-b5abb8bcf1c7"
host:"orchestrator.hud.live"
httpStatus:502
upstreamProto:"HTTP/1.1"
downstreamProto:"HTTP/1.1"
responseDetails:"failed to forward request to upstream: body read after close"
totalDuration:4011
upstreamAddress:"http://[fd12:4680:400d:0:a000:8:396b:580f]:8000"
clientUa:"python-httpx/0.28.1"
upstreamRqDuration:4011
txBytes:120
rxBytes:420
srcIp:"136.25.59.57"
edgeRegion:"us-west1"
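One detail in the log entry above may be worth checking: `upstreamAddress` ends in port 8000, while the original post mentions 0.0.0.0:8080. Whether that matters depends on how the service's target port is configured, which this thread does not show. A small sketch for pulling the port out of such an address, with `upstream_port` as a hypothetical helper name:

```python
from urllib.parse import urlsplit


def upstream_port(upstream_address):
    """Return the port from an upstream URL, including bracketed IPv6 hosts."""
    return urlsplit(upstream_address).port


# Value copied from the upstreamAddress field of the log entry.
upstream = "http://[fd12:4680:400d:0:a000:8:396b:580f]:8000"
print(upstream_port(upstream))  # 8000
```

Comparing this value with the port the server actually binds to on every replica is a quick way to rule out a port mismatch.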
Btw, sorry @Brody, I didn't realize that a Discord help message opens up another help thread on help-station.
I made a request there at first: https://station.railway.com/questions/i-get-a-surge-of-502-errors-when-adding-671834e0
Happy to upgrade to enterprise and hop on a call to work through this.
a year ago
does this ever happen on the legacy edge network
Actually this was on the legacy edge network!
I moved to metal edge and am going to test now…
a year ago
oh then if these errors are on the legacy edge network that means these are application level errors, the legacy edge network has been GA for over 6 months
Interesting.
Is there a reason you can think of why, at the application level, there would be no 502s on a single instance and then many 502s when using horizontal scaling?
If the application were under heavy load, I would expect LESS load per instance with multiple replicas, not more.
a year ago
I don't have any ideas unfortunately, what does that endpoint do?