4 months ago
We are running a FastAPI Python service on us-east-1. About 15-20 times a day, Cloudflare returns a 502 Bad Gateway, thinking the Railway app is down. Looking at the request IDs and headers, we never see those requests reach our app. We ruled out the following:
Server deployments
Client calls (we have the CF Ray IDs where Cloudflare says the origin host is down)
WAF events - requests go through Cloudflare security according to Event logs
Happens in our staging and production environments (more often in production though)
We're thinking it could be something with one of the following:
Railway load balancing with keep-alive connections
Railway firewall (?) - from what we understand there's nothing besides Cloudflare between our app and the internet
Network hiccups (but it seems too frequent for that many random network incidents)
Thanks.
4 months ago
how are you starting your server?
funny you should ask, you provided the settings for me… NEW_RELIC_CONFIG_FILE=newrelic.ini NEW_RELIC_ENVIRONMENT=$ENVIRONMENT newrelic-admin run-program uvicorn app.main:app --host 0.0.0.0 --port $PORT --workers $UVICORN_WORKERS --loop uvloop --http httptools --limit-concurrency $UVICORN_LIMIT_CONCURRENCY --backlog $UVICORN_BACKLOG
4 months ago
the 502 im seeing happened in 57 milliseconds, so im not sure if it has to do with keep alives
4 months ago
something at the application layer is closing the connection, unfortunately i dont see anything sus here
4 months ago
i think its time to add more tracing into your app to help you find out whats going wrong
Yeah, I've added trace IDs and such, but there's nothing to trace in the app since the request never makes it.
We instrumented from the caller (NextJS) as far as we can. It's a black box between Cloudflare and Railway now (above the app layer). The app never receives the request. (We log every request)
4 months ago
i assume the logging will only actually log if the request completes, but if something is going wrong at your application level to cause this, there may be no logs
Well we log all exceptions and have sentry/newrelic instrumentation as well
4 months ago
the http logs we have are the same as the ones we give you
4 months ago
did you know you can expand the http logs to see more information
4 months ago
our http logs
Yup. We are ingesting them into New Relic. Let me check the Railway console, maybe New Relic is dropping some logs.
4 months ago
how are you ingesting our http logs to new relic?
Oh, and I didn't realize you can open the http logs.. there's not much additional info:
4 months ago
you are looking at your logs, not our logs, try clicking the http logs tab
4 months ago
yeah put a @ symbol into the search and you will get an option to filter by http status
Neat. Ok, I can see the 502 in the http logs. Yeah, there's a strange… responseDetails: "failed to forward request to upstream: connection closed unexpectedly"
4 months ago
yeah that means your logging and error handling did not capture whatever error caused that request to close prematurely
Yup. Thanks for the http logs. I didn’t realize those existed in the other tab. And looks like I can’t log drain those, only my app logs.
4 months ago
no there arent, sorry
No worries. I read the feature request pages. Are there any policies against scraping the logs?
4 months ago
you would want to use the API to grab them
It started happening when we increased workers to 16. We reduced to 4 and haven't seen any 502s yet. Possibly some connection keep-alive issue in the Starlette connection handling before the request is handed off to our app.
i may be off base but isn't 16 workers a lot tho. python is notoriously ram hungry and a worker can go up to 500mb if not optimized, so 16 x 500mb would saturate the 8gb ram railway has per service
that could explain why the connection closes unexpectedly if it's saturating the ram
I thought each service has 32 GB of RAM and scales appropriately? We were running about 300 MB per worker… here's our memory usage before and after…
Traffic was low, so maybe workers were idling out? Not quite sure. I think there's something with the keep-alive settings between the Railway HTTP load balancer and FastAPI/Uvicorn, but I'm not sure what the settings are for the HTTP load balancer. Are there recommended network settings?
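One thing we're considering trying on our side (just a sketch, not something from Railway docs; the 30-second value is a guess since we don't know the load balancer's idle timeout) is raising Uvicorn's keep-alive timeout above its 5-second default so the app isn't the side closing idle connections first:
NEW_RELIC_CONFIG_FILE=newrelic.ini NEW_RELIC_ENVIRONMENT=$ENVIRONMENT newrelic-admin run-program \
  uvicorn app.main:app --host 0.0.0.0 --port $PORT --workers 4 \
  --timeout-keep-alive 30 --loop uvloop --http httptools \
  --limit-concurrency $UVICORN_LIMIT_CONCURRENCY --backlog $UVICORN_BACKLOG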
4 months ago
the 502 I saw happened in 50 ms so timeouts do not come into play with such short requests
The other one was at 0 ms, so I don't think it's a timeout (as in the request taking too long), but possibly connection re-use and a timeout while managing connections in a pool/load balancer.
I totally agree with @Brody that it's something with our FastAPI service. I'll continue to monitor the workers and 502s. Thanks for all the help!
4 months ago
do you have max_requests?
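for context im asking because gunicorn's --max-requests recycles a worker after it has served that many requests, and plain uvicorn has --limit-max-requests which terminates the process after that many requests. hypothetical examples, numbers made up:
uvicorn app.main:app --workers 4 --limit-max-requests 1000
gunicorn app.main:app -k uvicorn.workers.UvicornWorker -w 4 --max-requests 1000 --max-requests-jitter 100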
Quick update here… switched to Gunicorn with Uvicorn workers and had the same issue, then switched to Granian and occurrences are much less frequent now (about once a day). I think at this point it's something to do with the New Relic monitoring. Closing out this thread. Thanks all!
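For anyone landing here later, roughly the shape of the two setups we tried (a sketch; the worker counts and flags are illustrative, not our exact config):
gunicorn app.main:app -k uvicorn.workers.UvicornWorker -w 4 -b 0.0.0.0:$PORT
granian --interface asgi --host 0.0.0.0 --port $PORT --workers 4 app.main:app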