Occasional 502 Bad Gateway from Cloudflare to origin

owen-marit
PRO

4 months ago

We are running a FastAPI Python service on us-east-1. About 15-20 times a day, Cloudflare returns a 502 Bad Gateway, reporting that the Railway app is down. Looking at the request IDs and headers, we never see the request make it to our app. We ruled out the following:

  • Server deployments

  • Client calls (we have the CF Ray ID showing the origin host reported as down)

  • WAF events - requests go through Cloudflare security according to Event logs

  • Happens in our staging and production environments (more often in production though)

We're thinking it could be something with one of the following:

  • Railway load balancing with keep-alive connections

  • Railway firewall (?) - from what we understand there's no additional Cloudflare layer between our app and the internet

  • Network hiccups (but it seems too frequent to be that many separate network incidents)

Thanks.


owen-marit
PRO

4 months ago

2dabc326-678a-4f5e-8e12-6a40c656ef91


4 months ago

how are you starting your server?


owen-marit
PRO

4 months ago

funny you should ask, you provided the settings for me… NEW_RELIC_CONFIG_FILE=newrelic.ini NEW_RELIC_ENVIRONMENT=$ENVIRONMENT newrelic-admin run-program uvicorn app.main:app --host 0.0.0.0 --port $PORT --workers $UVICORN_WORKERS --loop uvloop --http httptools --limit-concurrency $UVICORN_LIMIT_CONCURRENCY --backlog $UVICORN_BACKLOG
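
Same command split across lines for readability (nothing changed, just line breaks):

```
NEW_RELIC_CONFIG_FILE=newrelic.ini \
NEW_RELIC_ENVIRONMENT=$ENVIRONMENT \
newrelic-admin run-program uvicorn app.main:app \
  --host 0.0.0.0 --port $PORT \
  --workers $UVICORN_WORKERS \
  --loop uvloop --http httptools \
  --limit-concurrency $UVICORN_LIMIT_CONCURRENCY \
  --backlog $UVICORN_BACKLOG
```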


owen-marit
PRO

4 months ago

UVICORN_WORKERS=4 UVICORN_LIMIT_CONCURRENCY=256 UVICORN_BACKLOG=512


owen-marit
PRO

4 months ago

should we be setting --timeout-keep-alive? (the default is 5 seconds)
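
If we did raise it, I assume it's just a matter of adding the flag to the start command above, something like this (the 75-second value is only illustrative; the usual advice I've seen is to keep the app's keep-alive longer than whatever idle timeout the proxy in front of it uses):

```
uvicorn app.main:app --host 0.0.0.0 --port $PORT \
  --workers $UVICORN_WORKERS \
  --timeout-keep-alive 75
```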


4 months ago

the 502 im seeing happened in 57 milliseconds, so im not sure if it has to do with keep alives


4 months ago

something at the application layer is closing the connection, unfortunately i dont see anything sus here


owen-marit
PRO

4 months ago

yeah, today I only saw 4 502's…

[screenshot attachment]


4 months ago

i think its time to add more tracing into your app to help you find out whats going wrong


owen-marit
PRO

4 months ago

Yeah, I've added trace IDs and such, but there's nothing to trace on the app since the request never makes it there.


owen-marit
PRO

4 months ago

There were more 502's earlier this week…

[screenshot attachment]


owen-marit
PRO

4 months ago

We instrumented from the caller (Next.js) as far as we can. It's a black box between Cloudflare and Railway now (above the app layer). The app never receives the request. (We log every request.)


4 months ago

i assume the logging will only actually log if the request completes, but if something is going wrong at your application level to cause this, there may be no logs


owen-marit
PRO

4 months ago

Well we log all exceptions and have sentry/newrelic instrumentation as well


owen-marit
PRO

4 months ago

Do you have access logs or load balancer logs?


owen-marit
PRO

4 months ago

We are logging everything that reaches our instance.


4 months ago

our http logs are the same ones we give you


owen-marit
PRO

4 months ago

Gotcha.


4 months ago

did you know you can expand the http logs to see more information


owen-marit
PRO

4 months ago

Are you talking about in railway logs or cloudflare ?


4 months ago

our http logs


owen-marit
PRO

4 months ago

Yup. We are ingesting them into New Relic. Let me check the Railway console; maybe New Relic is dropping some logs.


4 months ago

how are you ingesting our http logs to new relic?


owen-marit
PRO

4 months ago

Sorry for the delay. We are using NewRelic to grab the output from stdout.


owen-marit
PRO

4 months ago

Oh, and I didn't realize you can open the http logs… there's not much additional info:

[screenshot attachment]


4 months ago

you are looking at your logs, not our logs, try clicking the http logs tab


owen-marit
PRO

4 months ago

oh, i didn't realize there were separate HTTP logs…

[screenshot attachment]


owen-marit
PRO

4 months ago

nice.


owen-marit
PRO

4 months ago

Is there a way to filter by status?


4 months ago

yeah, put an @ symbol into the search and you will get an option to filter by http status


owen-marit
PRO

4 months ago

Neat. Ok, I can see the 502 in the http logs. Yeah, there's a strange… responseDetails: "failed to forward request to upstream: connection closed unexpectedly"


owen-marit
PRO

4 months ago

And a missing timestamp for a log in the Deploy logs for that time.


4 months ago

yeah that means your logging and error handling did not capture whatever error caused that request to close prematurely


owen-marit
PRO

4 months ago

Yup. Thanks for the http logs. I didn't realize those existed in the other tab. And it looks like I can't log-drain those, only my app logs.


owen-marit
PRO

4 months ago

Is there any way to export the http logs other than manual exports?


4 months ago

no there arent, sorry


owen-marit
PRO

4 months ago

No worries. I read the feature request pages. Are there any policies against scraping the logs?


shreyaan
HOBBY

4 months ago

interesting. did you fix it?


4 months ago

you would want to use the API to grab them
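
roughly along these lines — the GraphQL endpoint and bearer-token auth are documented, but the query name and fields below are placeholders, so check the current public API schema before relying on them:

```
# Sketch only: "httpLogs" and its fields are assumed names, not verified against the schema
curl -s https://backboard.railway.app/graphql/v2 \
  -H "Authorization: Bearer $RAILWAY_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"query":"query { httpLogs(deploymentId: \"<deployment-id>\") { timestamp path httpStatus } }"}'
```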


owen-marit
PRO

4 months ago

It started happening when we increased workers to 16. We reduced to 4 and are not seeing any 502s yet. Possibly some connection keep-alive issue in the Starlette connection handling before it's handed off to our app.


shreyaan
HOBBY

4 months ago

i may be off base here, but isn't 16 workers a lot? python is notoriously ram hungry and a worker can go up to 500 MB if not optimized, so at 16 workers (16 × ~500 MB ≈ 8 GB) you could be saturating the 8 GB of ram railway has per service


shreyaan
HOBBY

4 months ago

that could explain why the connection closes unexpectedly if it's saturating the ram


owen-marit
PRO

4 months ago

I thought each service has 32 GB of RAM and scales appropriately? We were running about 300 MB per worker… here's our memory usage before and after…

[screenshot attachment]


owen-marit
PRO

4 months ago

Traffic was low, so maybe workers were idling out? Not quite sure. I think there's something with the keep-alive settings between the Railway HTTP load balancer and FastAPI/Uvicorn, but I'm not sure what the settings are for the HTTP load balancer. Are there recommended network settings?


4 months ago

the 502 I saw happened in 50 ms so timeouts do not come into play with such short requests


owen-marit
PRO

4 months ago

The other one took 0 ms, so I don't think it's a timeout issue (as in the request taking too long), but possibly connection re-use and a timeout while connections are being managed in a pool/load balancer.


owen-marit
PRO

4 months ago

I totally agree with @Brody that it's something with our FastAPI service. I'll continue to monitor the workers and 502s. Thanks for all the help!



4 months ago

do you have max_requests?


owen-marit
PRO

4 months ago

No, we use only max concurrency and backlog.
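
(For reference, if we wanted worker recycling with plain Uvicorn, I believe the closest equivalent to Gunicorn's max_requests is --limit-max-requests — the value below is only illustrative, and I haven't tested how it behaves with multiple workers:)

```
uvicorn app.main:app --host 0.0.0.0 --port $PORT \
  --workers $UVICORN_WORKERS \
  --limit-max-requests 1000
```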


owen-marit
PRO

4 months ago

FYI. Following up with FastAPI folks


owen-marit
PRO

3 months ago

Quick update here… switched to Gunicorn with Uvicorn workers and had the same issue, then switched to Granian and occurrences are much less frequent now (about once a day). I think at this point it's something to do with the New Relic monitoring. Closing out this thread. Thanks all!
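
For anyone landing on this thread later, the alternative start commands we tried looked roughly like this (flags and values are from memory and only illustrative — check each server's docs before copying):

```
# Gunicorn managing Uvicorn workers
gunicorn app.main:app -k uvicorn.workers.UvicornWorker \
  --workers 4 --bind 0.0.0.0:$PORT

# Granian serving the same ASGI app
granian --interface asgi --host 0.0.0.0 --port $PORT --workers 4 app.main:app
```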