a year ago
(Sorry for the long post -- I did a lot of digging and wanted to show my work)
The attached image shows a large number of [DEBUG] Ignoring EPIPE messages
captured while running e2e tests (filtered to show just the EPIPE messages)
Observations
We have a long-running but intermittent issue where our app times out waiting on the client.
My hunch is that the network connection from the Railway proxy to our app hangs; the app patiently waits until the timeout, then aborts with the error
[101] [ERROR] Error handling request (no URI read)
We can trigger timeouts when running Cypress end-to-end tests. Our hypothesis is that issuing a lot of requests quickly makes the problem more visible.
The errors happen on different requests; the workers don't appear to be spinning doing work, they appear to be waiting.
This happens in all of our Railway environments. It has never happened in local development (where we are not running gunicorn).
New error from Gunicorn with debug logging on
Ignoring EPIPE
While reading about this error, we found a gunicorn + nginx setting:
proxy_ignore_client_abort
https://github.com/benoitc/gunicorn/issues/1695 - discussion of Nginx and Gunicorn
Around the same time, we see an occasional Postgres log too:
2024-05-22 18:55:27.971 UTC [402] LOG: could not receive data from client: Connection reset by peer
What steps we have taken
Added Sentry to Gunicorn to capture errors
Increased timeout to 120s
Increased workers to 2
Turned on Gunicorn debug logging
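For reference, these changes roughly correspond to a gunicorn.conf.py like the sketch below (not our exact config; SENTRY_DSN is an assumed environment variable name):

# gunicorn.conf.py -- rough sketch of the settings listed above
import os

import sentry_sdk

timeout = 120        # increased to 120 seconds (gunicorn's default is 30)
workers = 2          # increased worker count
loglevel = "debug"   # this is what surfaces the "[DEBUG] Ignoring EPIPE" lines


def post_fork(server, worker):
    # Initialize Sentry in each worker process so worker-level errors are captured.
    # SENTRY_DSN is an assumed variable name; adjust for your setup.
    sentry_sdk.init(dsn=os.environ.get("SENTRY_DSN"))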
Reverse chronological posts with related issues
4/9/2024 Worker timeout
https://discord.com/channels/713503345364697088/1227240036169416765/1227240036169416765
Person didn’t follow up, no resolution
3/12/2024 [CRITICAL] WORKER TIMEOUT (pid:27)
https://discord.com/channels/713503345364697088/1217311105173426217/1217311105173426217
Customer complains of infrequent timeouts
- Brody - none of those are error messages, please add error handling and verbose debug logging to your application code so that you can track any issues or anomalies, without such logs there is unfortunately not much I can help with here
2/5/2024 Django with Gunicorn gives frequent [CRITICAL] WORKER TIMEOUTs on simple requests
https://discord.com/channels/713503345364697088/1204045940286160967/1204045940286160967
Intermittent issues
Brody - one customer had bad db credentials
Problem with Whitenoise static files; one person moved to Google Cloud static file hosting
12/30/2023 [CRITICAL] WORKER TIMEOUT (pid:13) error
https://discord.com/channels/713503345364697088/1190838282834153632/1190838282834153632
Increasing the timeout and threads seemed to solve it
10/25/2023 Sudden gunicorn errors. App unusable.
https://discord.com/channels/713503345364697088/1166657973716783165/1166657973716783165
Increased workers and threads
--workers=4 --threads=4
5/17/2023 [CRITICAL] WORKER TIMEOUT (pid:155)
https://discord.com/channels/713503345364697088/1108320712239886357/1108320712239886357
Longer timeout for api calls to OpenAPI
5/2/2023 Flask application contains deploy error WORKER TIMEOUT
https://discord.com/channels/713503345364697088/1102971338387378227/1102971338387378227
Increase timeout for slow loading app during deploy
4/10/2023 Increasing Worker Timeout
https://discord.com/channels/713503345364697088/1095208516995792987/1095208516995792987
Brody suggests setting PYTHONUNBUFFERED=1 in the variables pane
3/23/2023 CRITICAL WORKER TIMEOUT ISSUE
https://discord.com/channels/713503345364697088/1088693739163234424/1088693739163234424
12/20/2022 worker timeout
https://discord.com/channels/713503345364697088/1054724293512151090/1054724293512151090
Increased timeout in gunicorn; it didn't help. They thought it was a memory kill error
Why this problem is hard
Troubleshooting intermittent problems is always hard, especially when they are difficult to reproduce.
When running our e2e tests this morning, I was able to get the app to hang several times; trying again just now, it ran smoothly.
Here's a recent stack trace showing where this is happening in gunicorn.
[2024-05-21 20:57:59 +0000] [101] [INFO] Booting worker with pid: 101
[2024-05-21 21:00:00 +0000] [22] [CRITICAL] WORKER TIMEOUT (pid:101)
[2024-05-21 21:00:00 +0000] [101] [ERROR] Error handling request (no URI read)
Traceback (most recent call last):
File "/opt/venv/lib/python3.11/site-packages/gunicorn/workers/sync.py", line 134, in handle
req = next(parser)
^^^^^^^^^^^^
File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/parser.py", line 42, in __next__
self.mesg = self.mesg_class(self.cfg, self.unreader, self.source_addr, self.req_count)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/message.py", line 257, in __init__
super().__init__(cfg, unreader, peer_addr)
File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/message.py", line 60, in __init__
unused = self.parse(self.unreader)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/message.py", line 269, in parse
self.get_data(unreader, buf, stop=True)
File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/message.py", line 260, in get_data
data = unreader.read()
^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/unreader.py", line 37, in read
d = self.chunk()
^^^^^^^^^^^^
File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/unreader.py", line 64, in chunk
return self.sock.recv(self.mxchunk)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.11/site-packages/gunicorn/workers/base.py", line 203, in handle_abort
sys.exit(1)
SystemExit: 1
[2024-05-21 21:00:00 +0000] [101] [INFO] Worker exiting (pid: 101)
Sentry is attempting to send 2 pending events
Waiting up to 2 seconds
Press Ctrl-C to quit
[2024-05-21 21:00:01 +0000] [103] [INFO] Booting worker with pid: 103
(also if there's a better venue or way to ask about this problem, please let me know)
a year ago
first off, a lot of those threads you linked were just user errors but they didn't have any logging of any kind so they had a very hard time narrowing down the issue.
second, I really do appreciate all the work you've put into this issue so far!
something quick I want you to test: turn on the new edge proxy in your service settings, wait 10 minutes, and try your tests again
Hi Brody. I saw you on a lot of those threads asking for more logging and more details. I tried to come prepared before asking for help.
a year ago
and you sure did!
I'm definitely down to try new networking because this feels like a hangup at the network layer.
a year ago
I guess I'd enable it on the pre-prod? but it's your call
That works for me. I definitely like testing outside of prod before flipping switches.
a year ago
it's also much easier to make tweaks to the new proxy! so it's beneficial for you to be on that instead of Envoy
Ok, do you need anything from me? Otherwise I'd say please proceed with moving the non-production environments to the new edge proxy.
a year ago
oh you don't need me to do anything, go into your service settings and enable the new proxy
Not every EPIPE has a corresponding WORKER TIMEOUT, but here's a pretty clear one from our e2e tests in a PR environment
a year ago
also, I don't work for Railway haha
a year ago
not qualified
a year ago
lmao
a year ago
that's it!
a year ago
I'd say you should wait a safe 10 minutes before you are switched over; DNS updates take a bit to propagate
a year ago
so that you know you are on the new proxy, you can inspect the Server header in the response headers of a request:
railway-edge -> edge proxy (new)
railway -> envoy (old proxy)
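For example, a quick way to check (just a sketch; the URL is a placeholder for your own Railway service domain):

# Inspect the Server response header to see which proxy answered the request.
import requests

resp = requests.get("https://your-app.up.railway.app/")  # placeholder URL
print(resp.headers.get("Server"))  # "railway-edge" = new edge proxy, "railway" = envoy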
a year ago
thank you very much! ❤️
I'm excited about the edge proxy. While waiting for DNS, if it doesn't make this go away, what are our next options? You mentioned something about it being more configurable?
a year ago
possibly bring mig in, the person who built the new proxy
a year ago
hey Ryan did the new proxy solve or change anything?
So far so good. Cypress has been running faster than ever and not getting hung up like it used to nearly every run. I'm about ready to flip on the new proxy in production.
a year ago
that's awesome news
Yeah, this has been bugging me for a while. I've even been looking at my home network configs, thinking it was something with my complicated IoT setup.
a year ago
it's always the worst when you start doubting your own hardware