Intermittent WORKER TIMEOUTS -- appear like a client hanging up

rkm3FREE

a year ago

(Sorry for the long post -- I did a lot of digging and wanted to show my work)
Attached image shows a large number of [DEBUG] Ignoring EPIPE messages captured while running e2e tests (filtered to show only the EPIPE messages)

Observations

  • We have a long-running but intermittent issue where our app times out waiting on the client.

  • My hunch is that the network connection from the Railway proxy to our app hangs; the app patiently waits until the timeout, then aborts with the error [101] [ERROR] Error handling request (no URI read)

  • We can trigger timeouts when running Cypress end-to-end tests. Our hypothesis is that issuing a lot of requests quickly makes the problem more visible.

  • The errors happen on different requests; the workers don’t appear to be spinning on work, they appear to be waiting.

  • This happens in all of our Railway environments. It has never happened in local development (where we don't run gunicorn).

  • With debug logging on, Gunicorn reports a new message: Ignoring EPIPE (illustrated below)

Around the same time, we see an occasional Postgres log too: 2024-05-22 18:55:27.971 UTC [402] LOG: could not receive data from client: Connection reset by peer
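
For context on what those two messages mean, here's my own small illustration (hypothetical, not code from our app): both EPIPE and "Connection reset by peer" are what a server sees when it keeps using a TCP connection whose peer has already gone away, e.g. a proxy connection that was torn down.

# epipe_demo.py -- hypothetical illustration of EPIPE / "connection reset by peer":
# writing to a socket after the peer has closed it. Gunicorn logs the write-side
# version of this as "[DEBUG] Ignoring EPIPE"; Postgres logs the read-side version
# as "could not receive data from client: Connection reset by peer".
import socket
import time

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

client = socket.create_connection(("127.0.0.1", port))
conn, _ = server.accept()

client.close()            # the peer disappears, like a dropped proxy connection
time.sleep(0.1)           # let the FIN/RST reach the server side

try:
    for _ in range(5):    # a few writes are usually needed before the error surfaces
        conn.sendall(b"x" * 65536)
        time.sleep(0.05)
except (BrokenPipeError, ConnectionResetError) as exc:
    print("write failed:", exc)   # EPIPE or ECONNRESET, depending on timing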

What steps we have taken

  • Added Sentry to Gunicorn to capture errors

  • Increased timeout to 120s

  • Increased workers to 2

  • Turned on Gunicorn debug logging (the combined config is sketched after this list)
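
For reference, here's roughly what our Gunicorn config looks like after those changes. This is a sketch from memory; the file name, the Sentry wiring via a SENTRY_DSN env var, and the comments are approximations rather than a copy of our repo.

# gunicorn.conf.py -- approximate current settings (values as described above)
import os

import sentry_sdk

# Report worker errors (including the SystemExit raised on abort) to Sentry.
sentry_sdk.init(dsn=os.environ.get("SENTRY_DSN"))

workers = 2          # bumped from 1
timeout = 120        # seconds a sync worker may go silent before the arbiter kills it (default 30)
loglevel = "debug"   # surfaces messages like "[DEBUG] Ignoring EPIPE"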


rkm3FREE

a year ago

5d77fb37-1fba-4a55-bbd3-e0fab8243723


rkm3FREE

a year ago

Earlier posts with related issues, in reverse chronological order

4/9/2024 Worker timeout

3/12/2024 [CRITICAL] WORKER TIMEOUT (pid:27)

- Brody - none of those are error messages, please add error handling and verbose debug logging to your application code so that you can track any issues or anomalies, without such logs there is unfortunately not much I can help with here

2/5/2024 Django with Gunicorn gives frequent [CRITICAL] WORKER TIMEOUTs on simple requests

12/30/2023 [CRITICAL] WORKER TIMEOUT (pid:13) error

10/25/2023 Sudden gunicorn errors. App unusable.

5/17/2023 [CRITICAL] WORKER TIMEOUT (pid:155)

5/2/2023 Flask application contains deploy error WORKER TIMEOUT

4/10/2023 Increasing Worker Timeout


rkm3FREE

a year ago

3/23/2023 CRITICAL WORKER TIMEOUT ISSUE

- https://discord.com/channels/713503345364697088/1088693739163234424/1088693739163234424

12/20/2022 worker timeout


rkm3FREE

a year ago

^ Those are the messages I read and summarized before starting this chat.


rkm3FREE

a year ago

Why this problem is hard

Troubleshooting intermittent problems is always hard, especially when they're difficult to reproduce.
When running our e2e tests this morning, I was able to get the app to hang several times; trying again just now, it ran smoothly.


rkm3FREE

a year ago

Here's a recent stack trace showing where this is happening in gunicorn.

[2024-05-21 20:57:59 +0000] [101] [INFO] Booting worker with pid: 101
[2024-05-21 21:00:00 +0000] [22] [CRITICAL] WORKER TIMEOUT (pid:101)
[2024-05-21 21:00:00 +0000] [101] [ERROR] Error handling request (no URI read)
Traceback (most recent call last):
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/workers/sync.py", line 134, in handle
    req = next(parser)
          ^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/parser.py", line 42, in __next__
    self.mesg = self.mesg_class(self.cfg, self.unreader, self.source_addr, self.req_count)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/message.py", line 257, in __init__
    super().__init__(cfg, unreader, peer_addr)
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/message.py", line 60, in __init__
    unused = self.parse(self.unreader)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/message.py", line 269, in parse
    self.get_data(unreader, buf, stop=True)
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/message.py", line 260, in get_data
    data = unreader.read()
           ^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/unreader.py", line 37, in read
    d = self.chunk()
        ^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/unreader.py", line 64, in chunk
    return self.sock.recv(self.mxchunk)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/workers/base.py", line 203, in handle_abort
    sys.exit(1)
SystemExit: 1
[2024-05-21 21:00:00 +0000] [101] [INFO] Worker exiting (pid: 101)
Sentry is attempting to send 2 pending events
Waiting up to 2 seconds
Press Ctrl-C to quit
[2024-05-21 21:00:01 +0000] [103] [INFO] Booting worker with pid: 103
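
My read of the trace: the sync worker is parked in sock.recv() waiting for the first bytes of a request (hence "no URI read"), and when the arbiter's timeout expires it aborts the worker via handle_abort() -> sys.exit(1). A way to provoke the same trace locally would be to open a connection to Gunicorn and send nothing. This is just my sketch (host/port are placeholders), not something I've actually run against our app yet:

# hold_open.py -- hypothetical repro: connect to a Gunicorn sync worker and go silent.
# The worker blocks in unreader.read()/sock.recv() exactly as in the traceback above,
# and after `timeout` seconds the arbiter logs WORKER TIMEOUT and aborts it.
import socket
import time

HOST, PORT = "127.0.0.1", 8000   # wherever Gunicorn is listening locally

conn = socket.create_connection((HOST, PORT))
try:
    time.sleep(130)              # longer than our 120s Gunicorn timeout; send nothing
finally:
    conn.close()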

rkm3FREE

a year ago

Worker timeout error in sentry. First seen 7 months ago.

(screenshot attached)


rkm3FREE

a year ago

(also if there's a better venue or way to ask about this problem, please let me know)


Brody

a year ago

first off, a lot of those threads you linked were just user errors, but they didn't have any logging of any kind, so they had a very hard time narrowing down the issue.

second, I really do appreciate all the work you've put into this issue so far!

something quick I want you to test: turn on the new edge proxy in your service settings, wait 10 minutes, and try your tests again


rkm3FREE

a year ago

Hi Brody. I saw you on a lot of those threads asking for more logging and more details. I tried to come prepared before asking for help.


Brody

a year ago

and you sure did!


rkm3FREE

a year ago

For the new edge proxy, would you do all envs or just some?


rkm3FREE

a year ago

We have about 5 open PR envs, a pre-production env, and a production env.


rkm3FREE

a year ago

I'm definitely down to try new networking because this feels like a hangup at the network layer.


Brody

a year ago

I guess I'd enable it on the pre-prod? but it's your call


rkm3FREE

a year ago

That works for me. I definitely like testing outside of prod before flipping switches.


Brody

a year ago

it's also much easier to make tweaks to the new proxy! so it's beneficial for you to be on that instead of Envoy


rkm3FREE

a year ago

Ok, do you need anything from me? Otherwise I'd say please proceed with moving the non-production environments to the new edge proxy.


Brody

a year ago

oh you don't need me to do anything, go into your service settings and enable the new proxy


rkm3FREE

a year ago

Not every EPIPE has a corresponding WORKER TIMEOUT, but here's a pretty clear one from our e2e tests in a PR environment

(screenshot attached)


Brody

a year ago

also, I don't work for Railway haha


rkm3FREE

a year ago

Glad I asked.


rkm3FREE

a year ago

Wait, how have they not hired you?


Brody

a year ago

not qualified


rkm3FREE

a year ago

Hero status isn't enough?


Brody

a year ago

lmao


rkm3FREE

a year ago

Project > Settings > Edge Proxy

(screenshot attached)


Brody

a year ago

that's it!


rkm3FREE

a year ago

(posting for posterity)


Brody

a year ago

I'd say you should wait a safe 10 minutes before you're fully switched over; DNS updates take a bit to propagate


Brody

a year ago

so that you know you're on the new proxy, you can inspect the Server header in the response headers of a request -
railway-edge -> edge proxy
railway -> envoy (old proxy)
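
for example, something like this would show it (just a quick sketch, swap the placeholder URL for your own domain):

# check_proxy.py -- print the Server header to see which proxy answered
import urllib.request

URL = "https://your-app.up.railway.app"  # placeholder, use your service's public domain

with urllib.request.urlopen(URL) as resp:
    print(resp.headers.get("Server"))    # "railway-edge" = edge proxy, "railway" = Envoy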


rkm3FREE

a year ago

Gotta love the DNS magic. I'll get a cup of coffee (and send you a 🚅 ☕)


Brody

a year ago

thank you very much! ❤️


rkm3FREE

a year ago

I'm excited about the edge proxy. While waiting for DNS, if it doesn't make this go away, what are our next options? You mentioned something about it being more configurable?


rkm3FREE

a year ago

"it's also much easier to make tweaks to the new proxy"


Brody

a year ago

possibly bring mig in, the person who built the new proxy


Brody

a year ago

hey Ryan, did the new proxy solve or change anything?


rkm3FREE

a year ago

So far so good. Cypress has been running faster than ever and not getting hung up like it used to nearly every run. I'm about ready to flip on the new proxy in production.


Brody

a year ago

that's awesome news


rkm3FREE

a year ago

Yeah, this has been bugging me for a while. I've even been looking at my home network configs, thinking it was something with my complicated IoT setup.


Brody

a year ago

it's always the worst when you start doubting your own hardware