Intermittent WORKER TIMEOUTS -- appear like a client hanging up

rkm3FREE

a year ago

(Sorry for the long post -- I did a lot of digging and wanted to show my work)
Attached image shows a large number of [DEBUG] Ignoring EPIPE messages captured while running e2e tests (filtered to show only the EPIPE messages)

Observations

  • We have a long-running but intermittent issue where our app times out waiting on the client.

  • My hunch is that the network connection from the Railway proxy to our app hangs; the app patiently waits until the timeout, then aborts with the error [101] [ERROR] Error handling request (no URI read)

  • We can trigger timeouts when running Cypress end-to-end tests. Our hypothesis is that issuing a lot of requests quickly makes the problem more visible.

  • The errors happen on different requests; the workers don’t appear to be spinning on work, they appear to be waiting.

  • This happens in all of our Railway environments. It has never happened in local development (where we don't run gunicorn).

  • With debug logging on, Gunicorn reports a new message: Ignoring EPIPE (illustrated below)

Around the same time, we see an occasional Postgres log too: 2024-05-22 18:55:27.971 UTC [402] LOG: could not receive data from client: Connection reset by peer
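
For context on what those two messages mean, here's my own small illustration (hypothetical, not code from our app): both EPIPE and "Connection reset by peer" are what a server sees when it keeps using a TCP connection whose peer has already gone away, e.g. a proxy connection that was torn down.

# epipe_demo.py -- hypothetical illustration of EPIPE / "connection reset by peer":
# writing to a socket after the peer has closed it. Gunicorn logs the write-side
# version of this as "[DEBUG] Ignoring EPIPE"; Postgres logs the read-side version
# as "could not receive data from client: Connection reset by peer".
import socket
import time

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

client = socket.create_connection(("127.0.0.1", port))
conn, _ = server.accept()

client.close()            # the peer disappears, like a dropped proxy connection
time.sleep(0.1)           # let the FIN/RST reach the server side

try:
    for _ in range(5):    # a few writes are usually needed before the error surfaces
        conn.sendall(b"x" * 65536)
        time.sleep(0.05)
except (BrokenPipeError, ConnectionResetError) as exc:
    print("write failed:", exc)   # EPIPE or ECONNRESET, depending on timing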

What steps we have taken

  • Added Sentry to Gunicorn to capture errors

  • Increased timeout to 120s

  • Increased workers to 2

  • Turned on Gunicorn debug logging (the combined config is sketched after this list)
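
For reference, here's roughly what our Gunicorn config looks like after those changes. This is a sketch from memory; the file name, the Sentry wiring via a SENTRY_DSN env var, and the comments are approximations rather than a copy of our repo.

# gunicorn.conf.py -- approximate current settings (values as described above)
import os

import sentry_sdk

# Report worker errors (including the SystemExit raised on abort) to Sentry.
sentry_sdk.init(dsn=os.environ.get("SENTRY_DSN"))

workers = 2          # bumped from 1
timeout = 120        # seconds a sync worker may go silent before the arbiter kills it (default 30)
loglevel = "debug"   # surfaces messages like "[DEBUG] Ignoring EPIPE"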


rkm3FREE

a year ago

5d77fb37-1fba-4a55-bbd3-e0fab8243723


rkm3FREE

a year ago

Earlier posts with related issues, in reverse chronological order

4/9/2024 Worker timeout

3/12/2024 [CRITICAL] WORKER TIMEOUT (pid:27)

- Brody - none of those are error messages, please add error handling and verbose debug logging to your application code so that you can track any issues or anomalies, without such logs there is unfortunately not much I can help with here

2/5/2024 Django with Gunicorn gives frequent [CRITICAL] WORKER TIMEOUTs on simple requests

12/30/2023 [CRITICAL] WORKER TIMEOUT (pid:13) error

10/25/2023 Sudden gunicorn errors. App unusable.

5/17/2023 [CRITICAL] WORKER TIMEOUT (pid:155)

5/2/2023 Flask application contains deploy error WORKER TIMEOUT

4/10/2023 Increasing Worker Timeout


rkm3FREE

a year ago

3/23/2023 CRITICAL WORKER TIMEOUT ISSUE

- https://discord.com/channels/713503345364697088/1088693739163234424/1088693739163234424

12/20/2022 worker timeout


rkm3FREE

a year ago

^ Those are the messages I read and summarized before starting this chat.


rkm3FREE

a year ago

Why this problem is hard

Troubleshooting intermittent problems is always hard, especially when they're difficult to reproduce.
When running our e2e tests this morning, I was able to get the app to hang several times; trying again just now, it ran smoothly.


rkm3FREE

a year ago

Here's a recent stack trace showing where this is happening in gunicorn.

[2024-05-21 20:57:59 +0000] [101] [INFO] Booting worker with pid: 101
[2024-05-21 21:00:00 +0000] [22] [CRITICAL] WORKER TIMEOUT (pid:101)
[2024-05-21 21:00:00 +0000] [101] [ERROR] Error handling request (no URI read)
Traceback (most recent call last):
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/workers/sync.py", line 134, in handle
    req = next(parser)
          ^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/parser.py", line 42, in __next__
    self.mesg = self.mesg_class(self.cfg, self.unreader, self.source_addr, self.req_count)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/message.py", line 257, in __init__
    super().__init__(cfg, unreader, peer_addr)
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/message.py", line 60, in __init__
    unused = self.parse(self.unreader)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/message.py", line 269, in parse
    self.get_data(unreader, buf, stop=True)
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/message.py", line 260, in get_data
    data = unreader.read()
           ^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/unreader.py", line 37, in read
    d = self.chunk()
        ^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/unreader.py", line 64, in chunk
    return self.sock.recv(self.mxchunk)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/workers/base.py", line 203, in handle_abort
    sys.exit(1)
SystemExit: 1
[2024-05-21 21:00:00 +0000] [101] [INFO] Worker exiting (pid: 101)
Sentry is attempting to send 2 pending events
Waiting up to 2 seconds
Press Ctrl-C to quit
[2024-05-21 21:00:01 +0000] [103] [INFO] Booting worker with pid: 103
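
My read of the trace: the sync worker is parked in sock.recv() waiting for the first bytes of a request (hence "no URI read"), and when the arbiter's timeout expires it aborts the worker via handle_abort() -> sys.exit(1). A way to provoke the same trace locally would be to open a connection to Gunicorn and send nothing. This is just my sketch (host/port are placeholders), not something I've actually run against our app yet:

# hold_open.py -- hypothetical repro: connect to a Gunicorn sync worker and go silent.
# The worker blocks in unreader.read()/sock.recv() exactly as in the traceback above,
# and after `timeout` seconds the arbiter logs WORKER TIMEOUT and aborts it.
import socket
import time

HOST, PORT = "127.0.0.1", 8000   # wherever Gunicorn is listening locally

conn = socket.create_connection((HOST, PORT))
try:
    time.sleep(130)              # longer than our 120s Gunicorn timeout; send nothing
finally:
    conn.close()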

rkm3FREE

a year ago

Worker timeout error in sentry. First seen 7 months ago.

(screenshot attached)


rkm3FREE

a year ago

(also if there's a better venue or way to ask about this problem, please let me know)


Brody

a year ago

first off, a lot of those threads you linked were just user errors, but they didn't have any logging of any kind, so they had a very hard time narrowing down the issue.

second, I really do appreciate all the work you've put into this issue so far!

something quick I want you to test: turn on the new edge proxy in your service settings, wait 10 minutes, and try your tests again


rkm3FREE

a year ago

Hi Brody. I saw you on a lot of those threads asking for more logging and more details. I tried to come prepared before asking for help.


Brody

a year ago

and you sure did!


rkm3FREE

a year ago

For the new edge proxy, would you do all envs or just some?


rkm3FREE

a year ago

We have about 5 open PR envs, a pre-production env, and a production env.


rkm3FREE

a year ago

I'm definitely down to try new networking because this feels like a hangup at the network layer.


Brody

a year ago

I guess I'd enable it on the pre-prod? but it's your call


rkm3FREE

a year ago

That works for me. I definitely like testing outside of prod before flipping switches.


Brody

a year ago

it's also much easier to make tweaks to the new proxy! so it's beneficial for you to be on that instead of Envoy


rkm3FREE

a year ago

Ok, do you need anything from me? Otherwise I'd say please proceed with moving the non-production environments to the new edge proxy.


Brody

a year ago

oh you don't need me to do anything, go into your service settings and enable the new proxy


rkm3FREE

a year ago

Not every EPIPE has a corresponding WORKER TIMEOUT, but here's a pretty clear one from our e2e tests in a PR environment

(screenshot attached)


Brody

a year ago

also, I don't work for Railway haha


rkm3FREE

a year ago

Glad I asked.


rkm3FREE

a year ago

Wait, how have they not hired you?


Brody

a year ago

not qualified


rkm3FREE

a year ago

Hero status isn't enough?


Brody

a year ago

lmao


rkm3FREE

a year ago

Project > Settings > Edge Proxy

(screenshot attached)


Brody

a year ago

that's it!


rkm3FREE

a year ago

(posting for posterity)


Brody

a year ago

I'd say you should wait a safe 10 minutes before you're fully switched over; DNS updates take a bit to propagate


Brody

a year ago

so that you know you're on the new proxy, you can inspect the Server header in the response headers of a request -
railway-edge -> edge proxy
railway -> envoy (old proxy)
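
for example, something like this would show it (just a quick sketch, swap the placeholder URL for your own domain):

# check_proxy.py -- print the Server header to see which proxy answered
import urllib.request

URL = "https://your-app.up.railway.app"  # placeholder, use your service's public domain

with urllib.request.urlopen(URL) as resp:
    print(resp.headers.get("Server"))    # "railway-edge" = edge proxy, "railway" = Envoy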


rkm3FREE

a year ago

Gotta love the DNS magic. I'll get a cup of coffee (and send you a 🚅 ☕)


Brody

a year ago

thank you very much! ❤️


rkm3FREE

a year ago

I'm excited about the edge proxy. While waiting for DNS, if it doesn't make this go away, what are our next options? You mentioned something about it being more configurable?


rkm3FREE

a year ago

"it's also much easier to make tweaks to the new proxy"


Brody

a year ago

possibly bring mig in, the person who built the new proxy


Brody

a year ago

hey Ryan, did the new proxy solve or change anything?


rkm3FREE

a year ago

So far so good. Cypress has been running faster than ever and not getting hung up like it used to nearly every run. I'm about ready to flip on the new proxy in production.


Brody

a year ago

that's awesome news


rkm3FREE

a year ago

Yeah, this has been bugging me for a while. I've even been looking at my home network configs, thinking it was something with my complicated IoT setup.


Brody

a year ago

it's always the worst when you start doubting your own hardware