Webhook endpoint experiencing 4-7s delays in request body delivery
kovesamui
PROOP

16 days ago

Service: LINE Translation Bot (Python/Flask/Gunicorn)
Region: Southeast Asia
Started: ~8+ hours ago, no code changes preceded the issue

Problem:
Our LINE Messaging API webhook endpoint is timing out. LINE requires webhook responses within ~5 seconds, but our endpoint is taking 4-7 seconds - almost entirely spent reading the request body from the WSGI input stream.

Diagnostic Evidence:
We added timing instrumentation around request.get_data() (Flask/Werkzeug reading the body from gunicorn's WSGI input). Results:

[WEBHOOK-DIAG] request.get_data() took 4310ms (body=848 bytes)
[WEBHOOK-SLOW] Response took 4322ms | body_read=4310ms, handler=11ms, overhead=1ms

[WEBHOOK-DIAG] request.get_data() took 6492ms (body=848 bytes)
[WEBHOOK-SLOW] Response took 6496ms | body_read=6492ms, handler=2ms, overhead=2ms
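For anyone who wants to reproduce this kind of measurement, the instrumentation amounts to timing the body read itself. A stdlib-only sketch of what the timer wraps (Flask's request.get_data() ultimately does this read; the environ below is simulated, not a real gunicorn request):

```python
import io
import time

def timed_body_read(environ):
    """Time reading the request body from the WSGI input stream.

    Mirrors what Flask's request.get_data() ultimately does:
    read Content-Length bytes from environ["wsgi.input"].
    """
    t0 = time.perf_counter()
    length = int(environ.get("CONTENT_LENGTH") or 0)
    body = environ["wsgi.input"].read(length)
    read_ms = (time.perf_counter() - t0) * 1000
    print(f"[WEBHOOK-DIAG] body read took {read_ms:.0f}ms (body={len(body)} bytes)")
    return body

# Simulated environ standing in for a real gunicorn request:
environ = {"CONTENT_LENGTH": "848", "wsgi.input": io.BytesIO(b"x" * 848)}
body = timed_body_read(environ)
```

In production the same timer sits around request.get_data() in the Flask view; the numbers above come from that.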

Key observations:

  • request.get_data() takes 4,310–6,492ms for an 848-byte POST body

  • Our application handler code runs in 2–11ms

  • Overhead (signature check, logging) is 1–2ms

  • 99.5%+ of total latency is in body delivery, not application code

  • This affects tiny payloads (848 bytes) — not a large body issue

  • The issue started suddenly with no code or config changes on our end

  • The service ran fine for weeks prior

What we've ruled out:

  1. Application code — handler executes in 2-11ms, confirmed via instrumentation

  2. Gunicorn worker exhaustion — tested with both sync workers (4) and gthread (4 workers × 4 threads), same result

  3. Handler blocking — all event handlers fast-pathed to return immediately (queue to background worker or thread pool)

  4. Redelivery cascade — even fresh single-event webhooks show the same 4-7s body read delay
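For context, the fast-path in (3) has roughly this shape — a sketch, with handle_event standing in for the real translation work and the pool size illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# Background pool: event processing happens off the request thread.
pool = ThreadPoolExecutor(max_workers=8)

def handle_event(event):
    # Stand-in for the real per-event work (translation, LINE reply API call).
    return event

def webhook_fast_path(events):
    """Queue each event and return immediately, so the HTTP response
    is never blocked on event processing."""
    for event in events:
        pool.submit(handle_event, event)
    return "OK"  # LINE only needs a fast 200; replies go out of band
```

Even with this in place, total response time stays at 4-7s because the body read happens before the handler ever runs.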

Configuration:

web: gunicorn app:app --bind 0.0.0.0:$PORT --workers 4 --timeout 30 --max-requests 1000 --max-requests-jitter 200 --log-level info

Request: Could you investigate whether there's a proxy/networking issue causing delayed request body forwarding to our gunicorn processes? The delay pattern (4-7s for sub-1KB bodies) suggests something between Railway's edge proxy and our application container is buffering or stalling.

$20 Bounty

4 Replies

kovesamui
PROOP

16 days ago

Follow-up with additional diagnostic data:

After deploying timing instrumentation, we've confirmed the bottleneck is in request body delivery from your proxy to our gunicorn process:

request.get_data() took 7082ms (body=862 bytes) — handler=24ms
request.get_data() took 4066ms (body=879 bytes) — handler=0ms
request.get_data() took 2699ms (body=948 bytes) — handler=0ms
request.get_data() took 1543ms (body=924 bytes) — handler=10ms

Sub-1KB bodies, 1.5-7s to read. Our handler code is 0-24ms consistently. Switching to gevent workers mitigated the queuing impact but the underlying proxy delay persists.


15 days ago

Hello,

There are no buffers or stalls on our Asia proxies. I have tested here by making a request to the same proxy that handled some of your longer /webhook requests.

The application read 848 bytes from our Asia proxy in 8.95 microseconds.

Note: I do not live in or around Asia, so I had to use a VPN, which explains the longer total request time.

Since this isn't an issue on our end, I will open this thread up for community involvement.

Status changed to Awaiting User Response Railway 15 days ago


grandmaster451
FREE

15 days ago

Since Railway's proxy confirmed it delivers the body in microseconds, the delay is happening somewhere in the TCP/WSGI layer between the proxy and your gunicorn process. A few things to investigate:

1. Check if LINE sends Transfer-Encoding: chunked

Log the raw headers of incoming LINE webhooks: print(dict(request.headers))

If LINE uses chunked transfer encoding, gunicorn's sync workers can be slow buffering the chunked body; try --worker-class=gthread explicitly. (--forwarded-allow-ips="*" is also worth setting behind Railway's proxy, but note it only controls which peers gunicorn trusts for X-Forwarded-* headers — it doesn't change how the body is read.)
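A small helper for that check (a sketch — call it with dict(request.headers) inside your view and log the result):

```python
def classify_body_framing(headers):
    """Report how a request body is framed, given a mapping of HTTP headers.

    Chunked transfer encoding means there is no Content-Length, so the
    server must read until the terminating zero-length chunk.
    """
    if "chunked" in headers.get("Transfer-Encoding", "").lower():
        return "chunked"
    content_length = headers.get("Content-Length")
    if content_length is not None:
        return f"content-length={content_length}"
    return "no framing header"
```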

2. Try switching to an async worker

Replace sync workers with gevent or eventlet:

RUN pip install gevent

gunicorn app:app --worker-class=gevent --workers=4 --timeout=30

You mentioned gevent "mitigated queuing" — if body read time still shows 1-7s, the bottleneck may be TCP receive window or slow network path from LINE's Japan servers to SE Asia Railway.

3. Add a Content-Length check

LINE sends a Content-Length header. Check whether request.content_length is set. If it's missing or wrong, Werkzeug can't do a bounded read of the exact byte count and may block reading until the connection closes.

4. Use request.stream instead of request.get_data()

body = request.stream.read()

This bypasses Werkzeug's buffering logic and reads directly from the WSGI stream, which may be faster in some cases.
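If you do read the stream yourself, pass the expected byte count explicitly. This is an illustrative sketch; whether it helps depends on how the WSGI server implements the input stream:

```python
import io

def read_body_exact(stream, content_length):
    """Read exactly content_length bytes from a WSGI input stream.

    An explicit byte count avoids read-until-EOF semantics, which can
    stall if the peer holds the connection open after sending the body.
    """
    if content_length is None:
        return stream.read()  # no Content-Length: fall back to unbounded read
    return stream.read(content_length)

# In a Flask view this would be (illustrative):
#   body = read_body_exact(request.stream, request.content_length)
```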

5. Check LINE's outbound IPs and routing

LINE webhooks originate from Japan. The network path Japan -> Singapore can occasionally degrade. Check if the issue correlates with specific times of day (peak traffic hours in Japan/SE Asia).


kovesamui
PROOP

15 days ago

Thanks Brody and @grandmaster451 for the thorough investigation - really appreciate the help.

Brody's proxy test (848 bytes in 8.95μs) confirms Railway isn't the bottleneck, which is reassuring. And @grandmaster451, your point #5 about the Japan → Singapore network path is what we now believe caused the issue.

Our setup: we're a LINE Messaging API bot receiving webhooks from LINE's servers in Japan, running on Railway's Singapore region. LINE has a fairly tight webhook timeout (~3-5 seconds for the full roundtrip). When we reviewed our own application logs during the error period, everything was healthy - body reads completing in <1ms, zero errors, queue depths normal. The timeouts were happening before requests even reached our app.

Best theory: transient network degradation on the Japan → Singapore path (or LINE's outbound webhook infra under load). The errors gradually tapered off overnight and haven't returned during today's peak hours so far.

What we've done on our end:

  • Switched to gevent workers to handle any slow connections gracefully

  • Added diagnostic logging for Content-Length/Transfer-Encoding on slow body reads (>100ms) so we'll have better data if it recurs

  • Added a reliability tracking dashboard to monitor first-delivery success rates going forward

  • Disabled LINE's webhook redelivery to prevent retry amplification if it happens again
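The slow-read diagnostic we added is roughly this shape (a sketch — the threshold and log format are illustrative, and read_fn wraps the actual request.get_data() call):

```python
import time

SLOW_READ_MS = 100  # threshold is illustrative

def read_body_with_diagnostics(read_fn, headers):
    """Time a body read and log the framing headers when it is slow.

    read_fn: zero-argument callable returning the body bytes
    headers: mapping of request headers
    """
    t0 = time.perf_counter()
    body = read_fn()
    elapsed_ms = (time.perf_counter() - t0) * 1000
    if elapsed_ms > SLOW_READ_MS:
        print(
            f"[SLOW-READ] {elapsed_ms:.0f}ms for {len(body)} bytes | "
            f"Content-Length={headers.get('Content-Length')} "
            f"Transfer-Encoding={headers.get('Transfer-Encoding')}"
        )
    return body
```

If the delay recurs, this should tell us immediately whether slow bodies correlate with chunked encoding or missing Content-Length.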

Since Railway doesn't offer a Japan region (which we assume would be the ideal fix for minimizing the network path to LINE), we'll keep monitoring and treat this as internet weather for now. Thanks again for ruling out the proxy side - saved us a lot of time chasing the wrong lead.
