a month ago
Hi Railway team,
Our Django web service has been intermittently broken since ~07:00 UTC today (2026-05-20).
Symptoms
- GET requests on public pages: usually fast (40–250 ms), but occasionally time out at 15 s.
- POST requests on the login endpoint (/auth/login/client/): consistently hang until Railway's edge closes the connection at exactly 15 s, then return "connection closed". Subsequent attempts time out at 30–60 s.
- This started after a routine git push deploy, not after any infra change on our side. A force-redeploy (empty commit) did not resolve it.
The 15 s cutoff matches Railway's edge proxy timeout, which suggests our
gunicorn workers are blocked on something (likely a backing service) rather
than crashing with a stack trace.
What we already verified on our side
- Django boots, serves the login page, and issues CSRF tokens normally.
- DB queries on GET work (page render includes DB-backed content).
- We added aggressive Redis SOCKET_CONNECT_TIMEOUT/SOCKET_TIMEOUT (2 s) with IGNORE_EXCEPTIONS: True so the cache fails fast on Redis trouble. Latest commit deployed; behavior unchanged after deploy.
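For reference, the timeout change described above looks roughly like the sketch below. It assumes the django-redis backend (where these option names come from) and a REDIS_URL environment variable; both are assumptions for illustration, and the actual settings in our project may differ in detail.

```python
import os

# Sketch of a Django CACHES setting, assuming the django-redis backend.
# REDIS_URL and its fallback value are illustrative assumptions.
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": os.environ.get("REDIS_URL", "redis://localhost:6379/0"),
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            # Give up on establishing the TCP connection after 2 s.
            "SOCKET_CONNECT_TIMEOUT": 2,
            # Give up on reads/writes on an established connection after 2 s.
            "SOCKET_TIMEOUT": 2,
            # Treat Redis errors as cache misses instead of raising.
            "IGNORE_EXCEPTIONS": True,
        },
    }
}
```

With these values a Redis stall should cost each cache call about 2 s at most, so a hang that still lasts 15 s suggests the block may be happening somewhere these settings do not reach.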
Strong hypothesis
The POST path is the first place our code touches Redis (rate-limit check
via django.core.cache). The fact that GET works while POST hangs 15 s+
fits a Redis service that accepts TCP connections but never responds. Could
you please check:
- Health of the Redis service attached to our project (memory, CPU, connection saturation, restarts).
- Whether there is any networking incident between our web service and the Redis service in our project's region.
- The current state of our latest deployment (commit 471b810).
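To tell "Redis refuses connections" apart from "Redis accepts TCP connections but never responds", a small probe along these lines could be run from inside the web service. This is a minimal sketch using only the Python standard library; the function name probe_redis and its return labels are made up for illustration, and it sends an unauthenticated inline PING, so a server that requires a password will answer with an error (which still counts as a response here).

```python
import socket


def probe_redis(host: str, port: int = 6379, timeout: float = 2.0) -> str:
    """Classify how a Redis endpoint behaves at the TCP level.

    Returns one of: "connect_failed", "no_response", "connection_error",
    "closed_by_peer", "responded". The labels are illustrative only.
    """
    try:
        # Covers refused connections, unreachable hosts, and connect timeouts.
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "connect_failed"

    with sock:
        sock.settimeout(timeout)
        try:
            # Inline-command form of PING; a healthy server replies "+PONG".
            sock.sendall(b"PING\r\n")
            reply = sock.recv(64)
        except socket.timeout:
            # TCP handshake succeeded but nothing came back in time.
            return "no_response"
        except OSError:
            return "connection_error"

    # An empty read means the peer closed the connection without replying.
    return "responded" if reply else "closed_by_peer"
```

Calling probe_redis with the Redis service's internal hostname and port would return "no_response" if the hypothesis above holds, and "connect_failed" if the service cannot be reached at all.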
Project info
- Web service URL: https://web-production-3b6ab.up.railway.app
1 Reply
a month ago
Your service's symptoms align with a major platform-wide service disruption that ran from 02:25 UTC to 07:57 UTC today, affecting traffic and networking across all regions. Your logs confirm that gunicorn workers are crashing because the Redis client cannot establish a TCP connection, which matches the internal networking disruption during that window. The incident is now resolved. If your service is still experiencing Redis connection failures, redeploying both the Redis service and the web service (in that order) should re-establish connectivity, as the Redis service hasn't been redeployed since March and may have stale network state from the disruption.
Status changed to Awaiting User Response Railway • about 1 month ago
Status changed to Solved unix0899 • about 1 month ago