2 months ago
Project/Service: Psych-PROD-BACK
Environment: Production
Date: 2026-03-19
Issue Type: Intermittent Redis connectivity resets (ECONNRESET)
Summary
We are experiencing intermittent Redis connection resets in production.
After adding explicit Redis connection instrumentation in the backend, we confirmed the issue is primarily affecting the BullMQ Redis client, with secondary impact on our distributed lock Redis client.
Evidence (from application logs)
Time window analyzed: around 2026-03-19T12:59:41Z to 2026-03-19T13:00:31Z
- Error: read ECONNRESET: 48
- queues.redis_client_error: 13
- queues.redis_client_reconnecting: 14
- queues.redis_client_connect: 13
- queues.redis_client_ready: 13
- queues.redis_client_close: 13
- distributed_lock.redis_error: 2
- distributed_lock.redis_reconnecting: 2
All occurrences in this sample were tied to replica:
941496eb-57cb-43df-88e9-b6536c78d28e
Interpretation
- The resets are not isolated to a single custom Redis usage; they affect BullMQ Redis clients directly.
- The pattern shows repeated short disconnections/reconnections (flapping).
- Since the events are concentrated on one replica, there may be node/replica-level network instability or a routing issue between that app instance and Redis.
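The replica concentration described above can be checked mechanically against an exported log file. The following is a minimal sketch, assuming each exported record carries `message` and `replicaId` fields. The `LogRecord` type, its field names, and the `tallyRedisEvents` helper are hypothetical, since the actual export schema is not shown in this thread.

```typescript
// Hypothetical shape of one exported log record; the real export schema may differ.
interface LogRecord {
  timestamp: string; // ISO 8601, e.g. "2026-03-19T12:59:41Z"
  message: string;   // e.g. "queues.redis_client_error"
  replicaId: string;
}

// Count Redis-related events per replica, then per event name.
function tallyRedisEvents(records: LogRecord[]): Map<string, Map<string, number>> {
  const byReplica = new Map<string, Map<string, number>>();
  for (const rec of records) {
    // Keep only messages that look Redis-related (assumed naming convention).
    if (!/redis|ECONNRESET/.test(rec.message)) continue;
    const counts = byReplica.get(rec.replicaId) ?? new Map<string, number>();
    counts.set(rec.message, (counts.get(rec.message) ?? 0) + 1);
    byReplica.set(rec.replicaId, counts);
  }
  return byReplica;
}
```

If the outer map ends up with a single key, every matching event came from one replica, which is the pattern reported here.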
What we already did
- Redeployed backend service.
- Added Redis connection observability for:
  - Distributed lock client (connect, ready, reconnecting, close, end, error)
  - BullMQ queue Redis clients (connect, ready, reconnecting, close, end, error)
- Confirmed app-side reconnection logic is active; errors persist intermittently.
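For readers who want to reproduce the connection observability listed above, here is a minimal sketch of one way to do it. It is not the code used in this service. It assumes the Redis client is an ioredis instance, which is a Node.js `EventEmitter` that emits the six lifecycle events named above. The `instrumentRedisClient` helper, the `LogFn` type, and the prefix convention are hypothetical names chosen to match the log keys shown in the evidence.

```typescript
import { EventEmitter } from "node:events";

// Connection lifecycle events listed above; ioredis clients emit these names.
const REDIS_EVENTS = ["connect", "ready", "reconnecting", "close", "end", "error"] as const;

// Hypothetical logger signature; swap in the application's own logger.
type LogFn = (event: string, detail?: string) => void;

// Attach one listener per lifecycle event and log it under a prefixed name,
// e.g. prefix "queues.redis_client" yields "queues.redis_client_error".
// Returns a function that removes the listeners again.
function instrumentRedisClient(client: EventEmitter, prefix: string, log: LogFn): () => void {
  const handlers = REDIS_EVENTS.map((event) => {
    const handler = (arg?: unknown) => {
      // Only the "error" event carries an Error; other payloads are ignored here.
      const detail = arg instanceof Error ? arg.message : undefined;
      log(`${prefix}_${event}`, detail);
    };
    client.on(event, handler);
    return [event, handler] as const;
  });
  return () => handlers.forEach(([event, handler]) => client.off(event, handler));
}
```

Because the helper only depends on `EventEmitter`, it can be exercised in tests with a plain emitter and no Redis server. Registering an `error` listener also keeps an unhandled `error` event from crashing the process.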
Request for investigation
Please investigate infrastructure/network path stability between this service replica and Redis, including:
- Any network incidents around:
  - 2026-03-19T12:59:41Z
  - 2026-03-19T13:00:07Z
- Replica/node-level anomalies for:
941496eb-57cb-43df-88e9-b6536c78d28e
- Redis-side resets, failover events, or connection interruptions.
- Potential cross-zone/region routing instability for this service.
Additional context
- We can provide the raw exported log file: logs.1773925270454.json
- We also observed similar incidents in earlier windows on the same day.
Status changed to Awaiting Railway Response Railway • 2 months ago
2 months ago
We confirmed the ECONNRESET errors in your backend logs spanning roughly 12:51 to 13:02 UTC on March 19, with the connection flapping pattern you described. The errors appear to have stopped after your redeploy, with no further ECONNRESET occurrences in the logs after that window. Both of your Redis services are currently healthy. We've noted this for our infrastructure team to review the network path stability during that time window. If the issue recurs, redeploying the Redis service(s) is a known workaround that can help restore connectivity after transient network disruptions.
Status changed to Awaiting User Response Railway • 2 months ago
2 months ago
This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!
Status changed to Solved Railway • about 2 months ago