Incident Report — Redis ECONNRESET affecting BullMQ and Lock Service
guilhermeprado001
HOBBY · OP

2 months ago

Project/Service: Psych-PROD-BACK

Environment: Production

Date: 2026-03-19

Issue Type: Intermittent Redis connectivity resets (ECONNRESET)

Summary

We are experiencing intermittent Redis connection resets in production.

After adding explicit Redis connection instrumentation in the backend, we confirmed the issue is primarily affecting the BullMQ Redis client, with secondary impact on our distributed lock Redis client.

Evidence (from application logs)

Time window analyzed: around 2026-03-19T12:59:41Z to 2026-03-19T13:00:31Z

  • Error: read ECONNRESET: 48
  • queues.redis_client_error: 13
  • queues.redis_client_reconnecting: 14
  • queues.redis_client_connect: 13
  • queues.redis_client_ready: 13
  • queues.redis_client_close: 13
  • distributed_lock.redis_error: 2
  • distributed_lock.redis_reconnecting: 2
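
For anyone wanting to reproduce a tally like the one above from an exported log file, here is a minimal sketch. The `LogEntry` shape (`timestamp`, `message`) and the `countEvents` helper are assumptions made for illustration; the real export format may use different field names.

```typescript
// Hypothetical shape of one exported log entry; adjust to the real export.
interface LogEntry {
  timestamp: string; // ISO 8601, e.g. "2026-03-19T12:59:41Z"
  message: string;   // event name, e.g. "queues.redis_client_error"
}

// Count entries per message whose timestamp falls inside [fromIso, toIso].
function countEvents(
  entries: LogEntry[],
  fromIso: string,
  toIso: string
): Map<string, number> {
  const from = Date.parse(fromIso);
  const to = Date.parse(toIso);
  const counts = new Map<string, number>();
  for (const entry of entries) {
    const t = Date.parse(entry.timestamp);
    // Skip unparseable timestamps and anything outside the window.
    if (Number.isNaN(t) || t < from || t > to) continue;
    counts.set(entry.message, (counts.get(entry.message) ?? 0) + 1);
  }
  return counts;
}
```

Calling `countEvents(entries, "2026-03-19T12:59:41Z", "2026-03-19T13:00:31Z")` would return a map keyed by event name for the window analyzed here.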

All occurrences in this sample were tied to a single replica:

  • 941496eb-57cb-43df-88e9-b6536c78d28e

Interpretation

  • The resets are not isolated to our custom Redis usage (the distributed lock client); they affect the BullMQ Redis clients directly.
  • The pattern shows repeated short disconnect/reconnect cycles (flapping).
  • Since the events are concentrated on one replica, there may be node/replica-level network instability or a routing issue between the app instance and Redis.
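
To make "flapping" concrete, one possible check is whether at least N reconnects land inside a sliding time window. The sketch below is an illustration only: `isFlapping`, the window size, and the threshold are hypothetical choices, not something our backend or Railway defines.

```typescript
// Returns true if at least `threshold` reconnect timestamps fall within
// any span of `windowMs` milliseconds. Timestamps are ISO 8601 strings.
function isFlapping(
  reconnectTimestamps: string[],
  windowMs: number,
  threshold: number
): boolean {
  const times = reconnectTimestamps
    .map((t) => Date.parse(t))
    .filter((t) => !Number.isNaN(t))
    .sort((a, b) => a - b);

  let start = 0;
  for (let end = 0; end < times.length; end++) {
    // Shrink the window from the left until it spans at most windowMs.
    while (times[end] - times[start] > windowMs) start++;
    if (end - start + 1 >= threshold) return true;
  }
  return false;
}
```

For example, three `reconnecting` events within 30 seconds would count as flapping with `windowMs = 30000` and `threshold = 3`; the right values depend on how often a healthy client is expected to reconnect.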

What we already did

  • Redeployed backend service.
  • Added Redis connection observability for:
    • Distributed lock client (connect, ready, reconnecting, close, end, error)
    • BullMQ queue Redis clients (connect, ready, reconnecting, close, end, error)
  • Confirmed app-side reconnection logic is active; errors persist intermittently.
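
The observability we added can be sketched roughly as follows. This is not our exact code: `instrumentRedisClient`, `LogFn`, and the prefix argument are hypothetical names. It relies only on the client being a Node.js `EventEmitter`, which ioredis clients are, and on the six lifecycle events listed above.

```typescript
import { EventEmitter } from "node:events";

// Lifecycle events we listen for on each Redis client.
const LIFECYCLE_EVENTS = [
  "connect",
  "ready",
  "reconnecting",
  "close",
  "end",
  "error",
] as const;

// Hypothetical logger signature; swap in the application's real logger.
type LogFn = (event: string, detail?: string) => void;

// Attach one listener per lifecycle event and log it as `<prefix>_<event>`,
// e.g. "queues.redis_client_error" or "distributed_lock.redis_error".
function instrumentRedisClient(
  client: EventEmitter,
  prefix: string,
  log: LogFn
): void {
  for (const name of LIFECYCLE_EVENTS) {
    client.on(name, (arg?: unknown) => {
      // Only the "error" event is expected to carry an Error object.
      const detail = arg instanceof Error ? arg.message : undefined;
      log(`${prefix}_${name}`, detail);
    });
  }
}
```

With an ioredis client this would be called as `instrumentRedisClient(client, "queues.redis_client", console.log)`. Registering an `error` listener also keeps an unhandled `error` event from crashing the process.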

Request for investigation

Please investigate infrastructure/network path stability between this service replica and Redis, including:

  1. Any network incidents around:
    • 2026-03-19T12:59:41Z
    • 2026-03-19T13:00:07Z
  2. Replica/node-level anomalies for:
    • 941496eb-57cb-43df-88e9-b6536c78d28e
  3. Redis-side resets, failover events, or connection interruptions.
  4. Potential cross-zone/region routing instability for this service.

Additional context

  • We can provide the raw exported log file: logs.1773925270454.json
  • We also observed similar incidents in earlier time windows on the same day.

Status changed to Awaiting Railway Response (Railway, 2 months ago)


sam-a
EMPLOYEE

2 months ago

We confirmed the ECONNRESET errors in your backend logs spanning roughly 12:51 to 13:02 UTC on March 19, with the connection flapping pattern you described. The errors appear to have stopped after your redeploy, with no further ECONNRESET occurrences in the logs after that window. Both of your Redis services are currently healthy. We've noted this for our infrastructure team to review the network path stability during that time window. If the issue recurs, redeploying the Redis service(s) is a known workaround that can help restore connectivity after transient network disruptions.


Status changed to Awaiting User Response (Railway, 2 months ago)


Railway
BOT

2 months ago

This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!

Status changed to Solved (Railway, about 2 months ago)

