Incident Report — Redis ECONNRESET affecting BullMQ and Lock Service

**Project/Service:** Psych-PROD-BACK  
**Environment:** Production  
**Date:** 2026-03-19  
**Issue Type:** Intermittent Redis connectivity resets (`ECONNRESET`)

### **Summary**

We are experiencing intermittent Redis connection resets in production.  
After adding explicit Redis connection instrumentation in the backend, we confirmed the issue is primarily affecting the BullMQ Redis client, with secondary impact on our distributed lock Redis client.

### **Evidence (from application logs)**

Time window analyzed: around `2026-03-19T12:59:41Z` to `2026-03-19T13:00:31Z`

* `Error: read ECONNRESET`: **48**
* `queues.redis_client_error`: **13**
* `queues.redis_client_reconnecting`: **14**
* `queues.redis_client_connect`: **13**
* `queues.redis_client_ready`: **13**
* `queues.redis_client_close`: **13**
* `distributed_lock.redis_error`: **2**
* `distributed_lock.redis_reconnecting`: **2**

All occurrences in this sample were tied to replica:

* `941496eb-57cb-43df-88e9-b6536c78d28e`
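Counts like the ones above can be reproduced from the exported log with a small tally per replica and event name. The following is only a sketch: the `LogEntry` shape and its `message` and `replicaId` fields are assumptions for illustration, and the real export may use different field names.

```typescript
// Hypothetical log entry shape; adjust the field names to match the real export.
interface LogEntry {
  message: string;    // assumed to hold the event name, e.g. "queues.redis_client_error"
  replicaId?: string; // assumed to hold the replica identifier
}

// Count entries per replica, then per event name within that replica.
function tallyEvents(entries: LogEntry[]): Map<string, Map<string, number>> {
  const byReplica = new Map<string, Map<string, number>>();
  for (const entry of entries) {
    const replica = entry.replicaId ?? "unknown";
    const counts = byReplica.get(replica) ?? new Map<string, number>();
    counts.set(entry.message, (counts.get(entry.message) ?? 0) + 1);
    byReplica.set(replica, counts);
  }
  return byReplica;
}
```

If every event lands under one replica key, that supports the single-replica observation; events spread across several keys would point away from a node-local cause.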

### **Interpretation**

* The resets are not isolated to our custom Redis usage (the distributed lock client); they also affect the BullMQ Redis clients directly.
* The pattern shows repeated short disconnect/reconnect cycles (flapping).
* Because the events are concentrated on one replica, the likely cause is node- or replica-level network instability, or a routing issue between that app instance and Redis.
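To put a number on "short" in the flapping pattern, each `close` event can be paired with the next `ready` event to measure the gap between them. This is a rough sketch under stated assumptions: the `ConnEvent` shape and its `at` field are hypothetical, and the input is assumed to hold events from a single connection. BullMQ opens several connections, so interleaved events would first need to be separated per connection.

```typescript
// Hypothetical event shape: an ISO 8601 timestamp plus the lifecycle event name.
interface ConnEvent {
  at: string; // assumed field name for the timestamp
  event: "connect" | "ready" | "reconnecting" | "close" | "end" | "error";
}

// Pair each "close" with the next "ready" and return each gap in milliseconds.
function disconnectGapsMs(events: ConnEvent[]): number[] {
  const sorted = [...events].sort((a, b) => Date.parse(a.at) - Date.parse(b.at));
  const gaps: number[] = [];
  let closedAt: number | null = null;
  for (const e of sorted) {
    const t = Date.parse(e.at);
    if (e.event === "close" && closedAt === null) {
      closedAt = t; // start of an outage
    } else if (e.event === "ready" && closedAt !== null) {
      gaps.push(t - closedAt); // outage ended when the client became ready again
      closedAt = null;
    }
  }
  return gaps;
}
```

Many small gaps in a short window would be consistent with flapping, while a single long gap would look more like a failover or restart.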

### **What we already did**

* Redeployed backend service.
* Added Redis connection observability for:  
   * Distributed lock client (`connect`, `ready`, `reconnecting`, `close`, `end`, `error`)  
   * BullMQ queue Redis clients (`connect`, `ready`, `reconnecting`, `close`, `end`, `error`)
* Confirmed app-side reconnection logic is active; errors persist intermittently.
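For reference, the instrumentation is along the lines of the sketch below. It is a simplified illustration, not our exact code: `attachRedisObservability`, `Emitter`, and `LogFn` are hypothetical names, and the client is typed as a minimal event-emitter interface so the snippet stays self-contained. In practice the client would be the Redis connection object, which emits these lifecycle events.

```typescript
// Minimal emitter interface so the sketch does not depend on a Redis library.
interface Emitter {
  on(event: string, listener: (...args: unknown[]) => void): unknown;
}

// Hypothetical logger signature: event name plus optional detail.
type LogFn = (name: string, detail?: unknown) => void;

const LIFECYCLE_EVENTS = ["connect", "ready", "reconnecting", "close", "end", "error"] as const;

// Log every lifecycle event as `${prefix}_${event}`,
// e.g. "queues.redis_client_error" or "distributed_lock.redis_reconnecting".
function attachRedisObservability(client: Emitter, prefix: string, log: LogFn): void {
  for (const event of LIFECYCLE_EVENTS) {
    client.on(event, (...args: unknown[]) => {
      const detail = args[0];
      // For "error" the first argument is an Error; log only its message.
      log(`${prefix}_${event}`, detail instanceof Error ? detail.message : detail);
    });
  }
}
```

Listening for `error` explicitly matters on its own: an unhandled `error` event on a Node.js emitter is thrown, so the listener also keeps a connection reset from crashing the process.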

### **Request for investigation**

Please investigate infrastructure/network path stability between this service replica and Redis, including:

1. Any network incidents around:  
   * `2026-03-19T12:59:41Z`  
   * `2026-03-19T13:00:07Z`
2. Replica/node-level anomalies for:  
   * `941496eb-57cb-43df-88e9-b6536c78d28e`
3. Redis-side resets, failover events, or connection interruptions.
4. Potential cross-zone/region routing instability for this service.

### **Additional context**

* We can provide the raw exported log file: `logs.1773925270454.json`
* We observed similar incidents in earlier time windows on the same day.