21 days ago
Hi, I received this email yesterday about an outage affecting the Redis broker connection for my Celery worker. The email said no action was needed on my part, so I ignored it. However, I woke up today to many clients telling me our Celery tasks weren't being processed correctly. I've confirmed the connection was lost, and I had to redeploy both services to fix it.
How can we avoid this happening again in the future?
6 Replies
I've opened several tickets about similar issues before and I've never received a response on what I can do to mitigate these problems with my Celery worker. The last two times it was because the container was killed but still showed as deployed; other people confirmed it's happened to them too.
21 days ago
Unfortunately, those issues are on Railway's side, and nothing can be done on your end to prevent them from happening. What you can do is ensure that your Celery worker has reconnection mechanisms in place for its Redis connection. My guess is that your Celery worker loses its connection when Redis restarts and doesn't reconnect automatically.
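As a rough starting point, reconnection behaviour can be tuned through Celery's broker settings. The sketch below only builds a settings dict; the helper name `celery_redis_reconnect_settings` and the timeout values are made up for illustration, and the option names should be checked against the Celery and kombu versions you actually run.

```python
def celery_redis_reconnect_settings(max_retries=None):
    """Return Celery settings that favour reconnecting to a Redis broker.

    Hypothetical helper: the values below are illustrative, not tuned.
    """
    return {
        # Retry the broker connection if it drops while the worker is running.
        "broker_connection_retry": True,
        # Also retry if the broker is unreachable when the worker starts.
        "broker_connection_retry_on_startup": True,
        # None means keep retrying instead of giving up after a fixed count.
        "broker_connection_max_retries": max_retries,
        # Options passed through to the Redis transport.
        "broker_transport_options": {
            "socket_keepalive": True,       # detect dead TCP connections
            "socket_connect_timeout": 10,   # seconds to wait when connecting
            "health_check_interval": 30,    # seconds between connection checks
            "retry_on_timeout": True,       # retry commands that time out
        },
    }
```

Assuming `app` is your existing Celery application object, this could be applied with `app.conf.update(celery_redis_reconnect_settings())` before the worker starts.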
https://discord.com/channels/713503345364697088/1434539248840867972
https://discord.com/channels/713503345364697088/1486345271322607787
Alright, I mean, I can obviously add reconnection handling, but I've linked two threads here to show that our Celery worker continues to experience issues regardless of what the actual cause is. It seems in every thread I'm just being told I can't do anything about it apart from pray Railway doesn't go down...
So it's just not ideal at all. If you have any advice, I'd appreciate it; I just want to avoid these issues, as having our Celery worker down impacts our most important integration.
21 days ago
Those two threads don't look like a Railway issue, but I would love to be proven wrong 🙂
21 days ago
You could try using replicas. Spin up one or two replicas of your worker service to ensure that if one goes down, it doesn't affect the others. This approach, combined with proper reconnection handling, should improve reliability.
For Redis, there's the option of a High Availability (HA) Cluster, where if one Redis instance fails, the others can still process the work. https://railway.com/deploy/redis-ha-w-cluster-mode-sharded-33
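For the reconnection-handling part, one option is a small startup or liveness check that retries the broker with exponential backoff and fails loudly if it never comes back. This is a minimal sketch, not Railway or Celery API: `wait_for_broker` and its parameters are hypothetical names, and `ping` is any callable you supply that raises when the broker is unreachable.

```python
import time


def wait_for_broker(ping, max_attempts=5, base_delay=1.0, max_delay=30.0,
                    sleep=time.sleep):
    """Call ping() until it succeeds, backing off exponentially between tries.

    Returns the number of attempts used. Re-raises the last error if every
    attempt fails, so the calling process can exit non-zero.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            ping()
            return attempt
        except Exception as exc:  # broad on purpose: any failure counts as "down"
            last_error = exc
            if attempt < max_attempts:
                # Delays grow 1s, 2s, 4s, ... capped at max_delay.
                sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
    raise last_error
```

With redis-py installed, `ping` could be something like `redis.Redis.from_url(broker_url).ping`, where `broker_url` is your own connection string. If the check raises, letting the process exit with a non-zero status gives a restart policy the chance to replace the worker instead of leaving it marked as deployed but idle.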