7 months ago
Hi People,
We're running a Python application on Railway that consists of a main web service (FastAPI) and a separate service for background tasks using Celery workers. These workers handle tasks that can sometimes be long-running (e.g., 1-5 minutes), as they involve processing data and making calls to external APIs. We are using Redis as the message broker.
Our goal is to achieve a graceful, zero-downtime deployment AND no hanging/breaking executions. After doing some research, we believe the solution lies in using two specific environment variables: RAILWAY_DEPLOYMENT_OVERLAP_SECONDS and RAILWAY_DEPLOYMENT_DRAINING_SECONDS.
Our understanding is that these variables work sequentially during a deployment:
1. Overlap period (`RAILWAY_DEPLOYMENT_OVERLAP_SECONDS`): the new deployment starts and runs alongside the old one for this duration. This allows our new Celery workers to initialize and start pulling tasks from the queue.
2. Draining period (`RAILWAY_DEPLOYMENT_DRAINING_SECONDS`): after the overlap period ends, SIGTERM is sent to the old deployment. This variable then defines the grace period for our old workers to finish their active tasks before SIGKILL is sent.
Proposed Configuration Example:
Assuming our longest-running task takes about 3 minutes (180 seconds), we are planning to configure our service with the following values:
RAILWAY_DEPLOYMENT_OVERLAP_SECONDS=30
RAILWAY_DEPLOYMENT_DRAINING_SECONDS=210 (180s task time + 30s buffer)
Our Questions:
1. Is our understanding of how OVERLAP and DRAINING work sequentially correct?
2. Does our proposed configuration seem like a sound strategy for achieving a graceful shutdown for our Celery workers?
3. As a safety net, we are also setting task_acks_late = True and task_reject_on_worker_lost = True in our Celery configuration. Do you see any potential conflicts or issues with using this alongside Railway's deployment process?
4. Are there any other Railway-specific best practices we might be missing for this kind of background worker setup?
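For context, here is a minimal sketch of the Celery safety-net settings we mean (module-style configuration loaded via `app.config_from_object()`; the broker URL is a placeholder, as on Railway it comes from the Redis service's variables):

```python
# celeryconfig.py -- sketch of the settings described above.
# The broker URL below is a placeholder for local testing.
broker_url = "redis://localhost:6379/0"

# Safety net: don't acknowledge a task until it finishes, and requeue
# tasks whose worker died mid-execution.
task_acks_late = True
task_reject_on_worker_lost = True

# Fetch one task at a time so a terminating worker isn't holding
# prefetched tasks it will never get to run.
worker_prefetch_multiplier = 1
```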
Thanks in advance for any insights or advice
2 Replies
7 months ago
Hi,
From your description, I suspect the core issue may not be in your choice of RAILWAY_DEPLOYMENT_OVERLAP_SECONDS and RAILWAY_DEPLOYMENT_DRAINING_SECONDS values, but rather in how Railway’s deployment process interacts with Celery workers.
A few important points:
1. Overlap + draining aren't a guaranteed "task-safe" sequence. While `RAILWAY_DEPLOYMENT_OVERLAP_SECONDS` does run both old and new deployments in parallel, once that time ends, Railway sends SIGTERM to the old deployment. `RAILWAY_DEPLOYMENT_DRAINING_SECONDS` simply defines how long the process is allowed to stay alive after receiving SIGTERM before it gets a hard SIGKILL. If your Celery workers aren't explicitly trapping and handling SIGTERM (e.g., via the `worker_shutdown` signal), long-running tasks can still be terminated early.
2. Both old and new workers may pull from the queue during overlap. During the overlap period, both sets of workers can consume tasks from Redis. This can cause race conditions, and the old workers might start new tasks even though they're about to be terminated.
3. `task_acks_late` + `task_reject_on_worker_lost`: this is good for requeuing tasks, but if a worker is killed mid-task, another worker may immediately pick it up, leading to possible duplicate external API calls if the task isn't fully idempotent.
4. Railway's kill timer is absolute. Even if `RAILWAY_DEPLOYMENT_DRAINING_SECONDS` is long enough for most of your tasks, any task exceeding this time will still be killed. There's no "wait until tasks finish" safeguard; it's a hard stop.
Recommendation:
- Use Celery's `worker_shutdown` hook to stop consuming new tasks and let current tasks finish.
- Consider splitting your worker deployment from your web service so workers aren't restarted on every web deployment.
- Optionally use a dedicated "draining" queue during deployments so only the new workers pick up new jobs.
- Ensure `task_time_limit` is set so no task exceeds your `RAILWAY_DEPLOYMENT_DRAINING_SECONDS`.
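For the dedicated-queue idea plus the time limit, one hedged sketch (module-style Celery config; `RELEASE_ID` is an assumed per-deployment variable, and Railway's `RAILWAY_DEPLOYMENT_ID` could serve the same purpose):

```python
# Sketch: route tasks to a release-specific queue so only workers from the
# current deployment consume newly enqueued jobs. RELEASE_ID is an assumption;
# any value that changes per deployment works.
import os

release = os.environ.get("RELEASE_ID", "dev")
task_default_queue = f"tasks-{release}"

# Hard cap so no task can outlive the 210 s draining window in the
# proposed config, with a lower soft limit to allow cleanup first.
task_time_limit = 180
task_soft_time_limit = 170
```

Old workers keep draining their own release's queue, while producers from the new deployment publish to the new queue, so the two deployments never race for the same jobs during overlap.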
In short: Railway will overlap and drain, but it will not guarantee task completion. To truly achieve graceful Celery shutdowns, your worker code needs to explicitly handle shutdown signals and control task consumption during deployments.
Hope this clarifies the behavior and helps you adjust your strategy.