7 months ago
Hi People,
We're running a Python application on Railway that consists of a main web service (FastAPI) and a separate service for background tasks using Celery workers. These workers handle tasks that can sometimes be long-running (e.g., 1-5 minutes), as they involve processing data and making calls to external APIs. We are using Redis as the message broker.
Our goal is to achieve a graceful, zero-downtime deployment AND no hanging/breaking executions. After doing some research, we believe the solution lies in using two specific environment variables: RAILWAY_DEPLOYMENT_OVERLAP_SECONDS and RAILWAY_DEPLOYMENT_DRAINING_SECONDS.
Our understanding is that these variables work sequentially during a deployment:
1. Overlap period (`RAILWAY_DEPLOYMENT_OVERLAP_SECONDS`): the new deployment starts and runs alongside the old one for this duration. This allows our new Celery workers to initialize and start pulling tasks from the queue.
2. Draining period (`RAILWAY_DEPLOYMENT_DRAINING_SECONDS`): after the overlap period ends, SIGTERM is sent to the old deployment. This variable then defines the grace period for our old workers to finish their active tasks before SIGKILL is sent.
Proposed Configuration Example:
Assuming our longest-running task takes about 3 minutes (180 seconds), we are planning to configure our service with the following values:
RAILWAY_DEPLOYMENT_OVERLAP_SECONDS=30
RAILWAY_DEPLOYMENT_DRAINING_SECONDS=210 (180s task time + 30s buffer)
Our Questions:
1. Is our understanding of how OVERLAP and DRAINING work sequentially correct?
2. Does our proposed configuration seem like a sound strategy for achieving a graceful shutdown for our Celery workers?
3. As a safety net, we are also setting task_acks_late = True and task_reject_on_worker_lost = True in our Celery configuration. Do you see any potential conflicts or issues with using this alongside Railway's deployment process?
4. Are there any other Railway-specific best practices we might be missing for this kind of background worker setup?
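For context, here is a minimal sketch of the Celery safety-net settings we mean (module-style configuration loaded via `app.config_from_object()`; the broker URL is a placeholder, as on Railway it comes from the Redis service's variables):

```python
# celeryconfig.py -- sketch of the settings described above.
# The broker URL below is a placeholder for local testing.
broker_url = "redis://localhost:6379/0"

# Safety net: don't acknowledge a task until it finishes, and requeue
# tasks whose worker died mid-execution.
task_acks_late = True
task_reject_on_worker_lost = True

# Fetch one task at a time so a terminating worker isn't holding
# prefetched tasks it will never get to run.
worker_prefetch_multiplier = 1
```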
Thanks in advance for any insights or advice
2 Replies
7 months ago
Hi,
From your description, I suspect the core issue may not be in your choice of RAILWAY_DEPLOYMENT_OVERLAP_SECONDS and RAILWAY_DEPLOYMENT_DRAINING_SECONDS values, but rather in how Railway’s deployment process interacts with Celery workers.
A few important points:
1. Overlap + draining aren't a guaranteed "task-safe" sequence. While `RAILWAY_DEPLOYMENT_OVERLAP_SECONDS` does run both old and new deployments in parallel, once that time ends, Railway sends SIGTERM to the old deployment. `RAILWAY_DEPLOYMENT_DRAINING_SECONDS` simply defines how long the process is allowed to stay alive after receiving SIGTERM before it gets a hard SIGKILL. If your Celery workers aren't explicitly trapping and handling SIGTERM (e.g., via the `worker_shutdown` signal), long-running tasks can still be terminated early.
2. Both old and new workers may pull from the queue during overlap. During the overlap period, both sets of workers can consume tasks from Redis. This can cause race conditions, and the old workers might start new tasks even though they're about to be terminated.
3. `task_acks_late` + `task_reject_on_worker_lost`: this is good for requeuing tasks, but if a worker is killed mid-task, another worker may immediately pick it up, leading to possible duplicate external API calls if the task isn't fully idempotent.
4. Railway's kill timer is absolute. Even if `RAILWAY_DEPLOYMENT_DRAINING_SECONDS` is long enough for most of your tasks, any task exceeding this time will still be killed. There's no "wait until tasks finish" safeguard; it's a hard stop.
Recommendation:
- Use Celery's `worker_shutdown` hook to stop consuming new tasks and let current tasks finish.
- Consider splitting your worker deployment from your web service so workers aren't restarted on every web deployment.
- Optionally use a dedicated "draining" queue during deployments so only the new workers pick up new jobs.
- Ensure `task_time_limit` is set so no task exceeds your `RAILWAY_DEPLOYMENT_DRAINING_SECONDS`.
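For the dedicated-queue idea plus the time limit, one hedged sketch (module-style Celery config; `RELEASE_ID` is an assumed per-deployment variable, and Railway's `RAILWAY_DEPLOYMENT_ID` could serve the same purpose):

```python
# Sketch: route tasks to a release-specific queue so only workers from the
# current deployment consume newly enqueued jobs. RELEASE_ID is an assumption;
# any value that changes per deployment works.
import os

release = os.environ.get("RELEASE_ID", "dev")
task_default_queue = f"tasks-{release}"

# Hard cap so no task can outlive the 210 s draining window in the
# proposed config, with a lower soft limit to allow cleanup first.
task_time_limit = 180
task_soft_time_limit = 170
```

Old workers keep draining their own release's queue, while producers from the new deployment publish to the new queue, so the two deployments never race for the same jobs during overlap.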
In short: Railway will overlap and drain, but it will not guarantee task completion. To truly achieve graceful Celery shutdowns, your worker code needs to explicitly handle shutdown signals and control task consumption during deployments.
Hope this clarifies the behavior and helps you adjust your strategy.