Services not restarting automatically after graceful shutdown - Production environment

orkin

PRO · OP

2 months ago

Hello,

I'm experiencing an issue with two consumer services in my production environment that stopped restarting automatically after graceful shutdowns.

Services affected:

consumer async (ID: 6734e3bc-ef33-45b5-bc6e-4a66f61d7c9f)
consumer policy (ID: 04a8793d-d12a-441d-a28c-cc33ecbcf34b)

Timeline:

consumer async: Last restart at 2026-05-04 22:10:58 UTC, then no logs until 2026-05-06 06:44:37 UTC (32-hour gap)
consumer policy: Last restart at 2026-05-05 16:13:42 UTC, then no logs until 2026-05-06 06:44:52 UTC (14-hour gap)

Configuration: Both services are configured with:

restartPolicyType: ALWAYS
startCommand: /usr/local/bin/docker-entrypoint php bin/console messenger:consume --limit=1000 --time-limit=1800 --memory-limit=512M -vv [async|async_policy]
Region: europe-west4-drams3a
1 replica each

The issue: The services are stopping gracefully due to the --time-limit=1800 parameter (30 minutes), which is expected behavior. However, they are not restarting automatically despite having ALWAYS restart policy configured.

Important note: The exact same configuration works perfectly in my development environment, where the services restart every 30 minutes as expected. There have been no code or configuration changes between the two environments, and no deployments occurred during the gap period.

Expected behavior: Services should restart automatically after graceful shutdown (exit code 0) with ALWAYS restart policy.

Could you investigate why the automatic restart is not working in production while it works correctly in development?

Alternatively, any advice on resolving this issue would be welcome. The restart every 30 minutes is intentional: long-running PHP workers sometimes leak memory, so we recycle them periodically.

Thank you

$30 Bounty

3 Replies

Status changed to Awaiting Railway Response Railway • 2 months ago

orkin

PRO · OP

2 months ago

Hi team,

Following up with additional data I gathered while investigating on my side. The findings strongly suggest this is not a configuration or workload issue, but rather a silent failure of the restart supervisor itself.

Deployment context

Both services were running on a single SUCCESS deployment, deployed on 2026-05-01 at 14:03 UTC and never replaced:

- consumer async: deployment 3d142cb7-6725-4982-9934-8d0fd7156379

- consumer policy: deployment 9ccc5119-79bd-4185-b5d9-15e433e59d99

The deployment status remained SUCCESS for both services throughout the gap. There was no CRASHED state, no failed deployment, and no automatic redeploy attempt. From the API/dashboard perspective, everything looked normal.

Restart mechanism observed

Each messenger:consume graceful exit triggers a full container reboot — the entrypoint runs from scratch every cycle (secrets dump, cache warmup, DB readiness check, Symfony 8.0.8 (env: prod), PHP app ready!, then messenger:consume). I confirmed this in the logs: the pattern repeats reliably every ~30 minutes for several days, exactly as expected with restartPolicyType: ALWAYS.

Timeline of the gap (consumer async)

2026-05-04T21:41:01.614Z PHP app ready! (last normal restart)

2026-05-04T22:11:01.797Z Stopping worker due to time limit of 1800s exceeded

[GAP — 32 hours, zero log lines, no restart attempt]

2026-05-06T06:44:37.586Z PHP app ready! (manual redeploy by me)

Same pattern on consumer policy (gap of 14 hours from 2026-05-05T16:13Z to 2026-05-06T06:44Z).

The critical observation: during the gap, there is not a single log line on either service — no failed boot, no error, no backoff/retry message, no health check failure. The supervisor simply stopped issuing restart commands.
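For anyone who wants to check an exported log for the same kind of silent gap, here is a minimal sketch. It assumes each log line starts with an ISO-8601 UTC timestamp like the ones above and that a GNU or BusyBox `date` is available. The `find_gaps` helper name and the 45-minute threshold (2700 seconds) are my own choices for illustration, not anything Railway provides.

```shell
#!/bin/sh
# Report gaps longer than a threshold (in seconds) between consecutive log lines.
# Assumption: every line begins with a timestamp such as 2026-05-04T21:41:01.614Z.
find_gaps() {
  threshold=$1
  prev_epoch=""
  prev_ts=""
  while read -r ts _rest; do
    # Normalize to "YYYY-MM-DD hh:mm:ss" so both GNU and BusyBox date can parse it.
    clean=$(printf '%s' "$ts" | sed 's/T/ /; s/\.[0-9]*Z$//; s/Z$//')
    epoch=$(date -u -d "$clean" +%s) || continue
    if [ -n "$prev_epoch" ] && [ $((epoch - prev_epoch)) -gt "$threshold" ]; then
      echo "gap of $(( (epoch - prev_epoch) / 3600 ))h between $prev_ts and $ts"
    fi
    prev_epoch=$epoch
    prev_ts=$ts
  done
}

# Sample lines taken from the consumer async timeline above.
# In practice, pipe an exported log file into find_gaps instead.
gaps=$(printf '%s\n' \
  '2026-05-04T21:41:01.614Z PHP app ready!' \
  '2026-05-04T22:11:01.797Z Stopping worker due to time limit of 1800s exceeded' \
  '2026-05-06T06:44:37.586Z PHP app ready!' | find_gaps 2700)
echo "$gaps"
```

The normal 30-minute cycle stays under the threshold, so only the long silence is reported.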

Cycle count before the silent failure

By the time the supervisor stopped, each service had cycled cleanly many times on the same deployment:

consumer async: ~160 successful container restarts (~80h × 2/h)
consumer policy: ~196 successful container restarts (~98h × 2/h)
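The estimates above come from simple arithmetic: hours elapsed between the deployment time and the last worker exit, multiplied by two restarts per hour. A small sketch to reproduce them, assuming a GNU or BusyBox `date` (the `hours_between` helper is hypothetical, and the counts are approximations, not values read from Railway):

```shell
#!/bin/sh
# Whole hours between two UTC timestamps given as "YYYY-MM-DD hh:mm:ss".
hours_between() {
  start=$(date -u -d "$1" +%s)
  end=$(date -u -d "$2" +%s)
  echo $(( (end - start) / 3600 ))
}

# Deployment time and last worker exits as reported in this thread.
async_hours=$(hours_between '2026-05-01 14:03:00' '2026-05-04 22:11:01')
policy_hours=$(hours_between '2026-05-01 14:03:00' '2026-05-05 16:13:42')

# One restart every 30 minutes means two restarts per hour.
async_restarts=$((async_hours * 2))
policy_restarts=$((policy_hours * 2))

echo "consumer async: ~${async_restarts} restarts over ${async_hours}h"
echo "consumer policy: ~${policy_restarts} restarts over ${policy_hours}h"
```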

Hypothesis: restart backoff / cycle limit

This pattern is consistent with an internal restart-frequency throttle or cumulative cycle limit that triggers silently after a high number of rapid graceful exits, rather than with a crash loop detector (which would require non-zero exits and would log a backoff). Specifically:

Is there an undocumented max restart count per deployment or replica after which the supervisor stops scheduling restarts?
Is there a rate-limit on container restarts (e.g. N starts per hour over a sliding window) that, once tripped, requires an external trigger (manual redeploy) to reset?
Could this be related to instance rotation in V2 runtime — where a new underlying instance is supposed to take over and silently fails to receive the start command?
Is the ALWAYS policy strictly tied to the lifetime of a specific replica/instance, such that an internal instance migration breaks the restart loop without surfacing any event?

I'd specifically like to rule out a known/silent quota:

If there is a max-restart-frequency safeguard, please document it (and ideally surface it as a deployment event/log when it triggers).
If our cycle frequency (one restart every 30 minutes per service, sustained) is considered abusive and triggers a silent throttle, that's something we need to know. Currently it appears as a complete outage with no signal.

Why the dev/prod difference

Production has been running this deployment uninterrupted since 2026-05-01, accumulating restart cycles. Dev is redeployed much more frequently, which resets the deployment-scoped counter (if such a thing exists) and would explain why the issue never reproduces there.

Workaround on our side

In parallel, we could update the start command on both services to wrap the worker in a shell loop, so the container process itself does not exit on --time-limit. The periodic worker recycling would then happen inside the container, which removes our dependency on Railway's restart supervisor for the normal cycle:

sh -c 'while true; do /usr/local/bin/docker-entrypoint php bin/console messenger:consume --limit=1000 --time-limit=1800 --memory-limit=512M -vv async; done'

(async_policy for the policy consumer instead of async.)

With this change, restartPolicyType: ALWAYS becomes a safety net for actual crashes (non-zero exits) rather than the primary recycling mechanism, and the per-day container restart count drops from ~48 to effectively zero in steady state, which should also avoid whatever threshold we appear to have been hitting. Is this something that could work?

Happy to provide additional log exports or run any diagnostic you need.

Thanks!

Railway

BOT

2 months ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open Railway • 2 months ago

orkin

PRO · OP

2 months ago

Hi, it could be interesting to do it like this, but it's not the native way to do it with Symfony Messenger. Restarting through the start command lets the service stop without losing any messages; it's a clean restart. I do not have memory leaks right now, and my problem is due to the exponential backoff on automatic restarts every 30 minutes. The start command also makes it possible to monitor memory usage and stop the service before a memory crash, like the template does. The template probably handles this use case better.

jeggob

FREE

2 months ago

Yes, I think your shell-loop workaround is the right direction, but I would not use the raw one-liner exactly as written.

Symfony's own Messenger docs assume there is a process manager above messenger:consume: the worker is allowed to exit because of --limit, --memory-limit, --time-limit, or messenger:stop-workers, and then Supervisor/systemd starts a fresh worker process.

On Railway, using the Railway deployment restart supervisor as that high-frequency process manager means every normal 30-minute recycle becomes a full container exit + platform restart. Your data strongly suggests that this works for a while but is not a great steady-state mechanism.

I would make the container stay alive and recycle only the Symfony worker inside it. Use Railway's ALWAYS restart policy as the safety net for the wrapper process, not for every planned worker recycle.

Something like this is safer than a bare while true:

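```bash
sh -lc '
set -eu
child=""
# Forward TERM/INT to the worker, wait for it to finish, then exit cleanly.
# A function is used so that $child is read when the signal arrives,
# not when the trap is installed.
stop() {
  if [ -n "$child" ]; then
    kill -TERM "$child" 2>/dev/null || true
    wait "$child" 2>/dev/null || true
  fi
  exit 0
}
trap stop TERM INT
while true; do
  /usr/local/bin/docker-entrypoint php bin/console messenger:consume async \
    --limit=1000 \
    --time-limit=1800 \
    --memory-limit=512M \
    --failure-limit=3 \
    -vv &
  child=$!
  # Capture the exit code without tripping "set -e" on a non-zero exit.
  code=0
  wait "$child" || code=$?
  # Symfony exits 0 for planned recycle conditions like time/message limits.
  # Restart those inside the same container.
  if [ "$code" -eq 0 ]; then
    sleep 2
    continue
  fi
  # Real worker failure: let the container exit non-zero so Railway can restart it.
  exit "$code"
done
'
```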

Use the same command for async_policy on the policy consumer.

This preserves the behavior you want:

- graceful Symfony recycle every 30 minutes

- no normal container exit every 30 minutes

- no accumulation of hundreds of platform-level restarts per deployment

- Railway ALWAYS still handles the wrapper/container if it actually dies

- non-zero worker failures still bubble up instead of being hidden forever

If you expect Railway to send termination signals during deploys or host migration, I would also set a non-zero RAILWAY_DEPLOYMENT_DRAINING_SECONDS so the worker has time to finish/stop cleanly. The important part is that the wrapper traps TERM and INT and forwards them to the child worker; otherwise the shell can become PID 1 and your worker may not get the stop signal cleanly.
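If you want to see the signal forwarding work before deploying, here is a small local sketch. It is not the production command: `sleep 30` stands in for messenger:consume, and `run_wrapper` / `on_term` are hypothetical names. It only assumes a POSIX shell.

```shell
#!/bin/sh
# Stand-in for the worker wrapper, so the trap logic can be tried locally
# without Symfony or Railway. "sleep 30" plays the role of the worker.
run_wrapper() {
  child=""
  on_term() {
    # Forward the stop signal to the child, wait for it, then exit cleanly.
    if [ -n "$child" ]; then
      kill -TERM "$child" 2>/dev/null || true
      wait "$child" 2>/dev/null || true
    fi
    exit 0
  }
  trap on_term TERM INT
  sleep 30 &
  child=$!
  wait "$child"
}

# Start the wrapper in the background, give it a moment to install the trap,
# then send TERM the way a platform would during a deploy.
run_wrapper &
wrapper=$!
sleep 1
kill -TERM "$wrapper"
status=0
wait "$wrapper" || status=$?
echo "wrapper exit status: $status"
```

The wrapper should return almost immediately after the TERM instead of waiting out the full 30 seconds, and it exits 0 because the stop was requested rather than caused by a worker failure.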

So: yes, the workaround can work, but treat the shell loop as the worker process manager. Let Railway supervise that process manager, not each planned messenger:consume --time-limit exit.

Welcome!