Container receives SIGTERM ~35s after start on every git push deploy — even with no healthcheck

Anonymous

PRO | OP

4 months ago

Every git push deploy to my marketing-dashboard service results in SIGTERM ~35 seconds after container start. Manual restart from the Railway dashboard always works instantly and the service runs stable indefinitely until the next git push deploy.

Repo: (main, auto-deploy)

Region: US East | Plan: Pro | Replicas: 2 | Builder: Nixpacks (Metal enabled)

What happens on every git push deploy:

1. Container starts, binds to PORT 8080 in <1 second

2. All env vars present, health endpoint responds

3. ~35 seconds later, Railway sends SIGTERM

4. Both replicas killed simultaneously (not rolling)

5. Manual restart from dashboard → works perfectly every time
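One way to pin down the timing in step 3 is to have the process itself log when a signal arrives and how long it had been up. The following is only a sketch of that idea, assuming it is pasted at the very top of server.js. The names `bootedAt` and `signalLogLine` are made up for illustration and do not come from the attached code.

```javascript
// Hypothetical diagnostic for the top of server.js (names are illustrative).
const bootedAt = Date.now();

// Build one log line describing which signal arrived and the uptime at that moment.
function signalLogLine(signal, nowMs = Date.now()) {
  const uptime = ((nowMs - bootedAt) / 1000).toFixed(1);
  return `[signal] ${signal} pid=${process.pid} uptime=${uptime}s`;
}

for (const signal of ['SIGTERM', 'SIGINT']) {
  process.on(signal, () => {
    console.error(signalLogLine(signal));
    // Registering a handler replaces Node's default "exit on signal" behavior,
    // so the process has to exit explicitly. 143 and 130 follow the 128 + signal
    // number convention (SIGTERM = 15, SIGINT = 2).
    process.exit(signal === 'SIGTERM' ? 143 : 130);
  });
}
```

If a line like `[signal] SIGTERM pid=1 uptime=35.0s` shows up in the deploy logs, the new container really did receive the signal. If it never appears, the termination seen in the logs may belong to a different container.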

Critical finding: I completely removed healthcheckPath and healthcheckTimeout from railway.toml and deployed. Same behavior — SIGTERM at ~35s. This confirms the healthcheck is not the cause.

Start command: node server.js (no npm/yarn wrapper)

Health endpoint: Simple Express route, no auth middleware, no external calls — returns JSON immediately. Registered before all authenticated routes.
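For readers without the attachment, the shape described above looks roughly like the sketch below. It is not the actual code from the repo: the `/health` path, the response fields, and the `registerRoutes` / `requireAuth` names are assumptions made for illustration. Only standard Express calls (`app.get`, `app.use`, `res.status().json()`) are used.

```javascript
// Illustrative only: an Express-style health handler with no auth and no external calls.
function healthHandler(req, res) {
  res.status(200).json({ status: 'ok', uptimeSeconds: Math.round(process.uptime()) });
}

// Register the health route first so the auth middleware never sees the probe.
// `app` is an Express application, `requireAuth` is whatever auth middleware the app uses.
function registerRoutes(app, requireAuth) {
  app.get('/health', healthHandler); // assumed path, must match healthcheckPath if one is set
  app.use(requireAuth);              // everything registered after this line requires auth
}
```

Because Express runs middleware in registration order, a probe to the health path is answered before `requireAuth` can reject it.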

What I've ruled out:

- Healthcheck failing → removed entirely, same behavior

- Healthcheck timeout → increased to 300s, same behavior

- Restart policy → set to ALWAYS with 10 retries, same behavior

- No overlap → added 30s overlap + 15s draining, same behavior

- Single replica → scaled to 2, both killed simultaneously

- Builder issue → tried Metal, same behavior

- PORT mismatch → confirmed 8080 matches domain target

- npm intercepting SIGTERM → using node server.js directly

Key questions:

1. Why does Railway send SIGTERM ~35s after start with NO healthcheck configured?

2. What differs between git push deploy and manual restart?

3. Why are both replicas killed simultaneously?

See attached file for full deploy logs, railway.toml, code snippets, and failed deploy commit hashes.

Similar thread: https://station.railway.com/questions/deployment-shows-success-but-container-s-b011f12e (same symptom, but their root cause doesn't apply because our app binds in <1s)

$20 Bounty

4 Replies

brody

EMPLOYEE

4 months ago

The old container is getting SIGKILLed, not the new container.

Anonymous

PRO | OP

4 months ago

Here is the support evidence.

Attachments

railway-sup...

Anonymous

PRO | OP

4 months ago

I understand the old container is the one getting SIGKILLed — but I don't think that fully explains what I'm seeing.

Here's the current behavior after multiple deploys today:

- Container starts, and within 7 seconds it's stopped, with no application logs at all (no "Running on port" message, no startup output). The app normally binds to PORT in under 1 second.
- This happens on fresh deploymentRestart calls too (via the Railway API), not just git push deploys. There's no "old container" in that scenario; it's a clean restart.
- The code passes node --check cleanly (no syntax errors), and the health endpoint is trivial (no auth, no dependencies).
- I've now tested with and without healthcheck, with 1 and 2 replicas, with overlap/draining configured, and with different restart policies. All produce the same ~7s start-stop cycle.

The only thing that breaks the cycle is a manual restart from the Railway dashboard. After that, the exact same code runs fine for hours.

If the old container SIGKILL is expected behavior, what would cause the new container to never get a chance to run? Is there something in the deploy orchestration that could get stuck in a loop where it keeps killing containers before they can serve traffic?

akashsalan

FREE

a month ago

I think the investigation is focused on the wrong layer. The key clue is that later deploys show the container stopping within ~7 seconds with no application startup logs at all.

If there are no logs from node server.js, then the process may not actually be reaching application startup before termination.
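A cheap way to test that is to emit a marker before anything else in server.js runs. The sketch below is an assumption about how that could look, not something taken from the attached code. `startupMarker` is a made-up helper, and it writes to stderr with `fs.writeSync` so the line is written synchronously.

```javascript
// Hypothetical first lines of server.js, placed above every other require.
const fs = require('node:fs');

// One line with the facts that matter here: time, PID (1 means node is the
// container's init process), Node version, and whether PORT is set.
function startupMarker(env = process.env) {
  return [
    'BOOT',
    new Date().toISOString(),
    `pid=${process.pid}`,
    `node=${process.version}`,
    `PORT=${env.PORT ?? '(unset)'}`,
  ].join(' ');
}

// Synchronous write to stderr (fd 2) so the marker is out before anything else runs.
fs.writeSync(2, startupMarker() + '\n');

// The 'exit' event fires on process.exit() or when the event loop drains.
// It does not fire when the process is killed by a signal it has no handler for.
process.on('exit', (code) => {
  fs.writeSync(2, `EXIT code=${code} uptime=${process.uptime().toFixed(1)}s\n`);
});
```

If even the `BOOT` line is missing from a failed deploy, the node process most likely never started in that container, which would point away from the application code.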

I’d verify:

- Exit code of the container
- Whether PID 1 exits immediately
- Whether the process is receiving SIGTERM vs exiting naturally
- Whether an OOM event is being recorded
- Whether deploy orchestration is failing readiness before logs appear

Since a manual restart of the exact same image runs successfully for hours, that suggests the image itself is valid and points more toward deployment orchestration/readiness state than application code.

Can you provide:

- Container exit codes
- First 30 seconds of platform logs
- Whether Railway shows “crashed”, “unhealthy”, or “stopped”
- Output of a startup wrapper such as:

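echo "START $(date)"
node server.js
status=$?   # capture node's exit status before another command overwrites it
echo "EXIT $status $(date)"
exit "$status"   # pass the same status on to the platform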

That would help distinguish between process exit, orchestrator termination, and health/readiness failure.
