7 days ago
Project: avengers initiative (09cd849a-c0b8-4fac-803e-49a42c06220f)
Environment: production
Service: avengers (95b8e389-ea41-4e52-9e67-90b0855fb681)
Latest deploy: 16a93ad2, status Completed
Public domain: avengers-production-b54b.up.railway.app
Symptom:
- /health returns 502/000 to clients
- HTTP Logs: all requests return 499 with ~11s duration (edge holds connection, client times out)
- Deploy Logs: only "Starting Container", no further output
- Network Flow Logs: empty (zero outbound connections from the container)
- Deploy marked "Completed" but app does not serve traffic
- Persists across multiple redeploys today
What I verified is fine:
- Code: runs perfectly via "railway run" locally with production env (full boot, all services connect, ~112 MB RSS)
- Env vars: all set correctly (verified by railway run)
- Build: SUCCESS, image 246.5 MB
- Target port: explicitly set to 8080 in Networking (was None before)
- PORT env var: set to 8080
- Builder: Dockerfile (/Dockerfile, ENTRYPOINT /usr/bin/tini --, CMD python main.py)
- Start command from railway.toml: python -u scripts/apply_migrations.py && python -u main.py
- Restart policy: ON_FAILURE, max retries 10
- Same commit (22b77f89) ran successfully for 3 days (deploy bc56e94c, May 11 19:24 to May 14 14:55 EEST)
- Locally reproducing the exact start command with production env + working DB: app boots fully, listens on 8080
Timeline of break:
- 14:55 EEST today: /health returned 200 on deploy bc56e94c
- 15:06 EEST: new deploy 5eed5da3 happened (same commit, trigger unknown to me)
- 15:34 EEST onward: multiple redeploys, all show the same symptom
What is unusual:
- Completed status with no app activity. With the ON_FAILURE restart policy, a crashing process would be restarted up to 10 times. We see only one "Starting Container", so the process appears either to exit 0 cleanly with zero output or never to execute the start command.
- Earlier deploys today (89910def, fc0e0ce7, 282f0240) showed the "[migrate] all up to date" line. The latest deploy, 16a93ad2, does not even show that, just "Starting Container".
Please check internal logs for deploy 16a93ad2:
1. Container process exit status
2. Exact command tini executed
3. Any cgroup OOM kills, resource throttling, or platform-side terminations not in user logs
4. Why the migrate log line appears for some recent deploys but not the latest
App is functional (proven via railway run). The container deployment is failing silently and user-visible observability shows nothing actionable.
Thank you.
2 Replies
7 days ago
I'd try removing ENTRYPOINT from your Dockerfile. Also, I'd recommend choosing one of the two: either CMD in the Dockerfile or the start command in your config-as-code file (railway.toml), not both. Not 100% sure, but they may be fighting each other.
Another note: if your service is truly online, it should say "Online" instead of "Completed."
7 days ago
▎ The clue that jumps out most from your post is that earlier deploys today showed migration logs and the latest one doesn't:
▎ same commit, same Dockerfile, no code change between them. That points at Railway-side config drift (start command / env
▎ vars) more than the image itself.
▎
▎ Exit 0 + "Completed" status + no log output + no outbound traffic is the signature of a start command that silently no-ops,
▎ which can happen in a handful of ways. Worth checking in this order:
▎
▎ 1. Custom Start Command in the dashboard. Settings → Deploy → Custom Start Command. Open an older successful deploy from
▎ the history and compare its config to the current one. silent-fail patterns:
▎ - Field got cleared → the container should fall back to the Dockerfile CMD (python main.py: no migration step, no -u),
▎ which would also fit the missing [migrate] line
▎ - Backgrounded process (python main.py &) → shell returns 0, container ends
▎ - Typo'd module / path → Python can exit before stdout is flushed
▎
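▎ 2. PYTHONUNBUFFERED=1 env var. If it's not set, Python block-buffers stdout when it isn't attached to a terminal, so a
▎ process that dies before the buffer is flushed gives you exactly what you're seeing: "Starting Container" then nothing.
▎ Your railway.toml start command already passes -u, which has the same effect, but the Dockerfile CMD (python main.py) does
▎ not, so add the variable regardless. Even if it's not the root cause, you stop flying blind on the next deploy.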
▎
▎ 3. Variables history. Same dashboard, Variables tab — diff env vars on the last good deploy vs current. Common pattern: a
▎ config validator does sys.exit(0) when an expected env var is missing and the app dies cleanly with no error.
▎
▎ 4. Healthcheck path. 502/000 on /health is a consequence of nothing listening, not the cause. Worth a quick glance to
▎ confirm /health is what the app actually exposes, but it's not where the bug lives.
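
To make item 2 concrete, here is a minimal Python sketch of how a process can exit 0 and still leave nothing in the logs. It is not taken from your app: the child program and the `run_child` helper are made up for the demo, and `os._exit(0)` stands in for any abrupt termination that skips the normal flush at interpreter shutdown. The child's stdout is a pipe, which is roughly how a container's log collector sees it.

```python
import os
import subprocess
import sys

# Hypothetical child program: prints one line, then terminates abruptly.
# os._exit skips interpreter cleanup, so buffered stdout is never flushed.
CHILD = 'import os; print("booting"); os._exit(0)'


def run_child(unbuffered: bool) -> tuple[int, str]:
    """Run the child with stdout piped; return (exit code, captured stdout)."""
    # Start from a clean slate so the parent's own setting does not leak in.
    env = {k: v for k, v in os.environ.items() if k != "PYTHONUNBUFFERED"}
    if unbuffered:
        env["PYTHONUNBUFFERED"] = "1"
    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout.strip()


if __name__ == "__main__":
    print("buffered:  ", run_child(unbuffered=False))
    print("unbuffered:", run_child(unbuffered=True))
```

On a normal exit, including `sys.exit`, Python flushes stdout on its way out, so buffering only hides output when the process is cut off abruptly. If the variable is already effectively set (via `-u`) and the logs are still empty, the exit paths in item 3 become the more likely suspects.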
▎
▎ On the suggestion from @0x5b62656e5d about removing the tini ENTRYPOINT: fair instinct and worth keeping in the back
▎ pocket. As I understand it, Railway's Custom Start Command overrides Docker's CMD while the ENTRYPOINT (/usr/bin/tini -- in
▎ your Dockerfile) still wraps whatever runs, so tini should be exec-ing your start command transparently; worth confirming
▎ against Railway's docs. I'd rule out the four items above first, since dropping tini costs you signal handling + zombie
▎ reaping, but if none of those land then yeah, trying without the ENTRYPOINT is a reasonable next step.
▎
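▎ A diagnostic that narrows it down in one deploy: temporarily paste this as Custom Start Command (note that env | sort
▎ prints every variable, secrets included, into the deploy logs, so remove it once you're done):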
▎
▎ sh -c "echo BOOT_OK && env | sort && exec python -u main.py"
▎
▎ Three outcomes:
▎ - BOOT_OK + env dump + your normal logs → the image and app are fine; the normal start command is the problem (possibly
▎ with stdout buffering hiding its output). Diff it against the working deploy.
▎ - BOOT_OK + env dump, then silent exit 0 → main.py is exiting early. Look at top-of-file config checks / sys.exit calls.
▎ - No BOOT_OK at all → cached / corrupt image layer, or the entrypoint angle the mod raised. Trigger a clean rebuild (small
▎ Dockerfile edit + redeploy without cache) and reassess.
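
If you'd rather not print raw variable values into the deploy logs, the same probe can be sketched in Python. This is only an illustration under assumptions: the file name `scripts/boot_probe.py`, the `SENSITIVE_MARKERS` list, and the `redacted_env` helper are all hypothetical, and the marker list is a guess at what counts as secret in your setup. It prints BOOT_OK, prints the environment with likely secrets masked, then replaces itself with whatever command follows on its command line.

```python
import os
import sys

# Hypothetical heuristic: variable names containing these substrings are masked.
SENSITIVE_MARKERS = ("KEY", "SECRET", "TOKEN", "PASSWORD", "URL")


def redacted_env(env):
    """Return sorted NAME=value lines, masking values of likely secrets."""
    lines = []
    for name in sorted(env):
        hide = any(marker in name.upper() for marker in SENSITIVE_MARKERS)
        lines.append(f"{name}={'<redacted>' if hide else env[name]}")
    return lines


def main(argv):
    # flush=True so the marker reaches the logs even if the app dies right after.
    print("BOOT_OK", flush=True)
    for line in redacted_env(os.environ):
        print(line, flush=True)
    if argv:
        # Replace this process with the real command, e.g. ["python", "-u", "main.py"].
        os.execvp(argv[0], argv)


if __name__ == "__main__":
    main(sys.argv[1:])
```

Under those assumptions the temporary start command would be `python -u scripts/boot_probe.py python -u main.py`, and the three outcomes above read the same way. Because the probe hands over with `os.execvp`, the app ends up as the same process the probe was, much like the `exec` in the shell one-liner.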