Container hangs silently at boot — only "Starting Container" log, local with same image+env vars boots fine

nudocowork

HOBBY · OP

2 months ago

Summary

New backend deployments build OK, container starts (Railway logs "Starting Container"), but the application produces zero stdout afterwards. Healthcheck fails for 10 min ("service unavailable") and the deploy is marked failed. The previously deployed container (running since 2026-05-17 04:17 AM GMT-5) is still Online and serving production normally.

Smoking gun: The same image with the same env vars boots fully when I run it locally via railway run --service backend -- node dist/main.js. So the bug is not the code or env vars — it appears to be how the new container instance is being created/run on Railway's side.

Evidence

Failed deployment: 6d185348-0dd3-4780-9de9-78e923966d1f
Image digest: sha256:b61b8093814ba998d3e497dfd48bdbd79c22357db77f13b4c6bd5168c803818f
Container instance status: EXITED

Captured Deploy Logs (full):


2026-05-17 15:36:40  Starting Container

(nothing else for 10 minutes)

startCommand: echo '>>> starting nest (no migrate)' ; node dist/main.js — the echo line never appears either.
Build Logs: build succeeds end-to-end (Docker multi-stage, image exported and pushed).
Healthcheck: 18 attempts, all "service unavailable", retry window 10 min.

Railway's auto-diagnose said "process stayed alive but idle", but that conclusion was based on stale Coupon-table errors in the old container's logs. Those migrations have since been applied to the DB, so they are not the cause.

Local proof the binary works

$ railway run --service backend -- node dist/main.js
[Boot] main.ts top — node version=v20.18.1
[Boot] instrument.ts top / done
[Boot] >>> NestFactory.create
[Nest] LOG [NestFactory] Starting Nest application...
[Nest] LOG [InstanceLoader] AppConfigModule dependencies initialized
... (50+ modules all OK)
[Boot] >>> NestFactory.create OK
[Boot] >>> app.listen 4949
[Nest] LOG [PrismaService] Prisma connected · multi-tenant middleware activo
[Nest] LOG [QueueService] BullMQ habilitado contra redis://...@tramway.proxy.rlwy.net:54113
[Nest] LOG [NestApplication] Nest application successfully started
$ curl http://localhost:4949/api/health
{"ok":true,...}

Same image, same env vars (verified via railway variables --service backend --kv). The binary boots fine outside Railway's runtime; it doesn't boot inside.

What we already tried (no effect)

Reset DATABASE_URL using template refs: postgresql://${{Postgres-Nq8w.PGUSER}}:${{Postgres-Nq8w.POSTGRES_PASSWORD}}@${{Postgres-Nq8w.RAILWAY_PRIVATE_DOMAIN}}:${{Postgres-Nq8w.PGPORT}}/${{Postgres-Nq8w.PGDATABASE}}
Switched DATABASE_URL to public proxy: ${{Postgres-Nq8w.DATABASE_PUBLIC_URL}} → resolves to tramway.proxy.rlwy.net:39155.
Switched REDIS_URL to public proxy: ${{Redis.REDIS_PUBLIC_URL}} → resolves to tramway.proxy.rlwy.net:54113.
Applied all pending Prisma migrations to DB (MenuTranslation, Card.minAmountPerStamp, drop Coupon, QuotePlan enum).
TCP reachability check from outside: nc -zv tramway.proxy.rlwy.net 39155 → ok; nc -zv tramway.proxy.rlwy.net 54113 → ok.
No env var in backend service references *.railway.internal (other than RAILWAY_PRIVATE_DOMAIN which is the backend's own private name).
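
To confirm from inside the process that the template refs above actually resolved, something like the following could be logged at the top of main.ts. This is only a sketch: summarizeConnectionUrl is a hypothetical helper, not part of Prisma or Railway, and it assumes the URL has the usual scheme://user:pass@host:port/db shape.

```typescript
// Hypothetical helper: returns a log-safe summary of a connection URL.
// Credentials are dropped, and a leftover "${{ ... }}" template reference
// (i.e. a variable that was never substituted) is flagged explicitly.
function summarizeConnectionUrl(raw: string): string {
  if (raw.includes("${{")) return "unresolved template reference";
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return "not a parseable URL";
  }
  const port = url.port || "(default)";
  return `${url.protocol}//${url.hostname}:${port}${url.pathname}`;
}

// Possible use at boot (stderr, so it is not held back by stdout buffering):
// console.error("[Boot] DATABASE_URL ->", summarizeConnectionUrl(process.env.DATABASE_URL ?? ""));
// console.error("[Boot] REDIS_URL    ->", summarizeConnectionUrl(process.env.REDIS_URL ?? ""));
```

This never prints the password, so the output should be safe to paste into a thread like this one.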

What's not the cause

App code: local proves it.
DB schema: all migrations applied, verified.
Env vars: identical to what local uses successfully.
Public proxies: reachable from outside Railway.
Image: builds cleanly, same digest works locally.

Questions for Railway

Why does the container for 6d185348-0dd3-4780-9de9-78e923966d1f produce zero stdout after "Starting Container"? Is the process actually running? Is its stdout connected to the log pipe?
Is there an egress / network policy preventing this service's new containers from reaching tramway.proxy.rlwy.net:39155 and :54113?
Should we recreate the backend service from scratch with new IDs? If so, can the api.soyclubify.com custom domain be moved without downtime?
Anything in your kernel / container runtime logs that would explain a silent stall when our process boots in ~5 seconds locally?
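
For the egress question, the nc checks so far were run from outside Railway. A rough sketch of the same check run from inside the container is below. PROBE_TARGETS is an assumed env var name and parseTarget, probeTcp and probeAll are hypothetical helpers; only the Node.js net module calls are real. It can only help if the process starts and its output reaches the logs, which is exactly what is in doubt here.

```typescript
import * as net from "node:net";

// Hypothetical helper: split "host:port" into its parts.
function parseTarget(target: string): { host: string; port: number } {
  const idx = target.lastIndexOf(":");
  if (idx <= 0) throw new Error(`expected host:port, got "${target}"`);
  const port = Number(target.slice(idx + 1));
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`invalid port in "${target}"`);
  }
  return { host: target.slice(0, idx), port };
}

// Resolves true if a TCP connection opens within timeoutMs, false on
// timeout or connection error. It never rejects.
function probeTcp(host: string, port: number, timeoutMs = 3000): Promise<boolean> {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });
    const finish = (ok: boolean) => {
      socket.destroy();
      resolve(ok);
    };
    socket.setTimeout(timeoutMs);
    socket.once("connect", () => finish(true));
    socket.once("timeout", () => finish(false));
    socket.once("error", () => finish(false));
  });
}

// raw is a comma-separated list such as
// "tramway.proxy.rlwy.net:39155,tramway.proxy.rlwy.net:54113".
async function probeAll(raw: string): Promise<void> {
  for (const target of raw.split(",").map((t) => t.trim()).filter(Boolean)) {
    const { host, port } = parseTarget(target);
    const ok = await probeTcp(host, port);
    console.error(`[Probe] ${target} -> ${ok ? "reachable" : "unreachable"}`);
  }
}
```

Calling probeAll(process.env.PROBE_TARGETS ?? "") before NestFactory.create would show whether the two proxies are reachable from the new container, independent of Prisma or BullMQ.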

Production is not affected (old container, uptime ~20h, still serving). We can wait — but the new code (which depends on these migrations and includes a translation feature) can only ship once we get a new container up.

Thanks!

$10 Bounty

2 Replies

Status changed to Open Railway • about 2 months ago

darseen

HOBBY · Top 1% Contributor

2 months ago

Your application won't be started until the healthcheck passes. That's why you don't see any logs.

The most common cause of a healthcheck "service unavailable" error is the app not listening on the port given in the PORT variable, or the PORT variable being omitted when target ports are in use.
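
A minimal sketch of what listening on PORT could look like. resolveListenPort is a hypothetical helper, and the 4949 fallback is only taken from the local log in the original post; whether the app currently reads PORT at all is not shown in the thread.

```typescript
// Hypothetical helper: picks the port to listen on. Uses PORT from the
// environment when it is a valid TCP port, otherwise the fallback.
// An invalid PORT throws instead of silently binding to the wrong port.
function resolveListenPort(
  env: Record<string, string | undefined>,
  fallback = 4949,
): number {
  const raw = env["PORT"];
  if (raw === undefined || raw.trim() === "") return fallback;
  const port = Number(raw);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`PORT must be an integer in 1..65535, got "${raw}"`);
  }
  return port;
}
```

In a NestJS main.ts this would be used roughly as await app.listen(resolveListenPort(process.env), "0.0.0.0"), so the server binds on all interfaces rather than only on localhost.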

gyanavkhandelwal6396-cmyk

FREE

2 months ago

Most likely causes are a stuck container runtime, broken stdout attachment, or networking/init failure during container startup; recreating the backend service (new service ID/container lineage) is probably the fastest path while Railway investigates deployment 6d185348-0dd3-4780-9de9-78e923966d1f at the platform level.

This looks like a Railway runtime/container issue rather than an application problem.
