Deployment shows SUCCESS but container silently dies with zero runtime logs — restart always fixes it
im-el-bigote
PRO OP

15 days ago

Summary

  Every deployment to my service builds successfully, reports SUCCESS status, but the container immediately becomes unresponsive (502) with absolutely zero runtime logs captured. Running railway restart on the exact same deployment always brings the app back perfectly. This is 100% reproducible across 15+ consecutive deploys, two different regions, and multiple Dockerfile configurations.

Project Details

  - ProjectID: ad064a9e-1014-45d4-8a18-01a19ab78ffc

  - Service: concierge (service ID: a02a89ba-84fc-4cee-9ee4-0bcc663dd174)

  - Plan: Pro

  - Builder: Dockerfile (Python 3.11-slim, multi-stage)

  - App: Python FastAPI / uvicorn

  - Runtime memory: ~141 MB (flat/declining, well within limits)

Reproduction Steps

  1. git push origin main (triggers auto-deploy)

  2. Build completes successfully (~5-35s depending on cache)

  3. Deployment status shows SUCCESS

  4. All requests to the service return 502

  5. railway logs shows zero runtime output — not a single line

  6. Run railway restart --yes

  7. App comes up healthy within 60 seconds, logs appear normally

  This cycle repeats on every single deploy.
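The manual cycle above can be scripted while the bug persists. This is only a sketch: the helper names are hypothetical, and `railway restart --yes` is the command from step 6.

```python
import subprocess


def deploy_looks_dead(status: str, runtime_logs: list[str]) -> bool:
    """Heuristic for the failure signature described above: the platform
    reports SUCCESS but not a single runtime log line was ever emitted."""
    return status == "SUCCESS" and len(runtime_logs) == 0


def restart_service() -> None:
    """Apply the manual workaround (step 6) non-interactively."""
    subprocess.run(["railway", "restart", "--yes"], check=True)
```

Checking the deployment status and fetching runtime logs would still have to go through the Railway CLI or API; the heuristic only encodes the "SUCCESS but silent" signature.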

Evidence That This Is a Platform Issue

  1. Memory is not the issue: Runtime usage is ~141 MB, flat/declining. No spike visible in metrics during deploy. Pro plan with plenty of headroom.

  2. The application code works perfectly: railway restart proves the exact same Docker image runs fine. Same container, same dependencies, same env vars — just a process restart vs. a container swap.

  3. Zero logs = process killed before the entrypoint runs: My startup script (start.py) prints as its very first action with flush=True, and I have ENV PYTHONUNBUFFERED=1 in the Dockerfile. The first print statement never appears in deploy logs, meaning the process is SIGKILL'd before the Python interpreter executes a single line.

  4. Deploy vs. restart is the only variable: Deploy creates a NEW container from the built image. Restart restarts the process within the EXISTING container. The container creation/swap step is where the failure occurs.

  5. Not region-specific: Reproduced in both us-west1 and us-east4.

  6. Not a build issue: Build always succeeds. Docker layers are valid. The image is the same one that runs successfully after railway restart.

What I've Tried (none resolved it)

  - Multi-stage Dockerfile (separate build/runtime stages to reduce runtime memory)

  - Single-stage Dockerfile

  - Adding overlapSeconds=15 and drainingSeconds=10 to railway.toml

  - Removing overlap settings (back to defaults)

  - Adding a 3-second startup delay before Python imports

  - Changing healthcheck to return 503 until DB is ready

  - Reverting healthcheck to always return 200

  - Adding EXPOSE 8080 to Dockerfile

  - Removing duplicate config files (Procfile, extra railway.toml, extra Dockerfile)

  - Deleting orphan services from the project (went from 3 services to 1)

  - Switching region from us-west1 to us-east4

Current Configuration

railway.toml:

  [build]

  builder = "dockerfile"

  dockerfilePath = "Dockerfile"

  [deploy]

  healthcheckPath = "/health"

  healthcheckTimeout = 60

  restartPolicyType = "ON_FAILURE"

  restartPolicyMaxRetries = 5

  drainingSeconds = 10

Dockerfile:

  FROM python:3.11-slim AS builder

  WORKDIR /app

  COPY backend/requirements.txt .

  RUN pip install --no-cache-dir -r requirements.txt

  FROM python:3.11-slim

  ENV PYTHONUNBUFFERED=1

  WORKDIR /app

  COPY --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages

  COPY --from=builder /usr/local/bin /usr/local/bin

  COPY backend/ .

  EXPOSE 8080

  CMD ["python", "start.py"]

start.py (first lines):

  import time, os

  if os.getenv("RAILWAY_ENVIRONMENT"):

      print("Railway detected, waiting 3s for container setup...", flush=True)

      time.sleep(3)

  print("=== PBJ Start Wrapper ===", flush=True)

  # ... imports and uvicorn.run()

  None of these print statements appear in deploy logs. They all appear after railway restart.
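For context, a minimal version of such a wrapper, with no sleeps before binding, could look like this (resolve_port and the app.main:app module path are illustrative, not the actual project layout):

```python
import os


def resolve_port(default: int = 8080) -> int:
    """Return the port injected via $PORT, falling back to a default."""
    raw = os.environ.get("PORT", "")
    return int(raw) if raw.isdigit() else default


def main() -> None:
    # Bind immediately: no sleeps, no pre-import "container setup" waits.
    import uvicorn  # imported here so resolve_port stays usable without uvicorn installed

    uvicorn.run("app.main:app", host="0.0.0.0", port=resolve_port())
```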

Request

  Could you investigate the container-level events (OOM kills, cgroup limits, SIGKILL signals) for the deployment IDs

  listed above? Specifically:

  1. What is killing the process before the entrypoint executes during a fresh deploy?

  2. Why does the deployment report SUCCESS if the container immediately dies?

  3. Is there a difference in resource allocation between deploy-created containers and restart-recycled containers?

  4. Are there any known issues with the Dockerfile builder and container swap that could explain this?

  The current workaround (railway restart after every deploy) works but is not sustainable. Happy to provide any

  additional information needed.

Solved · $20 Bounty

Pinned Solution

pavankumar2812
FREE

13 days ago

I think the issue may be caused by the container failing the health check before the application finishes starting.

During a fresh deploy Railway creates a new container and immediately begins health checks. If the FastAPI/uvicorn process hasn't bound to the PORT yet, Railway may mark the container unhealthy and terminate it before any logs are flushed. This would explain why no runtime logs appear and why railway restart works — the restart happens after the container environment is already initialized so startup completes faster.

A few things to verify:

1. Ensure uvicorn binds to Railway's dynamic port:

uvicorn app:app --host 0.0.0.0 --port $PORT

2. Increase startup tolerance or temporarily disable health checks to confirm:

healthcheckTimeout = 120

3. Confirm the container is not exiting due to missing runtime dependencies. In multi-stage builds it's safer to copy the full Python environment:

COPY --from=builder /usr/local /usr/local

instead of copying only site-packages.

4. Another option is running uvicorn directly as the container entrypoint instead of a Python wrapper script so the server binds earlier.

If the container is being terminated by the platform health check before the process binds to the port, this behavior would match exactly what is happening here.
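Point 2 above, applied to the railway.toml shown earlier in the thread, might look like this (a sketch; commenting out healthcheckPath disables the check entirely for the test):

```
  [deploy]
  # Temporarily disable the check to confirm the hypothesis:
  # healthcheckPath = "/health"
  healthcheckTimeout = 120
```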

4 Replies

ray-chen

15 days ago

Can you please explain the issue in one sentence without using an LLM?


Status changed to Awaiting User Response Railway 15 days ago


ray-chen

Can you please explain the issue in one sentence without using an LLM?

im-el-bigote
PRO OP

14 days ago

It crashes when I deploy and I have to manually restart every time.


Status changed to Awaiting Railway Response Railway 14 days ago


Railway
BOT

14 days ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open Railway 14 days ago



im-el-bigote
PRO OP

12 days ago

Root cause: Container was being killed by Railway health checks before uvicorn could bind to the PORT.

Changes made:

  1. Removed startup delays — start.py had a 3-second sleep and verbose import checks that delayed port binding. Stripped it down to just running uvicorn.

  2. Dockerfile CMD binds to $PORT directly — Switched to shell-form CMD (uvicorn app.main:app --host 0.0.0.0 --port ${PORT:-8080}) so the server reads Railway's dynamic PORT immediately.

  3. Fixed multi-stage build — Changed COPY --from=builder to copy the full /usr/local directory instead of just site-packages and /usr/local/bin, which was missing some Python binaries.

  4. Increased healthcheck timeout — Bumped healthcheckTimeout from 60s to 120s in railway.toml.

  5. Removed blocking startup work — Supabase client is created on startup but no longer runs a test query that blocks the server from being ready. Health endpoint always returns healthy so Railway's liveness check passes immediately.

  6. Removed the deploy-restart.yml GitHub Actions workaround — No longer needed since deploys now succeed on their own.

Result: Deploy succeeds without needing railway restart.
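Change 5 can be sketched as a background-initialization pattern (names are illustrative; the real service initializes a Supabase client):

```python
import threading
import time

# Readiness flag the health endpoint could expose separately from liveness.
ready = threading.Event()


def init_backend() -> None:
    """Stand-in for slow startup work (e.g., a database connectivity
    check) moved off the startup path so the server can bind first."""
    time.sleep(0.05)  # placeholder for the real initialization
    ready.set()


# Start initialization in the background; /health can return 200
# immediately while ready.is_set() reports whether the backend is up.
worker = threading.Thread(target=init_backend, daemon=True)
worker.start()
```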


Status changed to Solved sam-a 12 days ago

