15 days ago
Summary
Every deployment to my service builds successfully and reports SUCCESS status, but the container immediately becomes
unresponsive (502) with zero runtime logs captured. Running railway restart on the exact same deployment
reliably brings the app back. This is 100% reproducible across 15+ consecutive deploys, two different regions,
and multiple Dockerfile configurations.
Project Details
- ProjectID: ad064a9e-1014-45d4-8a18-01a19ab78ffc
- Service: concierge (service ID: a02a89ba-84fc-4cee-9ee4-0bcc663dd174)
- Plan: Pro
- Builder: Dockerfile (Python 3.11-slim, multi-stage)
- App: Python FastAPI / uvicorn
- Runtime memory: ~141MB (flat/declining, well within limits)
Reproduction Steps
1. git push origin main (triggers auto-deploy)
2. Build completes successfully (~5-35s depending on cache)
3. Deployment status shows SUCCESS
4. All requests to the service return 502
5. railway logs shows zero runtime output — not a single line
6. Run railway restart --yes
7. App comes up healthy within 60 seconds, logs appear normally
This cycle repeats on every single deploy.
Evidence That This Is a Platform Issue
1. Memory is not the issue: Runtime usage is ~141MB, flat/declining. No spike visible in metrics during deploy. Pro
plan with plenty of headroom.
2. The application code works perfectly: railway restart proves the exact same Docker image runs fine. Same container,
same dependencies, same env vars — just a process restart vs a container swap.
3. Zero logs = process killed before the entrypoint runs: My startup script (start.py) prints as its very first action
with flush=True, and I have ENV PYTHONUNBUFFERED=1 in the Dockerfile. The first print statement never appears in
deploy logs, meaning the process is SIGKILL'd before the Python interpreter executes a single line.
4. Deploy vs Restart is the only variable: Deploy creates a NEW container from the built image. Restart restarts the
process within the EXISTING container. The container creation/swap step is where the failure occurs.
5. Not region-specific: Reproduced in both us-west1 and us-east4.
6. Not a build issue: Build always succeeds. Docker layers are valid. The image is the same one that runs successfully
after railway restart.
What I've Tried (none resolved it)
- Multi-stage Dockerfile (separate build/runtime stages to reduce runtime memory)
- Single-stage Dockerfile
- Adding overlapSeconds=15 and drainingSeconds=10 to railway.toml
- Removing overlap settings (back to defaults)
- Adding a 3-second startup delay before Python imports
- Changing healthcheck to return 503 until DB is ready
- Reverting healthcheck to always return 200
- Adding EXPOSE 8080 to Dockerfile
- Removing duplicate config files (Procfile, extra railway.toml, extra Dockerfile)
- Deleting orphan services from the project (went from 3 services to 1)
- Switching region from us-west1 to us-east4
Current Configuration
railway.toml:
[build]
builder = "dockerfile"
dockerfilePath = "Dockerfile"
[deploy]
healthcheckPath = "/health"
healthcheckTimeout = 60
restartPolicyType = "ON_FAILURE"
restartPolicyMaxRetries = 5
drainingSeconds = 10
Dockerfile:
FROM python:3.11-slim AS builder
WORKDIR /app
COPY backend/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
FROM python:3.11-slim
ENV PYTHONUNBUFFERED=1
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
COPY --from=builder /usr/local/bin /usr/local/bin
COPY backend/ .
EXPOSE 8080
CMD ["python", "start.py"]
start.py (first lines):
import time, os

if os.getenv("RAILWAY_ENVIRONMENT"):
    print("Railway detected, waiting 3s for container setup...", flush=True)
    time.sleep(3)

print("=== PBJ Start Wrapper ===", flush=True)
# ... imports and uvicorn.run()
None of these print statements appear in deploy logs. They all appear after railway restart.
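For contrast, a wrapper that binds as early as possible might look like the sketch below — no sleeps, no pre-flight work. The module path app.main:app is an assumption, and the entrypoint call is left to the container CMD:

```python
import os

def resolve_port(default: int = 8080) -> int:
    """Return the platform-injected PORT, falling back for local runs."""
    return int(os.getenv("PORT", str(default)))

def main() -> None:
    # Bind immediately so the server is listening as early as possible.
    # "app.main:app" is an assumed module path; uvicorn is imported here
    # so resolve_port stays importable without the server installed.
    import uvicorn
    uvicorn.run("app.main:app", host="0.0.0.0", port=resolve_port())

# The container entrypoint would call main(); the call is omitted here
# so the module can be imported for testing.
```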
Request
Could you investigate the container-level events (OOM kills, cgroup limits, SIGKILL signals) for the deployment IDs
listed above? Specifically:
1. What is killing the process before the entrypoint executes during a fresh deploy?
2. Why does the deployment report SUCCESS if the container immediately dies?
3. Is there a difference in resource allocation between deploy-created containers and restart-recycled containers?
4. Are there any known issues with the Dockerfile builder and container swap that could explain this?
The current workaround (railway restart after every deploy) works but is not sustainable. Happy to provide any
additional information needed.
Pinned Solution
13 days ago
I think the issue may be caused by the container failing the health check before the application finishes starting.
During a fresh deploy Railway creates a new container and immediately begins health checks. If the FastAPI/uvicorn process hasn't bound to the PORT yet, Railway may mark the container unhealthy and terminate it before any logs are flushed. This would explain why no runtime logs appear and why railway restart works — the restart happens after the container environment is already initialized so startup completes faster.
A few things to verify:
1. Ensure uvicorn binds to Railway's dynamic port:
uvicorn app:app --host 0.0.0.0 --port $PORT
2. Increase startup tolerance or temporarily disable health checks to confirm:
healthcheckTimeout = 120
3. Confirm the container is not exiting due to missing runtime dependencies. In multi-stage builds it's safer to copy the full Python environment:
COPY --from=builder /usr/local /usr/local
instead of copying only site-packages.
4. Another option is running uvicorn directly as the container entrypoint instead of a Python wrapper script so the server binds earlier.
If the container is being terminated by the platform health check before the process binds to the port, this behavior would match exactly what is happening here.
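The "health endpoint should pass immediately" idea in points 2 and 4 amounts to separating liveness from readiness. A framework-agnostic sketch of that split — the FastAPI route wiring is omitted and these handler names are assumptions, not the thread author's actual code:

```python
# Flipped by a background task once dependencies (e.g. the DB) connect.
state = {"db_ready": False}

def health() -> tuple[int, dict]:
    """Liveness: 200 as soon as the process is up, so the platform
    health check never kills a container that is still warming up."""
    return 200, {"status": "ok"}

def ready() -> tuple[int, dict]:
    """Readiness: 503 until dependencies are usable. Traffic gating
    belongs here, not in the liveness probe."""
    if state["db_ready"]:
        return 200, {"status": "ready"}
    return 503, {"status": "starting"}
```

Pointing the platform health check at the liveness handler keeps slow dependency setup from getting the container killed.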
4 Replies
Status changed to Awaiting User Response Railway • 15 days ago
ray-chen
Can you please explain the issue in one sentence without using an LLM?
14 days ago
It crashes when I deploy and I have to manually restart it every time.
Status changed to Awaiting Railway Response Railway • 14 days ago
14 days ago
This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.
Status changed to Open Railway • 14 days ago
13 days ago
pavankumar2812
(this reply is the pinned solution shown above)
12 days ago
Root cause: Container was being killed by Railway health checks before uvicorn could bind to the PORT.
Changes made:
1. Removed startup delays — start.py had a 3-second sleep and verbose import checks that delayed port binding.
Stripped it down to just running uvicorn.
2. Dockerfile CMD binds to $PORT directly — Switched to shell-form CMD (uvicorn app.main:app --host 0.0.0.0 --port
${PORT:-8080}) so the server reads Railway's dynamic PORT immediately.
3. Fixed multi-stage build — Changed COPY --from=builder to copy the full /usr/local directory instead of just
site-packages and /usr/local/bin, which was missing some Python binaries.
4. Increased health check timeout — Bumped healthcheckTimeout from 60s to 120s in railway.toml.
5. Removed blocking startup work — The Supabase client is still created on startup but no longer runs a test query that
blocks the server from becoming ready. The health endpoint always returns healthy so Railway's liveness check passes immediately.
6. Removed the deploy-restart.yml GitHub Actions workaround — No longer needed since deploys now succeed on their own.
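Changes 2 and 3 combined would give a Dockerfile along these lines (a sketch reconstructed from the description above; app.main:app is the module path stated in change 2):

```dockerfile
FROM python:3.11-slim AS builder
WORKDIR /app
COPY backend/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

FROM python:3.11-slim
ENV PYTHONUNBUFFERED=1
WORKDIR /app
# Change 3: copy the whole /usr/local tree so pip-installed console
# scripts and native libs are not left behind.
COPY --from=builder /usr/local /usr/local
COPY backend/ .
EXPOSE 8080
# Change 2: shell form so the dynamic ${PORT} is expanded at start.
CMD uvicorn app.main:app --host 0.0.0.0 --port ${PORT:-8080}
```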
Result: Deploy succeeds without needing railway restart.
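Change 5 amounts to moving dependency setup out of the startup critical path. A minimal lazy-initialization sketch — create_client here is a dummy stand-in for the real Supabase constructor, and the URL/key are placeholders:

```python
from functools import lru_cache

def create_client(url: str, key: str) -> dict:
    # Stand-in for supabase.create_client; returns a dummy handle here.
    return {"url": url, "key": key}

@lru_cache(maxsize=1)
def get_db() -> dict:
    # Built on first use and cached. Crucially, no test query runs at
    # import or startup time, so the server can bind and pass health
    # checks immediately.
    return create_client("https://example.supabase.co", "service-role-key")
```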
Status changed to Solved sam-a • 12 days ago