Persistent fallback to old v1.4 despite correct new classical api_app.py

marunigno-ship-it

PRO

OP

22 days ago

New deployments build successfully but always crash on healthcheck and fallback to old v1.4 with ML. Correct files on GitHub (api_app.py + vector_loader.py). Custom Build Command seems stuck. Please help

$20 Bounty

4 Replies

Railway

BOT

22 days ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open Railway • 22 days ago

marunigno-ship-it

PRO

OP

22 days ago

OK FREN

imsushant2005

FREE

22 days ago

The deployment is not failing at build time. The build completes successfully, but the new release crashes during the health check, so the platform marks it unhealthy and falls back to the previous stable v1.4 deployment.

The likely issue is that the Custom Build/Start Command is not being applied correctly, or the new app is not starting with the expected entrypoint.

Please verify that the deployment is using the latest GitHub files:

api_app.py

vector_loader.py

Also please confirm the service is starting from api_app.py, binding to the correct host/port, and that the health-check endpoint returns HTTP 200.

This is critical because every new deployment builds successfully but fails health check and rolls back to old v1.4.
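
For anyone who wants to run the "returns HTTP 200" check by hand, here is a minimal sketch using only the Python standard library. The function name `probe_health`, the default `/health` path, and the timeout value are illustrative assumptions; this is not Railway's own probe and not code from the repository.

```python
import urllib.error
import urllib.request
from typing import Tuple


def probe_health(base_url: str, path: str = "/health", timeout: float = 5.0) -> Tuple[bool, str]:
    """Send one GET to the health endpoint and return (healthy, detail)."""
    url = base_url.rstrip("/") + path
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # Only a plain 200 counts as healthy in this sketch.
            return resp.status == 200, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (404, 500, ...).
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # Nothing listening on that host/port, DNS failure, timeout, ...
        return False, f"unreachable: {exc}"
```

Calling `probe_health("http://127.0.0.1:8000")` against a locally started instance separates two cases that look the same from the outside: the app answers with a non-200 status, or nothing is reachable on that port at all.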

marunigno-ship-it

PRO

OP

22 days ago

Thank you for your reply, @imsushant2005. We have triple-checked all the points you mentioned:

- api_app.py and vector_loader.py are both present and correct on the main branch (I just verified the raw files).
- Procfile is: web: uvicorn api_app:app --host 0.0.0.0 --port $PORT
- Healthcheck path in railway.toml is /health
- The app starts successfully (logs show "Application startup complete" and Uvicorn running on 0.0.0.0:8000)

However, the healthcheck still fails with "1/1 replicas never became healthy!" even after increasing timeout to 1200 seconds.

The new classical version never becomes live — Railway always falls back to the old v1.4 (with ML).

There is also a stuck Custom Build Command in the Railway UI (pip install -r requirements.txt && python -c "from sentence_transformers...") that I cannot remove, even though the new code doesn't use ML.

Any idea why the container starts but the healthcheck never passes?
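
One thing that may be worth ruling out, given the Procfile and the log line quoted above: whether the port Uvicorn reports (8000) is the same port the platform passes in `PORT`. The sketch below compares the two. It assumes the healthcheck is sent to the port named in `PORT`, and the helper names `listening_port` and `port_mismatch` are hypothetical, not part of Uvicorn or Railway.

```python
import os
import re
from typing import Mapping, Optional

# Matches the address in a startup line such as
# "Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)".
_LISTEN_RE = re.compile(r"(?:https?://)?(\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def listening_port(log_line: str) -> Optional[int]:
    """Extract the port from a 'running on host:port' log line, if present."""
    match = _LISTEN_RE.search(log_line)
    return int(match.group(2)) if match else None


def port_mismatch(log_line: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Describe a mismatch between the logged port and PORT, or return None."""
    expected = env.get("PORT")
    actual = listening_port(log_line)
    if expected is None or actual is None:
        return None  # nothing to compare
    if not expected.isdigit():
        return f"PORT is not numeric: {expected!r}"
    if int(expected) != actual:
        return f"app listens on {actual} but PORT={expected}"
    return None
```

If the two values differ, the process can log "Application startup complete" and still never answer the probe, which would match the symptom described. If they agree, this cause can be crossed off.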

Thank you.

marunigno-ship-it

PRO

OP

21 days ago

Persistent deployment failure with 4 yellow triangles in a row – new classical version never becomes live despite following all suggestions

Dear Railway Support Team,

I am writing about a persistent and severe deployment issue that has been ongoing for several days on my service.

Project/Service details:

GitHub repo: marunigno-ship-it/QERRA-v2-api (main branch)
Service: QERRA-v2-api-production
Public domain: qerra-v2-api-production.up.railway.app

Problem: Every new deployment of my clean classical api_app.py (1.4-classical with real SEMEV vectors) results in:
- Successful build
- Multiple yellow triangles (currently 4 in a row)
- Healthcheck failure after long retries
- Automatic fallback to the old v1.4 version with ML layer

Live /health always returns:

{"status":"ok","version":"1.4","ml":{"enabled":false,"model_name":"paraphrase-MiniLM-L3-v2","loaded":false,"error":null}}

What has been tried (following your team’s suggestions):
- Correct api_app.py, vector_loader.py, Procfile, requirements.txt
- Updated railway.toml with explicit startCommand using $PORT and timeout 1200s
- Removed startCommand as per latest diagnosis (letting Railpack auto-detect)
- Multiple redeploys
- Increased healthcheck timeout

The latest diagnosis from your system said the startCommand with $PORT was causing the crash, so I removed it. Still the same problem: 4 yellow triangles and fallback to the old version.

This is blocking the entire project. I am on a paid plan and this level of repeated failure is not acceptable.

Could you please investigate from the backend:
- Why the new deployment never passes healthcheck even after removing startCommand
- Why the public domain is not routing to the latest green deployment
- Whether there is a stale Custom Build Command or routing cache

I can provide screenshots, deployment IDs, full logs, or any other information needed.

Thank you for your urgent assistance.

Best regards,
Marussa Metocharaki
QERRA-v2 Project
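
A small sketch of how the /health body quoted above could be checked to tell which release is actually serving the public domain. The function name `check_live_version` is hypothetical, and it is an assumption that the new build reports "1.4-classical" in its `version` field; substitute whatever string the new api_app.py really returns.

```python
import json
from typing import Tuple


def check_live_version(body: str, expected_version: str) -> Tuple[bool, str]:
    """Compare the 'version' field of a /health JSON body with the expected one."""
    payload = json.loads(body)
    version = payload.get("version")
    if version == expected_version:
        return True, f"live deployment reports {version!r}"
    detail = f"live deployment reports {version!r}, expected {expected_version!r}"
    if "ml" in payload:
        # The old release is described as the one that ships the ML layer.
        detail += " (response still contains an 'ml' block)"
    return False, detail
```

Run against the body quoted in this post with `expected_version="1.4-classical"`, the check reports a mismatch, which is consistent with the domain still being served by the old v1.4 deployment.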
