Deployment healthcheck fails even though candidate instance returns 200 in-container

Question

We are seeing repeated promotion failures for API deployments on Railway for our production service. The unpromoted candidate instance itself is healthy: from inside the container it returns 200 OK on both the configured healthcheck path (/healthz) and /api/health.

Project / service:

* Project: patient-love
* Environment: production
* Service: @agora/api
* Service ID: 32030303-6651-444c-9d91-d4bea44b30ed

Affected deployment:

* Deployment ID: 912720c3-5fe1-452b-b069-c83d944ad68c
* Commit: c8a10f5bf75542af3e3ab94ce92878fd3d2f6ad6
* Healthcheck path configured in Railway: /healthz
* Healthcheck timeout: 180s

Candidate instance:

* Deployment instance ID: 8f3cdf25-e447-48db-9e78-94bff32ca6f2

What Railway reports:

* Railway build/deploy logs show repeated healthcheck failures with the message "service unavailable"
* The deployment stays in DEPLOYING and eventually fails

What we verified directly:  
We SSHed into the exact candidate instance 8f3cdf25-e447-48db-9e78-94bff32ca6f2 while it was still unpromoted and ran the health probes against the local process on 127.0.0.1:3000.

Results from inside that exact candidate instance:

* HEAD /healthz -> 200
* HEAD /api/health -> 200
* GET /healthz -> 200
* GET /api/health -> 200
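
The in-container checks above can be reproduced with a short script. This is a minimal Python sketch, not the exact commands that were run, and it assumes Python 3 is available in the container. It also repeats each probe against the container's non-loopback address and with a `Host: healthcheck.railway.app` header. Both additions are assumptions about how a platform-side probe might differ from a loopback probe: the hostname should be confirmed against Railway's current healthcheck documentation, and `probe` is a hypothetical helper name.

```python
import http.client
import socket

PATHS = ("/healthz", "/api/health")
METHODS = ("HEAD", "GET")
PORT = 3000  # port the app listens on inside the container (from this report)
# Assumption: hostname a platform probe might send; verify against Railway docs.
PROBE_HOST_HEADER = "healthcheck.railway.app"


def probe(host, port, method, path, host_header=None, timeout=5.0):
    """Return the HTTP status code, or None if the request could not complete."""
    headers = {"Host": host_header} if host_header else {}
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request(method, path, headers=headers)
        response = conn.getresponse()
        response.read()  # drain the body so the connection closes cleanly
        return response.status
    except (OSError, http.client.HTTPException):
        return None
    finally:
        conn.close()


if __name__ == "__main__":
    targets = ["127.0.0.1"]
    try:
        # May resolve to loopback in some containers; treat as best effort.
        address = socket.gethostbyname(socket.gethostname())
        if address not in targets:
            targets.append(address)
    except OSError:
        pass
    for host in targets:
        for host_header in (None, PROBE_HOST_HEADER):
            for method in METHODS:
                for path in PATHS:
                    status = probe(host, PORT, method, path, host_header)
                    print(f"{host} Host={host_header} {method} {path} -> {status}")
```

If 127.0.0.1 returns 200 but the non-loopback address prints `None`, that would point at the bind address rather than at the health route. If every combination returns 200, the loopback result also holds for the interface and hostname tested here.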

Example response body from the candidate instance:

`{ "ok": true, "service": "api", "releaseId": "c8a10f5bf755", "gitSha": "c8a10f5bf75542af3e3ab94ce92878fd3d2f6ad6", "runtimeVersion": "c8a10f5bf755", "identitySource": "provider_env", "checkedAt": "2026-03-28T02:28:10.381Z", "readiness": { "databaseSchema": { "ok": true, "contract": { "ok": true, "expected": "agora-runtime:2026-03-27:agent-notifications-v1", "actual": "agora-runtime:2026-03-27:agent-notifications-v1" }, "failures": [] }, "authoringPublishConfig": { "ok": true, "failures": [] } } }`
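
A response like the one above can be checked by script rather than by eye. The following is a minimal Python sketch. `readiness_problems` is a hypothetical helper, and the field names are taken only from the sample body, so the real schema may contain more than is handled here.

```python
import json


def readiness_problems(body, expected_git_sha=None):
    """Return a list of human-readable problems found in a health response body."""
    doc = json.loads(body)
    problems = []
    if doc.get("ok") is not True:
        problems.append("top-level ok is not true")
    if expected_git_sha and doc.get("gitSha") != expected_git_sha:
        problems.append(f"gitSha is {doc.get('gitSha')!r}, expected {expected_git_sha!r}")
    for name, check in doc.get("readiness", {}).items():
        if check.get("ok") is not True:
            problems.append(f"{name}: ok is not true")
        for failure in check.get("failures", []):
            problems.append(f"{name}: {failure}")
        contract = check.get("contract")
        if contract and contract.get("expected") != contract.get("actual"):
            problems.append(
                f"{name}: contract expected {contract.get('expected')!r}, "
                f"got {contract.get('actual')!r}"
            )
    return problems


# Trimmed version of the sample body, keeping only the fields checked above.
SAMPLE = json.dumps({
    "ok": True,
    "gitSha": "c8a10f5bf75542af3e3ab94ce92878fd3d2f6ad6",
    "readiness": {
        "databaseSchema": {
            "ok": True,
            "contract": {
                "ok": True,
                "expected": "agora-runtime:2026-03-27:agent-notifications-v1",
                "actual": "agora-runtime:2026-03-27:agent-notifications-v1",
            },
            "failures": [],
        },
        "authoringPublishConfig": {"ok": True, "failures": []},
    },
})

print(readiness_problems(SAMPLE, "c8a10f5bf75542af3e3ab94ce92878fd3d2f6ad6"))  # prints []
```

An empty list means the body reports ready on every check it exposes and matches the expected commit.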

App logs from the same candidate instance confirm successful responses:

* HEAD /healthz status 200
* HEAD /api/health status 200
* GET /healthz status 200
* GET /api/health status 200

At the same time, Railway’s build log for this deployment still reports:

* Attempt #1 failed with service unavailable
* ...
* Attempt #8 failed with service unavailable
* then 1/1 replicas never became healthy

Why we believe this is Railway-side:  
We already ruled out the common app causes:

* not a schema mismatch
* not app readiness returning 503
* not missing HEAD support
* not /healthz vs /api/health
* not route-order issues
* not failed candidates poisoning `worker_runtime_control`

The exact candidate replica is returning 200 for both HEAD and GET on the health endpoints, yet Railway still marks the deployment unhealthy.

Questions we need answered:

1. What exact HTTP method is Railway using for deploy healthchecks on this service?
2. What exact host/port/path is Railway probing internally?
3. Is Railway probing the candidate instance directly, or through a proxy/load balancer/edge path?
4. Is Railway generating a platform-side 503 before the request reaches the app?
5. Can you trace why deployment 912720c3-5fe1-452b-b069-c83d944ad68c was judged unhealthy when instance 8f3cdf25-e447-48db-9e78-94bff32ca6f2 was locally healthy and serving 200?
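
Questions 2 and 3 concern which host and port the probe actually reaches. One app-side data point that can be gathered in the meantime is which addresses the process is listening on, since a socket bound only to 127.0.0.1 would pass loopback probes yet be unreachable from outside the container. This is offered as a check, not as a diagnosis. The following is a minimal Python sketch that reads the Linux `/proc/net/tcp` and `/proc/net/tcp6` tables. It assumes a little-endian Linux host, and `listening_addresses` is a hypothetical helper name.

```python
import socket
import struct

LISTEN_STATE = "0A"  # TCP_LISTEN in the kernel's /proc/net/tcp encoding


def _decode_address(ip_hex):
    """Decode a /proc/net/tcp{,6} hex address (little-endian host assumed)."""
    words = [ip_hex[i:i + 8] for i in range(0, len(ip_hex), 8)]
    raw = b"".join(struct.pack("<I", int(word, 16)) for word in words)
    family = socket.AF_INET if len(raw) == 4 else socket.AF_INET6
    return socket.inet_ntop(family, raw)


def listening_addresses(table_text, port):
    """Return addresses with a LISTEN socket on `port` in a /proc/net/tcp-style table."""
    addresses = []
    for line in table_text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 4 or fields[3] != LISTEN_STATE:
            continue
        ip_hex, port_hex = fields[1].split(":")
        if int(port_hex, 16) == port:
            addresses.append(_decode_address(ip_hex))
    return addresses


if __name__ == "__main__":
    for table in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(table) as handle:
                print(table, listening_addresses(handle.read(), 3000))
        except OSError as error:
            print(table, "unreadable:", error)
```

A result of `['127.0.0.1']` or `['::1']` for port 3000 would mean only loopback clients can connect. `['0.0.0.0']` or `['::']` would mean the process accepts connections on all interfaces of that address family.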

Impact:

* Public API remains stuck on prior stable revision 93f6fe47c5e536c331a3912698fcf438d96826f5
* We can keep production healthy on the prior revision, but new API deployments cannot promote

Please escalate this to the infrastructure/platform team, because the application is healthy inside the candidate instance while Railway’s promotion path still reports it unhealthy.

**Internal Summary**  
Incident summary:

* Production schema/runtime outage was fixed earlier.
* Follow-up hardening was implemented, including preventing failed API candidates from poisoning `worker_runtime_control`.
* Current blocker is narrower: API deploys build and run, but Railway refuses promotion.

What we proved:

* Candidate API replica starts normally.
* Candidate API replica becomes healthy.
* Exact candidate instance returns:  
   * HEAD /healthz 200  
   * HEAD /api/health 200  
   * GET /healthz 200  
   * GET /api/health 200
* App logs confirm those successful probe responses.
* Railway still fails the deployment as unhealthy.

What this means:

* The remaining issue is not app readiness logic.
* The remaining issue is not schema drift.
* The remaining issue is not route behavior.
* The remaining issue is Railway’s healthcheck/promotion path, which is likely probing a different host, port, or path than the one we tested, or failing before the request reaches the app.

Current production state:

* Public API healthy on 93f6fe47c5e5
* Worker healthy on 93f6fe47c5e5
* `worker_runtime_control.active_runtime_version` = 93f6fe47c5e5
* Indexer healthy
* Scoring restored

Outstanding blocker:

* Railway must explain or fix why a healthy candidate instance is still judged unhealthy during promotion.