Deployment healthcheck fails even though candidate instance returns 200 in-container
andymolecule
FREEOP

2 months ago

We are seeing repeated API deployment promotion failures on Railway for our production service, even though the exact unpromoted candidate instance is healthy and returns 200 OK for the configured healthcheck paths from inside the container.

Project / service:

  • Project: patient-love
  • Environment: production
  • Service: @agora/api
  • Service ID: 32030303-6651-444c-9d91-d4bea44b30ed

Affected deployment:

  • Deployment ID: 912720c3-5fe1-452b-b069-c83d944ad68c
  • Commit: c8a10f5bf75542af3e3ab94ce92878fd3d2f6ad6
  • Healthcheck path configured in Railway: /healthz
  • Healthcheck timeout: 180s

Candidate instance:

  • Deployment instance ID: 8f3cdf25-e447-48db-9e78-94bff32ca6f2

What Railway reports:

  • Railway build/deploy logs show repeated healthcheck failures with service unavailable
  • The deployment stays in DEPLOYING and eventually fails

What we verified directly:

We SSHed into the exact candidate instance 8f3cdf25-e447-48db-9e78-94bff32ca6f2 while it was still unpromoted and ran the health probes against the local process on 127.0.0.1:3000.

Results from inside that exact candidate instance:

  • HEAD /healthz -> 200
  • HEAD /api/health -> 200
  • GET /healthz -> 200
  • GET /api/health -> 200

Example response body from the candidate instance:

{
  "ok": true,
  "service": "api",
  "releaseId": "c8a10f5bf755",
  "gitSha": "c8a10f5bf75542af3e3ab94ce92878fd3d2f6ad6",
  "runtimeVersion": "c8a10f5bf755",
  "identitySource": "provider_env",
  "checkedAt": "2026-03-28T02:28:10.381Z",
  "readiness": {
    "databaseSchema": {
      "ok": true,
      "contract": {
        "ok": true,
        "expected": "agora-runtime:2026-03-27:agent-notifications-v1",
        "actual": "agora-runtime:2026-03-27:agent-notifications-v1"
      },
      "failures": []
    },
    "authoringPublishConfig": {
      "ok": true,
      "failures": []
    }
  }
}

App logs from the same candidate instance confirm successful responses:

  • HEAD /healthz status 200
  • HEAD /api/health status 200
  • GET /healthz status 200
  • GET /api/health status 200

At the same time, Railway’s build log for this deployment still reports:

  • Attempt #1 failed with service unavailable
  • ...
  • Attempt #8 failed with service unavailable
  • then 1/1 replicas never became healthy

Why we believe this is Railway-side:

We already ruled out the common app causes:

  • not a schema mismatch
  • not app readiness returning 503
  • not missing HEAD support
  • not /healthz vs /api/health
  • not route-order issues
  • not worker/runtime poisoning

The exact candidate replica is returning 200 for both HEAD and GET on the health endpoints, yet Railway still marks the deployment unhealthy.

Questions we need answered:

  1. What exact HTTP method is Railway using for deploy healthchecks on this service?
  2. What exact host/port/path is Railway probing internally?
  3. Is Railway probing the candidate instance directly, or through a proxy/load balancer/edge path?
  4. Is Railway generating a platform-side 503 before the request reaches the app?
  5. Can you trace why deployment 912720c3-5fe1-452b-b069-c83d944ad68c was judged unhealthy when instance 8f3cdf25-e447-48db-9e78-94bff32ca6f2 was locally healthy and serving 200?
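
Questions 1, 2 and 4 can also be approached from the app side by logging every incoming request before any routing or filtering runs. The sketch below is only an illustration: `RequestInfo`, `classifyOrigin` and `describeRequest` are hypothetical names, and the assumption is that the fields would be filled from the framework's request object (in Express, roughly `req.method`, `req.originalUrl`, `req.headers.host` and `req.socket.remoteAddress`).

```typescript
// Hypothetical helpers for logging who is probing the app; not part of any Railway API.
interface RequestInfo {
  method: string;
  path: string;
  hostHeader?: string;
  remoteAddress?: string;
}

// Classify the peer address so loopback probes (e.g. curl over SSH) can be
// told apart from probes arriving over the container's network interface.
function classifyOrigin(address: string | undefined): "loopback" | "external" | "unknown" {
  if (!address) return "unknown";
  if (address === "::1" || address.startsWith("127.") || address.startsWith("::ffff:127.")) {
    return "loopback";
  }
  return "external";
}

// Build one log line with the method, path, Host header and peer address.
function describeRequest(info: RequestInfo): string {
  const host = info.hostHeader ?? "-";
  const from = info.remoteAddress ?? "-";
  return `${info.method} ${info.path} host=${host} from=${from} (${classifyOrigin(info.remoteAddress)})`;
}

// Example input with made-up values.
const exampleLine = describeRequest({
  method: "GET",
  path: "/healthz",
  hostHeader: "healthcheck.railway.app",
  remoteAddress: "10.0.0.5",
});
```

If the only lines that ever show up are tagged `loopback`, the platform probe is probably not reaching the process at all, which would narrow the problem to the bind address, the port, or something in front of the app.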

Impact:

  • Public API remains stuck on prior stable revision 93f6fe47c5e536c331a3912698fcf438d96826f5
  • We can keep production healthy on the prior revision, but new API deployments cannot promote

Please escalate this to the infrastructure/platform team, because the application is healthy inside the candidate instance while Railway’s promotion path still reports it unhealthy.

Internal Summary

Incident summary:

  • Production schema/runtime outage was fixed earlier.
  • Follow-up hardening was implemented, including preventing failed API candidates from poisoning worker_runtime_control.
  • Current blocker is narrower: API deploys build and run, but Railway refuses promotion.

What we proved:

  • Candidate API replica starts normally.
  • Candidate API replica becomes healthy.
  • Exact candidate instance returns:
    • HEAD /healthz 200
    • HEAD /api/health 200
    • GET /healthz 200
    • GET /api/health 200
  • App logs confirm those successful probe responses.
  • Railway still fails the deployment as unhealthy.

What this means:

  • The remaining issue is not app readiness logic.
  • The remaining issue is not schema drift.
  • The remaining issue is not route behavior.
  • The remaining issue is Railway’s healthcheck/promotion path, likely probing a different path or failing before reaching the app.

Current production state:

  • Public API healthy on 93f6fe47c5e5
  • Worker healthy on 93f6fe47c5e5
  • worker_runtime_control.active_runtime_version = 93f6fe47c5e5
  • Indexer healthy
  • Scoring restored

Outstanding blocker:

  • Railway must explain or fix why a healthy candidate instance is still judged unhealthy during promotion.
$10 Bounty

1 Reply

Status changed to Awaiting Railway Response Railway about 2 months ago


Status changed to Open Railway about 2 months ago


andreahlert
PRO

2 months ago

hey, this is almost certainly a PORT/hostname mismatch between what your app binds to and what Railway’s healthcheck prober actually hits. Railway’s healthcheck uses the internal IP on whatever port the PORT env var is set to, and it sends requests with the hostname healthcheck.railway.app. so even though your app responds fine on 127.0.0.1:3000 via SSH, if your app isn’t binding to 0.0.0.0 (or ::) on process.env.PORT, or if you have any hostname validation/allowed hosts middleware rejecting requests from healthcheck.railway.app, Railway’s prober will get a 503 before it ever touches your route handler.

i’d double check three things:

  1. your app is calling app.listen(Number(process.env.PORT), '0.0.0.0') and not hardcoding port 3000
  2. you don’t have a Dockerfile EXPOSE directive exposing multiple ports (Railway can get confused by that)
  3. if you’re using any framework level host filtering (like a CORS/allowed hosts check), make sure healthcheck.railway.app is whitelisted

the fact that you’re getting “service unavailable” specifically (not a 4xx or timeout) strongly points to Railway’s internal proxy not being able to route to your container on the expected port, which is the classic PORT mismatch symptom.
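
a rough sketch of checks (1) and (3) in TypeScript, in case it helps. everything here is illustrative: `resolveListenTarget`, `isHostAllowed` and the 3000 fallback are made-up names and values, and the healthcheck hostname is simply the one mentioned above. the environment is passed in as a plain object so the helpers stay framework-agnostic.

```typescript
// Hypothetical helpers; names and the fallback port are assumptions for illustration.
interface ListenTarget {
  port: number;
  host: string;
}

// Take the port from the PORT variable instead of hardcoding it, and bind to
// 0.0.0.0 so connections on non-loopback interfaces are accepted as well.
function resolveListenTarget(
  env: Record<string, string | undefined>,
  fallbackPort: number = 3000,
): ListenTarget {
  const raw = env["PORT"];
  const parsed = raw === undefined ? NaN : Number(raw);
  const valid = Number.isInteger(parsed) && parsed > 0 && parsed < 65536;
  return { port: valid ? parsed : fallbackPort, host: "0.0.0.0" };
}

// Hostname the healthcheck requests are said to carry (taken from the reply above).
const HEALTHCHECK_HOST = "healthcheck.railway.app";

// Check a Host header against an allow-list, ignoring case and an optional :port suffix.
function isHostAllowed(hostHeader: string | undefined, allowed: readonly string[]): boolean {
  if (!hostHeader) return false;
  const hostname = hostHeader.trim().toLowerCase().replace(/:\d+$/, "");
  return allowed.some((entry) => entry.toLowerCase() === hostname);
}

// Example: with PORT=8080 set, the app would listen on 0.0.0.0:8080.
const target = resolveListenTarget({ PORT: "8080" });
const probeAllowed = isHostAllowed("healthcheck.railway.app:8080", ["api.example.com", HEALTHCHECK_HOST]);
```

in an Express app this would end up as something like `app.listen(target.port, target.host)` with `resolveListenTarget(process.env)`, and the allow-list check would run before the health routes.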

