Persistent Health Check Routing Failures in EU West Region - Service Down

oronico

PROOP

9 days ago

Service: @schoolstackbudget/api-server

Environment: production

Region: europe-west4-drams3a (EU West)

Project ID: ac396ee9-0404-4498-866e-6e9228964aa6

Issue: My API service is experiencing persistent health check failures, preventing deployments from succeeding. The application itself is healthy and starts correctly, but Railway's health check requests are not reaching the container.

Symptoms:

Application starts successfully, runs migrations, and listens on port 8080

No application errors or missing variables

Health check requests fail to route to the container during deployment

Same commit has succeeded once (May 19, 18:18 UTC) but failed 7+ times after with zero code changes between deployments

Failure rate: 7 out of 8 deployments of the same commit

Failed Deployment IDs:

813bffce-d1e3-4e71-9267-c2077034cd6a (May 20, 16:55 UTC)

27946796-62fa-457f-95be-2c502ca7ca57 (May 20, 16:39 UTC)

b264fef9-5e17-4a90-9df6-a95698b20674 (May 20, 17:03 UTC)

2f895ae1-173f-4fe0-bf90-89c7c3f7025c (May 20, 17:07 UTC)

Successful Deployment ID (for comparison):

4d74813d-0fdf-42ee-b060-27a2426699b5 (May 19, 18:18 UTC)

Diagnosis: Railway's deployment diagnostics indicate: "The application starts correctly and listens on port 8080, but healthcheck requests are not reaching the container. This has happened on 7 out of 8 deployments of the same commit with no code or config changes between them."

Impact: Site is currently down. This appears to be a persistent routing issue in the eu-west4 region that redeploying does not reliably resolve.

Request: Please investigate the routing/load balancer configuration for eu-west4-drams3a and determine why health check requests are failing to reach healthy containers.

Awaiting Railway Response

9 Replies

Status changed to Awaiting Railway Response Railway • 9 days ago

sam-a

EMPLOYEE

9 days ago

Apologies for this canned message but in an effort to help all our customers get back up and running, we are sending this bulk message. As you may know, we had a major interruption to our services yesterday. We've published a post-mortem if you'd like more information on the incident. It describes what happened and what we are doing to prevent it in the future. We are deeply sorry for the impact that it has had on you.

It is taking some time to bring everything back up, but we are working on it as fast as we can. In general, a redeployment should fix most service issues. Due to the volume of customers redeploying right now, builds and deploys may take longer than normal to process.

You can track recovery status here: https://status.railway.com/incident/KVZ1Z8GY

If you are still having other issues that might be related to the incident you can read more here: https://station.railway.com/community/road-to-recovery-post-gcp-outage-builds-d362e48c

Feel free to respond if your question has not been addressed.

Status changed to Awaiting User Response Railway • 9 days ago

sam-a

EMPLOYEE

9 days ago

You can track recovery status here: https://status.railway.com/incident/KVZ1Z8GY

If you are still having other issues that might be related to the incident you can read more here: https://station.railway.com/community/road-to-recovery-post-gcp-outage-builds-d362e48c

Feel free to respond if your question has not been addressed.

oronico

PROOP

9 days ago

I NEED URGENT HELP! The only thing left to do is delete everything and start over fresh with GitHub. Please help!!!

Status changed to Awaiting Railway Response Railway • 9 days ago

mykal

EMPLOYEE

9 days ago

Please do not delete your project - that would cause permanent data loss and will not help here. Your application is starting correctly and listening on port 8080 with no errors.

As a temporary workaround, remove the health check path from your service settings (Settings > Deploy > Healthcheck Path, clear the field and save), then redeploy. This will allow your deployment to go live without waiting for a health check response. You can find more details on health check configuration here: https://docs.railway.com/deployments/healthchecks

Once your service is back online, we'd recommend re-adding the health check and investigating why your container isn't responding to health check requests during deployment.

Status changed to Awaiting User Response Railway • 9 days ago

Status changed to Solved mykal • 9 days ago

oronico

PROOP

9 days ago

The site has been down all day and once again the health checks aren't working. I need help. Everything was working before your issues today, and now my site is down.

Status changed to Awaiting Railway Response Railway • 9 days ago

chandrika

EMPLOYEE

9 days ago

Hey, sorry your site has been down all day and for the duplicate canned responses earlier.

Did you get a chance to try the workaround Mykal suggested? Removing the health check path temporarily (Settings > Deploy > Healthcheck Path, clear the field and save) and then redeploying should get your service back online while we sort out the underlying routing issue in EU West. Your app is healthy, it's the health check routing that's failing.

Status changed to Awaiting User Response Railway • 9 days ago

oronico

PROOP

8 days ago

Appreciate your help. The site is still giving me a 502 error. I'm going to try to shift the front end to Netlify and use Railway for the API and Postgres because I don't know what else to do.

Status changed to Awaiting Railway Response Railway • 8 days ago

oronico

PROOP

8 days ago

We are still experiencing a production outage on a Railway API service.

Service/domain: schoolstackbudget.up.railway.app

Public requests to /health and /api/ready return Railway’s “Application failed to respond” page / 502.

Example Request IDs:

- 1k51WwPMS7apm5EayCLmYg
- 1xav573zT02Q8h8GyCLmYg

The container deploy logs show the app booted successfully and is listening:

[preflight] PREFLIGHT_SKIP=1 set — skipping ledger/schema gate
[migrate] Schema up to date
[startup] WARN: R2 boot probe SKIPPED via SKIP_R2_BOOT_PROBE=1
[seed] SKIP_PREVIEW_SEED=true — skipping preview-data seed
Server listening on [::] (dual-stack):8080

Environment:

PORT=8080
Dockerfile-based deploy
Postgres is a separate Railway service in the same project
API connects to Postgres successfully during boot/migrations

Issue:

The container appears healthy and listening, but Railway public networking is not forwarding requests to it. This looks like a routing/service mesh issue rather than an application boot failure.

Can you investigate the public routing for this service/domain and confirm whether the service mesh registration is stuck or misconfigured?

oronico

PROOP

8 days ago

We need Railway support to investigate public routing for a new service.

Project/service:

Schoolstack_Budget

Generated domain:

schoolstackbudget-production.up.railway.app

We created a brand-new service and temporarily replaced the app with a minimal Node HTTP smoke server:

node -e "require('http').createServer((req,res)=>{console.log('[smoke-hit]',req.method,req.url);res.writeHead(200,{'content-type':'application/json'});res.end(JSON.stringify({ok:true,path:req.url,port:process.env.PORT||8080}))}).listen(process.env.PORT||8080,'0.0.0.0',()=>console.log('[smoke] listening on 0.0.0.0:'+(process.env.PORT||8080)))"
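For readability, here is a sketch of the same smoke server written out as a file instead of a one-liner. The names `onRequest` and `startSmokeServer` are illustrative and not part of the original command; the behaviour is intended to match it. The useful property is the `[smoke-hit]` log line: the handler logs every request it receives, so if that line never shows up in the deploy logs while the edge returns 502, the request did not reach this process.

```javascript
const http = require('http');

// Same port fallback as the one-liner: Railway's PORT if set, else 8080.
const PORT = process.env.PORT || 8080;

// Log every request that reaches the process, then answer 200 with a small JSON body.
function onRequest(req, res) {
  console.log('[smoke-hit]', req.method, req.url);
  res.writeHead(200, { 'content-type': 'application/json' });
  res.end(JSON.stringify({ ok: true, path: req.url, port: PORT }));
}

// Bind to all IPv4 interfaces, as in the original command.
function startSmokeServer() {
  return http.createServer(onRequest).listen(PORT, '0.0.0.0', () => {
    console.log('[smoke] listening on 0.0.0.0:' + PORT);
  });
}
```

To use it, add a call to `startSmokeServer()` at the end of the file and run it with `node` (for example as a hypothetical `smoke.js`).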

Deploy logs show:

[smoke] listening on 0.0.0.0:8080

But public requests still return Railway 502 / “Application failed to respond.” HTTP logs show requests at the edge, but they time out after 15s:

GET /health → 502, 15s
GET /api/ready → 502, 15s
GET / → 502, 15s

This proves the issue is not our Express app, database, migrations, R2, CORS, Netlify, or Postgres. Public routing is not reaching the listening process.

Please investigate target-port/public networking/service mesh routing for this service.
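For anyone who wants to reproduce the timings above from outside Railway, a small probe along these lines may help. It is a sketch, assuming Node 18 or later (for the global `fetch` and `AbortSignal.timeout`); the `BASE_URL` environment variable, the 20-second client timeout, and the function names are assumptions made for illustration, with the default domain taken from this post.

```javascript
// Base URL to probe; defaults to the generated domain mentioned in this post.
const BASE_URL = process.env.BASE_URL || 'https://schoolstackbudget-production.up.railway.app';
const PATHS = ['/health', '/api/ready', '/'];

// Format one probe result as a single log line.
function formatResult(path, status, elapsedMs) {
  const seconds = (elapsedMs / 1000).toFixed(1);
  return `GET ${path} -> ${status}, ${seconds}s`;
}

// Request one path, giving up after timeoutMs; returns the status and elapsed time.
async function probe(path, timeoutMs = 20000) {
  const started = Date.now();
  try {
    const res = await fetch(BASE_URL + path, { signal: AbortSignal.timeout(timeoutMs) });
    return { path, status: res.status, elapsedMs: Date.now() - started };
  } catch (err) {
    // Network errors and client-side timeouts land here; report the error name.
    return { path, status: err.name, elapsedMs: Date.now() - started };
  }
}

// Probe each path in turn and print one line per result.
async function main() {
  for (const path of PATHS) {
    const r = await probe(path);
    console.log(formatResult(r.path, r.status, r.elapsedMs));
  }
}
```

Calling `main()` at the end of the file runs the three requests sequentially. The client timeout is set above 15 seconds on purpose, so that a 502 returned by the edge after its own timeout is recorded as a status code and not as a client-side abort.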
