a year ago
We have the following railway.json file:
{
  "$schema": "https://railway.app/railway.schema.json",
  "build": {
    "builder": "NIXPACKS",
    "buildCommand": "npx turbo build --filter=queue --force",
    "watchPatterns": ["apps/queue/**", "packages/**"]
  },
  "deploy": {
    "numReplicas": 1,
    "startCommand": "turbo run start --filter=queue",
    "healthcheckPath": "/health",
    "healthcheckTimeout": 90,
    "restartPolicyType": "ON_FAILURE",
    "restartPolicyMaxRetries": 5
  }
}
and our deployments + healthchecks usually look like:
====================
Jun 14 16:19:18  Starting Healthcheck
Jun 14 16:19:18  ====================
Jun 14 16:19:18  Path: /health
Jun 14 16:19:18  Retry window: 1m30s
Jun 14 16:19:18  Attempt #1 failed with service unavailable. Continuing to retry for 1m29s
Jun 14 16:19:20  Attempt #2 failed with service unavailable. Continuing to retry for 1m28s
Jun 14 16:19:22  Attempt #3 failed with service unavailable. Continuing to retry for 1m26s
Jun 14 16:19:26  Attempt #4 failed with service unavailable. Continuing to retry for 1m22s
Jun 14 16:19:34  Attempt #5 failed with service unavailable. Continuing to retry for 1m14s
Jun 14 16:19:50  [1/1] Healthcheck succeeded!
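For readers following along, here is a minimal Python sketch that computes the gaps between the attempts in the log above. It assumes each timestamp belongs to the message that follows it, which matches the "Continuing to retry for ..." countdown. It only describes this one log; it is not a statement about how Railway schedules attempts in general.

```python
from datetime import datetime

# Timestamps and events copied from the healthcheck log above.
# Assumption: each timestamp belongs to the message that follows it.
EVENTS = [
    ("16:19:18", "Attempt #1 failed"),
    ("16:19:20", "Attempt #2 failed"),
    ("16:19:22", "Attempt #3 failed"),
    ("16:19:26", "Attempt #4 failed"),
    ("16:19:34", "Attempt #5 failed"),
    ("16:19:50", "Healthcheck succeeded"),
]


def gaps_in_seconds(events):
    """Return the number of seconds between each pair of consecutive events."""
    times = [datetime.strptime(stamp, "%H:%M:%S") for stamp, _ in events]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


gaps = gaps_in_seconds(EVENTS)
print(gaps)       # [2, 2, 4, 8, 16]
print(sum(gaps))  # 32 seconds from the first attempt to the healthy response
```

In this log the wait between attempts grows rather than staying fixed, and the healthy response arrives about 32 seconds into the 90 second window.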
It always works on the 5th attempt since it takes some time for the /health endpoint to come up after building. We have 5 retries and 90 secs. As you can see, there is only a 1-2 second delay between each attempt, but we would like to add some kind of delay here. If not, we'll always burn through all of our retries at deployment (and they won't actually help during runtime). If we increase the timeout to 300 seconds, we'll still burn all 5 retries, since the first four will use no more than 10 secs and the 290 remaining secs will be used by the last retry. Same thing if we increase to something like 10 retries: it'll only work on the 10th one.
In the ideal solution, we could configure Railway either to wait 1 minute after the build is done before starting the health checks OR to add some kind of delay between health checks (something like 20 secs). I couldn't find anything that could do this in your docs, but I'm assuming we're not the first to run into an issue like this. Maybe this is something that we can add to our package.json, but I'm not sure (a delay in the start script?)
11 Replies
a year ago
I think there may be a misunderstanding here. There is no such thing as burning through the retries: the health check runs in a loop until the health check timeout is reached or it gets a healthy response, and the number of health check attempts has no impact on the number of restart attempts during runtime.
a year ago
I guess I meant attempts. Is there any way to make railway wait in between the health check attempts instead of just trying again instantly?
a year ago
There isn't, but with the information provided, I cannot see a reason why that is needed.
There is no attempt limit for health checks. It runs purely in a loop until the timer runs out or the health check succeeds, and it has nothing to do with the restart policy service setting.
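The loop described above can be modeled with a small Python simulation. This is only a sketch under stated assumptions, not Railway's implementation: the function names are hypothetical, and the wait sequence (2, 2, 4, 8, 16, then continuing to double) is a guess extrapolated from the log earlier in the thread. Note that nothing in it reads restartPolicyMaxRetries; the only limit is the timeout.

```python
def observed_style_delays():
    """Yield waits shaped like the log in this thread: 2, 2, 4, 8, 16, ...

    Assumption: the continuation past 16 seconds (doubling) is a guess.
    """
    yield 2
    delay = 2
    while True:
        yield delay
        delay *= 2


def simulate_healthcheck(ready_after_s, timeout_s):
    """Simulate a deploy-time health check loop on a fake clock.

    ready_after_s: seconds until /health starts answering (hypothetical input).
    timeout_s: the healthcheckTimeout value, in seconds.
    Returns (outcome, attempts, elapsed_seconds).
    """
    elapsed = 0
    attempts = 0
    for wait in observed_style_delays():
        attempts += 1
        if elapsed >= ready_after_s:
            return ("healthy", attempts, elapsed)
        if elapsed + wait > timeout_s:
            # The next attempt would land after the window closes.
            return ("timed out", attempts, elapsed)
        elapsed += wait


print(simulate_healthcheck(ready_after_s=30, timeout_s=90))
print(simulate_healthcheck(ready_after_s=30, timeout_s=300))
print(simulate_healthcheck(ready_after_s=1000, timeout_s=90))
```

In this model, raising the timeout from 90 to 300 seconds does not change when the check passes; it only changes how long a deployment that never becomes healthy is given before it is marked as failed.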
a year ago
It seems like our turbo deployments/scripts need some time after the build completes before the health endpoint is ready. We'd like to give them time after building before trying the healthchecks. Going to try adding a 60 second sleep as part of the start script in package.json to see if that helps, I guess.
a year ago
That's only going to make more health checks fail. I honestly can't see any reason to delay the start of your app or to add a delay between health checks. There is no added cost to having the health check run until a healthy response is received.
a year ago
because I am using all 5 "restarts" at deployment but I want them for runtime in case it crashes
a year ago
As I've previously mentioned, the health checks do not have anything to do with the restart policy settings.
Health checks are also not restarting anything.
a year ago
OHHHH I understand now! Sorry, Brody - appreciate your help! This brings up another question: why aren't we getting restarts triggered when our app crashes (if we aren't actually burning through the 5 retries)? Do you have a quick TL;DR on the difference between Always and On-Failure for the type? Not seeing anything in the docs.
a year ago
just kidding - found answer on Kubernetes docs - assuming it's the same here.
a year ago
Railway does not use Kubernetes.
Always should restart your app regardless of the exit code it exited with. On-Failure should restart your app only if it exits with an error code.
Please note that despite the "number of retries" option disappearing when you select Always, it will still only restart up to that set number of retries.
It's quite possible Railway has already restarted your app enough times for you to run out of retries.
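The rules in this reply can be written down as a small Python sketch. It is only a model of the behavior described here, not Railway's code: the function name and parameters are hypothetical, "ON_FAILURE" is the value from the railway.json above, and "ALWAYS" is assumed by analogy.

```python
def should_restart(policy, exit_code, restarts_so_far, max_retries):
    """Decide whether a crashed app gets restarted, per the rules in this reply.

    policy: "ALWAYS" or "ON_FAILURE" ("ALWAYS" is an assumed spelling).
    exit_code: the exit code of the process; 0 means a clean exit.
    restarts_so_far: how many restarts have already been used.
    max_retries: the restartPolicyMaxRetries value.
    """
    # The retry budget applies to both policies, even though the option
    # is reportedly hidden in the UI when Always is selected.
    if restarts_so_far >= max_retries:
        return False
    if policy == "ALWAYS":
        return True
    if policy == "ON_FAILURE":
        return exit_code != 0
    return False


print(should_restart("ON_FAILURE", exit_code=1, restarts_so_far=0, max_retries=5))  # True
print(should_restart("ON_FAILURE", exit_code=0, restarts_so_far=0, max_retries=5))  # False
print(should_restart("ALWAYS", exit_code=0, restarts_so_far=4, max_retries=5))      # True
print(should_restart("ALWAYS", exit_code=1, restarts_so_far=5, max_retries=5))      # False
```

The last call is the situation suggested here: once the budget is used up, a crash no longer triggers a restart under either policy.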
a year ago
There isn't a maximum number of retries with healthchecks. We will keep retrying for however long your healthcheck timeout is
Status changed to Solved Railway • about 1 year ago