Increasing delay between health checks during deployment

federicoz123
PRO

a year ago

We have the following railway.json file:

{
  "$schema": "https://railway.app/railway.schema.json",
  "build": {
    "builder": "NIXPACKS",
    "buildCommand": "npx turbo build --filter=queue --force",
    "watchPatterns": ["apps/queue/**", "packages/**"]
  },
  "deploy": {
    "numReplicas": 1,
    "startCommand": "turbo run start --filter=queue",
    "healthcheckPath": "/health",
    "healthcheckTimeout": 90,
    "restartPolicyType": "ON_FAILURE",
    "restartPolicyMaxRetries": 5
  }
}

and our deployments + healthchecks usually look like:

====================
Jun 14 16:19:18  Starting Healthcheck
Jun 14 16:19:18  ====================
Jun 14 16:19:18  Path: /health
Jun 14 16:19:18  Retry window: 1m30s
Jun 14 16:19:18  Attempt #1 failed with service unavailable. Continuing to retry for 1m29s
Jun 14 16:19:20  Attempt #2 failed with service unavailable. Continuing to retry for 1m28s
Jun 14 16:19:22  Attempt #3 failed with service unavailable. Continuing to retry for 1m26s
Jun 14 16:19:26  Attempt #4 failed with service unavailable. Continuing to retry for 1m22s
Jun 14 16:19:34  Attempt #5 failed with service unavailable. Continuing to retry for 1m14s
Jun 14 16:19:50  [1/1] Healthcheck succeeded!

It only succeeds after the 5th failed attempt, since it takes some time for the /health endpoint to come up after building. We have 5 retries and 90 secs. As you can see, there is only a 1-2 second delay between the first few attempts, but we would like to add some kind of delay here. If not, we'll always burn through all of our retries at deployment (and they won't actually help during runtime). If we increase the timeout to 300 seconds, we'll still burn all 5 retries, since the first four will use no more than 10 secs and the 290 remaining secs will be used by the last retry. Same thing if we increase to something like 10 retries: it'll only work on the 10th one.
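For reference, the spacing between attempts can be read straight off the "Continuing to retry for ..." values in the log above. The sketch below only does that arithmetic; the helper name `remaining_seconds` is made up for illustration, and nothing here reflects a documented Railway schedule.

```python
import re

# Lines copied from the deployment log above.
LOG_LINES = [
    "Attempt #1 failed with service unavailable. Continuing to retry for 1m29s",
    "Attempt #2 failed with service unavailable. Continuing to retry for 1m28s",
    "Attempt #3 failed with service unavailable. Continuing to retry for 1m26s",
    "Attempt #4 failed with service unavailable. Continuing to retry for 1m22s",
    "Attempt #5 failed with service unavailable. Continuing to retry for 1m14s",
]


def remaining_seconds(line: str) -> int:
    """Parse the trailing duration (e.g. '1m29s' or '45s') into seconds."""
    match = re.search(r"(?:(\d+)m)?(\d+)s$", line)
    if match is None:
        raise ValueError(f"no duration found in: {line!r}")
    minutes = int(match.group(1) or 0)
    seconds = int(match.group(2))
    return minutes * 60 + seconds


# Time left in the retry window after each failed attempt.
remaining = [remaining_seconds(line) for line in LOG_LINES]

# Seconds that elapsed between consecutive failed attempts.
gaps = [earlier - later for earlier, later in zip(remaining, remaining[1:])]

print(remaining)  # [89, 88, 86, 82, 74]
print(gaps)       # [1, 2, 4, 8]
```

In this particular log the gap doubles each time (1, 2, 4, 8 seconds, then about 16 seconds before the successful check), so the attempts are not evenly spaced 1-2 seconds apart.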

In the ideal solution, we could configure Railway to either wait 1 minute after the build is done before starting the health checks OR add some kind of delay between each health check (something like 20 secs). I couldn't find anything that could do this in your docs, but I'm assuming we're not the first to run into an issue like this. Maybe this is something that we can add to our package.json, but I'm not sure (a delay in the start script?).

Solved

11 Replies

a year ago

I think there may be a misunderstanding here. There is no such thing as burning through the retries: the health check runs in a loop until it gets a healthy response or the health check timeout is reached, and the number of health check attempts has no impact on the number of restart attempts available during runtime.


federicoz123
PRO

a year ago

I guess I meant attempts. Is there any way to make Railway wait in between the health check attempts instead of just trying again instantly?


a year ago

There isn't, but with the information provided, I cannot see a reason why that would be needed.

There is no attempt limit for health checks. It's purely a loop that runs until the timer runs out or the health check succeeds, and it has nothing to do with the restart policy service setting.
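The loop described above can be modeled roughly as follows. This is a minimal sketch, not Railway's implementation: the function name `wait_until_healthy`, the `FakeClock` helper, and the doubling delay (inferred only from the timestamps in the log earlier in this thread) are all assumptions made for illustration.

```python
import time
from typing import Callable, Tuple


def wait_until_healthy(
    probe: Callable[[], bool],
    timeout_s: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    first_delay_s: float = 1.0,
) -> Tuple[bool, int]:
    """Probe until healthy or until the retry window closes.

    Returns (healthy, attempts). There is no attempt limit: only the
    timeout ends the loop.
    """
    deadline = clock() + timeout_s
    delay = first_delay_s
    attempts = 0
    while clock() < deadline:
        attempts += 1
        if probe():
            return True, attempts
        remaining = deadline - clock()
        if remaining <= 0:
            break
        # Never sleep past the end of the retry window.
        sleep(min(delay, remaining))
        # Assumption: the delay doubles, as the log timestamps suggest.
        delay *= 2
    return False, attempts


class FakeClock:
    """Simulated time, so the demo runs instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


# Demo: a hypothetical app whose /health endpoint comes up 30 s after start.
fake = FakeClock()
result = wait_until_healthy(
    probe=lambda: fake.time() >= 30,
    timeout_s=90,
    clock=fake.time,
    sleep=fake.sleep,
)
print(result)  # (True, 6): five failed attempts, then a success
```

Under these assumptions, a slow-starting app just produces a few more failed attempts inside the window. Nothing is restarted and no retry budget is consumed.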


federicoz123
PRO

a year ago

It seems like our turbo deployments/scripts need some time after the build completes before the health endpoint is ready. We'd like to give them time after building before the health checks start. I'm going to try adding a 60 second sleep to the start script in package.json to see if that helps, I guess.



a year ago

That's only going to make more health checks fail. I honestly can't see any reason to delay the start of your app or to add a delay between health checks. There is no added cost to having the health check run until a healthy response is received.


federicoz123
PRO

a year ago

Because I am using all 5 "restarts" at deployment, but I want them for runtime in case it crashes.


a year ago

As I've previously mentioned, the health checks do not have anything to do with the restart policy settings.

Health checks are also not restarting anything.


federicoz123
PRO

a year ago

OHHHH I understand now! Sorry, Brody - appreciate your help! This brings up another question: why aren't we getting restarts triggered when our app crashes (if we aren't actually burning through the 5 retries)? Do you have a quick TL;DR on the difference between Always and On-Failure for the type? Not seeing anything in the docs.


federicoz123
PRO

a year ago

Just kidding - found the answer in the Kubernetes docs - assuming it's the same here.


a year ago

Railway does not use Kubernetes.

Always should restart your app regardless of the exit code it exited with. On-Failure should restart your app only if it exits with an error code.

Please note that despite the "number of retries" option disappearing when you select Always, it will still only restart up to that set number of retries.

It's quite possible Railway has already restarted your app enough times for you to run out of retries.
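The restart rules as described in this reply can be written down as a small decision function. This is only a model of the description above, not Railway's code: `should_restart` is a hypothetical helper, "ON_FAILURE" is the value from the railway.json in the original post, and the "ALWAYS" spelling is assumed by analogy.

```python
def should_restart(
    policy: str,
    exit_code: int,
    restarts_so_far: int,
    max_retries: int,
) -> bool:
    """Decide whether a crashed or exited app gets restarted.

    Models the reply above: both policies stop once the configured
    number of retries has been used up.
    """
    if restarts_so_far >= max_retries:
        # Retry budget exhausted, regardless of policy.
        return False
    if policy == "ALWAYS":
        # Restart no matter which exit code the app returned.
        return True
    if policy == "ON_FAILURE":
        # Restart only when the app exited with an error code.
        return exit_code != 0
    raise ValueError(f"unknown restart policy: {policy!r}")


# With restartPolicyMaxRetries = 5, as in the railway.json above:
print(should_restart("ON_FAILURE", exit_code=1, restarts_so_far=0, max_retries=5))  # True
print(should_restart("ON_FAILURE", exit_code=0, restarts_so_far=0, max_retries=5))  # False
print(should_restart("ALWAYS", exit_code=0, restarts_so_far=4, max_retries=5))      # True
print(should_restart("ALWAYS", exit_code=1, restarts_so_far=5, max_retries=5))      # False
```

The last call illustrates the situation suggested above: once the app has already been restarted 5 times, no further restart happens under either policy.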


a year ago

There isn't a maximum number of retries with healthchecks. We will keep retrying for however long your healthcheck timeout is.


Status changed to Solved Railway about 1 year ago

