Restart replicas when they become unhealthy

w1nt3r-eth

PRO, OP

2 years ago

Here's a common situation with my service: I run 2 replicas, and one of them becomes unresponsive (GC lock, race condition — who knows!)

I'd expect Railway proxy to detect that, and stop directing traffic to it. Ideally even restart the replica so it becomes healthy again.

Reading [the docs](https://docs.railway.app/guides/healthchecks-and-restarts), it's stated loud and clear that the healthcheck endpoint is only used for zero-downtime deployments, and Kuma is recommended for monitoring. This is not enough for me, though, because what I want is to restart a replica that becomes unhealthy.

How can I do this with Railway?

Solved

10 Replies

melissa

HOBBY

2 years ago

Hey! If a replica becomes unhealthy, it should behave the way you expect - it should no longer receive traffic, and depending on your restart policy and how it is failing, it will be restarted. If that's not been your experience, can you point me to some logs or provide more detail around your observations?

Status changed to Awaiting User Response Railway • over 1 year ago

w1nt3r-eth

PRO, OP

2 years ago

Unfortunately I don't have the logs. I suspect that the process got deadlocked, but didn't crash. Will Railway restart the process in this case?

Status changed to Awaiting Railway Response Railway • over 1 year ago

melissa

HOBBY

2 years ago

Hmm no unfortunately not, the app itself would have to crash or the main process otherwise exit for the restart policy to kick in. This may be a case for better error handling, which I know can be tricky if it was simply hung up and never crashed. Do you know where it got stuck at least?
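Since the restart policy only reacts to the process exiting, one userland option is an in-process watchdog that forces an exit when the app stops making progress. The following is a minimal sketch, not a Railway feature: the `Watchdog` class, its `heartbeat()` method, and the timeout values are hypothetical names chosen for illustration. It assumes the app can call `heartbeat()` from the code path that is expected to stay live (for example, once per handled request or event-loop tick).

```python
import os
import threading
import time


class Watchdog:
    """Calls on_stall() if heartbeat() is not called within `timeout` seconds."""

    def __init__(self, timeout, on_stall=None, poll_interval=0.05):
        self.timeout = timeout
        # Default action: hard-exit with a non-zero code so the restart
        # policy sees a failed process. os._exit skips normal cleanup,
        # which is deliberate because the process may be deadlocked.
        self.on_stall = on_stall or (lambda: os._exit(1))
        self.poll_interval = poll_interval
        self._last_beat = time.monotonic()
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()

    def heartbeat(self):
        # Called by the application to signal that it is still making progress.
        with self._lock:
            self._last_beat = time.monotonic()

    def _run(self):
        # Event.wait returns False on timeout, True once stop() is called.
        while not self._stop.wait(self.poll_interval):
            with self._lock:
                stalled_for = time.monotonic() - self._last_beat
            if stalled_for > self.timeout:
                self.on_stall()
                return
```

Typical usage would be `Watchdog(timeout=30).start()` at startup, with `heartbeat()` called wherever the app proves liveness. One caveat: the watchdog runs as a thread inside the same process, so it cannot fire if the hang also blocks that thread (for example, native code holding the interpreter lock).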

Status changed to Awaiting User Response Railway • over 1 year ago

w1nt3r-eth

PRO, OP

2 years ago

Well, it's hard to know. Even if I did know and fixed that particular issue, I still wouldn't feel comfortable, since things like this can happen again in the future. In any other load balancer setup, if the upstream stops responding, the LB usually marks it as unhealthy and stops directing traffic to it. Railway has the advantage of running both the load balancer and the scheduler, so in theory it should be able to restart a replica when needed.

I could try implementing this in userland (have a separate service that checks each replica's health), but there's no way for me to kill the process that becomes unresponsive. Do you have any tips?
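For reference, the detection half of such a userland checker could look roughly like the sketch below. It only reports which replicas fail a probe; it does not restart anything, which is exactly the gap described above. The function names and the `/health` path are hypothetical, and the sketch assumes each replica is individually addressable by URL, which this thread does not confirm for Railway.

```python
import http.client
import urllib.error
import urllib.request


def http_probe(url, timeout=2.0):
    """Return True if GET `url` answers with a 2xx status within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers refused connections, DNS failures, timeouts, and
        # 4xx/5xx responses (urlopen raises HTTPError for those).
        return False


def find_unhealthy(replica_urls, probe=http_probe):
    """Return the subset of `replica_urls` whose probe fails."""
    return [url for url in replica_urls if not probe(url)]
```

A monitoring service could call `find_unhealthy([...])` on a timer and alert on a non-empty result. The `probe` parameter is injectable so the logic can be exercised without network access.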

Status changed to Awaiting Railway Response Railway • over 1 year ago

melissa

HOBBY

2 years ago

I hear that, but how would we know if the replica was unhealthy or simply taking a long time to process the request? Unless the app somehow indicates that it's unhealthy, then we're just guessing, which could potentially be a worse experience if we restart things because we think there is something wrong when there isn't.

To be clear, in your case, the replica was never throwing an error, but simply not responding to requests?

I can see us possibly adding controls around this situation, like implementing backpressure config or something. What are your thoughts?

I can log a feature request ticket, or you can start a Feedback thread and let others chime in. If a bunch of people have experienced the same issue as you, we'll prioritize it faster!

Status changed to Awaiting User Response Railway • over 1 year ago

w1nt3r-eth

PRO, OP

2 years ago

> how would we know if the replica was unhealthy or simply taking a long time to process the request?

That's where the /health endpoint would be helpful!

I assume the Railway proxy does have a timeout and won't wait forever for the app to respond. It'd be super useful to have a config for that timeout, and an option to mark the node as unhealthy if most requests result in a timeout.
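To make the "most requests time out" rule concrete, here is a small sketch of passive health tracking over a sliding window of recent requests. It illustrates the general idea only and is not how Railway's proxy works; `PassiveHealth`, `window`, and `max_timeout_ratio` are hypothetical names and defaults.

```python
from collections import deque


class PassiveHealth:
    """Tracks recent request outcomes for one upstream replica."""

    def __init__(self, window=10, max_timeout_ratio=0.5):
        # Only the last `window` outcomes are kept; older ones fall off.
        self.results = deque(maxlen=window)
        self.max_timeout_ratio = max_timeout_ratio

    def record(self, timed_out):
        """Record one proxied request: True if it hit the proxy timeout."""
        self.results.append(bool(timed_out))

    def is_healthy(self):
        # Until the window is full there is too little data to judge,
        # so the replica is treated as healthy.
        if len(self.results) < self.results.maxlen:
            return True
        ratio = sum(self.results) / len(self.results)
        return ratio <= self.max_timeout_ratio
```

With the defaults, a replica is marked unhealthy once more than half of its last 10 requests timed out, and it recovers on its own as successful requests push the timeouts out of the window.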

Here's how HAProxy and nginx can handle healthchecks.

Status changed to Awaiting Railway Response Railway • over 1 year ago

melissa

HOBBY

2 years ago

Ah! Check out this feedback thread: https://help.railway.app/feedback/liveliness-probes-215e0bdc

Jake created that 3 months ago, you should definitely drop a comment in there.

Status changed to Awaiting User Response Railway • over 1 year ago

create-app-ai

PRO

2 years ago

+1, would love to have an option to auto-restart unhealthy instances (and route only to healthy instances). This is needed to avoid downtime.

Status changed to Solved nico • over 1 year ago

Status changed to Awaiting Railway Response w1nt3r-eth • over 1 year ago

w1nt3r-eth

PRO, OP

2 years ago

Hey @melissa, I see you marked the thread as solved, but the issue is still there and is very hard to work around, since there's also no API to restart the app.

brody

EMPLOYEE

2 years ago

Thread has been solved because the question has been answered: we do not support restarting individual replicas at this time.

Though, as Melissa has said, we do not route traffic to unhealthy replicas.

I see you have commented on the feature request post Melissa linked and that's great, it lets us know what features our users want!

Going to close this thread out so future readers can comment on the linked feature request post -

https://help.railway.app/feedback/liveliness-probes-215e0bdc

Side note: you can always roll your own solution with Caddy. Caddy plays very nicely (compared to HAProxy / nginx) with how we do replicas and private networking. If you are interested in that, please open a new thread.

Status changed to Awaiting User Response Railway • over 1 year ago
