Configurable Monitoring Alerts in Observability Panel (Beta)

chandrika

EMPLOYEE (OP)

7 months ago

We’ve added configurable monitoring alerts to the Observability panel in the dashboard.

To start, monitors will send you an email when a threshold is reached.

You can configure alerts to fire when a metric goes above or below a specified threshold for:

CPU
RAM
Disk usage
Network egress

Monitors are easy to set up; the attached screenshot shows what one looks like.

Since you can set alerts to trigger above or below a threshold, there are many possible use cases. For example, you might set an alert above a RAM threshold to avoid surprise bills from a memory leak, or an alert below a CPU threshold to detect that your application has crashed.
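To make the above/below semantics concrete, here is a minimal sketch of how a monitor might be modelled. The `Monitor` class, the `Direction` enum, the metric names, the units, and the strict `>`/`<` comparisons are all hypothetical illustrations, not Railway's actual configuration format or evaluation logic.

```python
from dataclasses import dataclass
from enum import Enum


class Direction(Enum):
    ABOVE = "above"
    BELOW = "below"


@dataclass
class Monitor:
    # Hypothetical monitor model: which metric, which side of the limit, and the limit itself.
    metric: str          # e.g. "cpu", "ram", "disk", "network_egress" (illustrative names)
    direction: Direction
    threshold: float

    def is_breached(self, value: float) -> bool:
        """Return True when a single sample crosses the threshold in the configured direction."""
        if self.direction is Direction.ABOVE:
            return value > self.threshold
        return value < self.threshold


# The two use cases from the post: a memory leak (RAM climbing) and a crash (CPU dropping to ~0).
ram_leak = Monitor(metric="ram", direction=Direction.ABOVE, threshold=2048.0)  # MB, illustrative
cpu_crash = Monitor(metric="cpu", direction=Direction.BELOW, threshold=0.05)   # vCPU, illustrative

print(ram_leak.is_breached(3100.0))  # True: RAM is above 2048 MB
print(cpu_crash.is_breached(0.01))   # True: CPU is below 0.05 vCPU
print(cpu_crash.is_breached(0.40))   # False: the app is still doing work
```

In this sketch, a breach is what would trigger the email notification; whether a single sample is enough to trigger it is exactly the kind of detail the post does not specify.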

If you don’t know where to start, try setting a monitor on Network Egress, which is by far the most common cause of unintended cost overruns. (Remember to always use Private Networking, people!)

We hope you like the direction we’re heading. It makes us sad every time we read about another victim of some obscure pricing accident on a hyperscale cloud. We’re trying to do the opposite of what they do: make it extraordinarily easy to control your costs, in the belief that if we do right by you, we’ll win your business over time.

Enjoy!

Attachments

Screenshot%...


4 Replies

Status changed to Completed by chandrika • 7 months ago

isaac-hinman

PRO

7 months ago

This is a great start, but what we really need is alerting based on log output queries. Can you confirm whether that will ever be supported? As suggested in the past, if the Railway team doesn't want to support that kind of alerting, I think log drains should at least be implemented.
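The monitors in the announcement cover CPU, RAM, disk usage, and network egress, not logs. As a rough illustration of what "alerting based on log output queries" means, the following sketch fires when more than a given number of log lines matching a pattern appear within a time window. The `LogLine` type, the `log_query_alert` function, and its parameters are hypothetical and made up for this example; they are not a Railway API.

```python
import re
from dataclasses import dataclass
from typing import Iterable


@dataclass
class LogLine:
    timestamp: float  # seconds since epoch (hypothetical field)
    message: str


def log_query_alert(lines: Iterable[LogLine], pattern: str, window_s: float,
                    max_matches: int, now: float) -> bool:
    """Return True if more than max_matches lines matching `pattern`
    were logged in the last window_s seconds before `now`."""
    regex = re.compile(pattern)
    # Keep only lines whose timestamp falls inside [now - window_s, now].
    recent = [line for line in lines if now - window_s <= line.timestamp <= now]
    matches = sum(1 for line in recent if regex.search(line.message))
    return matches > max_matches


lines = [
    LogLine(50.0, "ERROR db timeout"),   # too old for a 60 s window ending at t=130
    LogLine(100.0, "ERROR db timeout"),
    LogLine(110.0, "INFO request ok"),
    LogLine(115.0, "ERROR db timeout"),
    LogLine(120.0, "ERROR db timeout"),
]
print(log_query_alert(lines, r"ERROR", window_s=60, max_matches=2, now=130.0))  # True: 3 errors in the window
```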

joeypedicini92

PRO

7 months ago

Absolutely need log drains/log-based alerting.


brody

EMPLOYEE

7 months ago

Could not agree more. Until we have that natively, locomotive plus Datadog, Axiom, BetterStack, etc. is an excellent stand-in.

https://railway.com/deploy/locomotive
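Broadly, a log drain forwards a service's log lines to an external provider, which can then run the log-based alerting. The sketch below is a hypothetical, minimal batching forwarder in which the `sink` callback stands in for whatever would actually ship a batch to Datadog, Axiom, BetterStack, or similar. It is not how locomotive is implemented, and the class name and `batch_size` parameter are illustrative only.

```python
from typing import Callable, List


class BatchingLogDrain:
    """Hypothetical log forwarder: buffers lines and hands them to a sink in batches."""

    def __init__(self, sink: Callable[[List[str]], None], batch_size: int = 100):
        self.sink = sink              # stand-in for an HTTP call to a log provider
        self.batch_size = batch_size
        self.buffer: List[str] = []

    def write(self, line: str) -> None:
        # Buffer each line and ship a batch once it is full.
        self.buffer.append(line)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Ship whatever is buffered (possibly a partial batch) and start a fresh buffer.
        if self.buffer:
            self.sink(self.buffer)
            self.buffer = []


# Usage: collect batches in a list instead of sending them anywhere.
batches: List[List[str]] = []
drain = BatchingLogDrain(batches.append, batch_size=2)
for line in ["boot", "listening on :8080", "GET /", "ERROR db timeout", "GET /health"]:
    drain.write(line)
drain.flush()  # ship the final partial batch
print(batches)
```

Batching trades a little latency for far fewer outbound requests; a real forwarder would also flush on a timer so that quiet services still ship their logs.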

jjao

PRO

5 months ago

Hi there! Thank you for this. I've been testing it out, and it seems to work well as general alerting when the threshold is reached. I have a few questions about the current implementation and what's currently possible, as I couldn't find definitive answers in the documentation or from my testing:

* Is the current alerting (1) a single-breach alert, i.e. it triggers as soon as the threshold is surpassed at any time, or (2) does the default behaviour do some sort of smart "windowed aggregation" to reduce noise (e.g. taking the average over a minute and checking whether it surpasses the threshold, sort of like what's outlined here: https://docs.datadoghq.com/monitors/types/metric/)? From my testing, it seems like it might be (2), but I'd appreciate any clarification here.

* If it does (2), how big is the aggregation window?

* Is it possible to tune the monitoring alert so that it only sends the notification for sustained durations? E.g. only trigger the alert if CPU usage stays above the threshold for over 5 minutes? I looked at the roadmap for alerting but didn't see anything on this.
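The three behaviours asked about can be contrasted with a small sketch. It assumes one metric sample per minute and a strict "greater than" comparison; the function names, window size, and sample data are hypothetical, and nothing in this thread confirms which (if any) of these strategies Railway's monitors actually use.

```python
from statistics import mean
from typing import Sequence


def single_breach(samples: Sequence[float], threshold: float) -> bool:
    """(1) Fire as soon as any single sample exceeds the threshold."""
    return any(s > threshold for s in samples)


def windowed_average(samples: Sequence[float], threshold: float, window: int) -> bool:
    """(2) Fire if the mean of any `window` consecutive samples exceeds the threshold."""
    return any(mean(samples[i:i + window]) > threshold
               for i in range(len(samples) - window + 1))


def sustained(samples: Sequence[float], threshold: float, min_consecutive: int) -> bool:
    """Fire only if the threshold is exceeded for `min_consecutive` samples in a row."""
    run = 0
    for s in samples:
        run = run + 1 if s > threshold else 0  # reset the streak on any sample at or below the threshold
        if run >= min_consecutive:
            return True
    return False


# Hypothetical CPU usage (%), one sample per minute, threshold 80%.
spiky = [20, 95, 20, 20, 20]                  # one brief spike
sustained_load = [20, 85, 90, 88, 92, 91, 30]  # five minutes above 80%

print(single_breach(spiky, 80), windowed_average(spiky, 80, 3), sustained(spiky, 80, 5))
print(windowed_average(sustained_load, 80, 3), sustained(sustained_load, 80, 5))
```

With one-minute samples, `sustained(samples, threshold, 5)` corresponds to the "above the threshold for over 5 minutes" case: a brief spike trips `single_breach` but neither of the other two, which is the noise reduction the question is after.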

My apologies if any of these are already in the docs and I just wasn't able to find them.