4 months ago
We’ve added configurable monitoring alerts to the Observability panel in the dashboard.
To start, monitors will send you an email when a threshold is reached.
These alerting thresholds are configurable above or below specified limits for:
CPU
RAM
Disk usage
Network egress
Monitors are very easy to set up. They look like this.
Since you can set alerts to trigger above or below thresholds, there are a lot of different use cases. You might set alerts above a threshold on RAM to avoid surprise bills from a memory leak. Or you might set alerts below a certain CPU threshold to detect that your application has crashed.
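To make the above/below distinction concrete, here is a minimal sketch in Python. It is not Railway's implementation or API: the `Monitor` class, its field names, the metric names, and the threshold values are all hypothetical, chosen only to illustrate the two use cases just described.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class Monitor:
    """Hypothetical monitor definition; names are illustrative only."""
    metric: str
    direction: Literal["above", "below"]
    threshold: float

    def breached(self, value: float) -> bool:
        # "above" fires when the value exceeds the limit,
        # "below" fires when the value drops under it.
        if self.direction == "above":
            return value > self.threshold
        return value < self.threshold


# Assumed example values: alert on a possible memory leak (RAM above 4 GB)
# and on a possible crash (CPU below 0.01 vCPU).
ram_leak = Monitor(metric="ram_gb", direction="above", threshold=4.0)
cpu_crash = Monitor(metric="cpu_vcpu", direction="below", threshold=0.01)

print(ram_leak.breached(4.5))   # RAM is over the limit
print(cpu_crash.breached(0.0))  # CPU has dropped to zero
```

The same structure would cover disk usage and network egress by swapping the metric name and limit.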
If you don’t know where to start, try setting a monitor on Network Egress, which is by far the most common cause of unintended cost overruns. (Remember to always use Private Networking, people!)
We hope you like the direction we’re heading here. It makes us sad every time we read about another victim of some obscure pricing accident when using a hyperscale cloud. We’re trying to do the opposite of what they do and make it extraordinarily easy to control your costs, in the belief that if we do right by you, we’ll win your business over time.
Enjoy!
4 Replies
Status changed to Completed chandrika • 4 months ago
4 months ago
This is a great start, but what we really need is alerting based on log output queries. Are you able to confirm whether that will ever be supported? As suggested in the past, if the Railway team doesn't want to support that kind of alerting, I think log drains should at least be supported.
4 months ago
Absolutely need log drains/log based alerting
4 months ago
Could not agree more, so until we have that natively, locomotive + Datadog, Axiom, BetterStack, etc. is an excellent stand-in.
a month ago
Hi there! Thank you for this. I've been testing it out, and it seems to work well as general alerting when the threshold is reached. I have a few questions about the current implementation and what's currently possible, as I couldn't find definitive answers in the documentation or from my testing:
* Is the current alerting (1) a single-breach alert, i.e. the alert triggers as soon as the threshold is surpassed at any time, or (2) does the default behaviour apply some sort of smart "windowed aggregation" to reduce noise (e.g. taking the average over a minute and checking whether it surpasses the threshold, along the lines of what's outlined here: https://docs.datadoghq.com/monitors/types/metric/)? From my testing, it seems like it might be (2), but I would appreciate any clarification here.
* If it does (2), how big is the aggregation window?
* Is it possible to tune the monitoring alert so that it only sends a notification for sustained breaches? E.g. trigger the alert only if CPU usage stays above the threshold for over 5 minutes? I tried looking at the roadmap for alerting, but didn't see anything on this.
My apologies if any of these are in the docs already and I just wasn't able to find them.
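For anyone comparing the three behaviours asked about above, here is a rough Python sketch of how they differ. It makes no claim about how Railway actually evaluates monitors. The function names, the sample values, and the assumption of one metric sample per fixed interval are all hypothetical.

```python
from statistics import mean
from typing import Sequence


def single_breach(samples: Sequence[float], threshold: float) -> bool:
    """Fire if any single sample exceeds the threshold."""
    return any(v > threshold for v in samples)


def windowed_average(samples: Sequence[float], threshold: float, window: int) -> bool:
    """Fire if the mean of any `window` consecutive samples exceeds the threshold."""
    if window <= 0 or len(samples) < window:
        return False
    return any(
        mean(samples[i:i + window]) > threshold
        for i in range(len(samples) - window + 1)
    )


def sustained(samples: Sequence[float], threshold: float, duration: int) -> bool:
    """Fire only if `duration` consecutive samples all exceed the threshold."""
    run = 0
    for v in samples:
        run = run + 1 if v > threshold else 0
        if run >= duration:
            return True
    return False


# Assumed CPU percentages, one sample per interval.
spike = [10, 95, 12, 11, 10]    # one brief spike
steady = [90, 92, 91, 93, 94]   # continuously high

print(single_breach(spike, 80), windowed_average(spike, 80, 5), sustained(spike, 80, 3))
print(single_breach(steady, 80), windowed_average(steady, 80, 5), sustained(steady, 80, 3))
```

With these sample values, the brief spike trips only the single-breach check, while the continuously high series trips all three. That is why windowed or sustained evaluation tends to produce less noisy alerts.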
