I have many servers monitored with Prometheus, and every host exposes the same metrics.
I need an alert rule that fires when a specific metric (such as some_metrics) has been missing on a specific host for 5m.
I looked at absent and absent_over_time, but these functions do not return the labels of the missing metric, such as ip or hostname.
I should also mention that I don't want to create a separate rule for each host.
I have searched for this but haven't found a solution.
Is there a way to do this?
Recommended answer
In order to get the labels, you need a metric that carries all the labels you want. Usually, a good choice is up, which also distinguishes between a missing metric and an unreachable target.
The rule alerts if up (for a job) is 1, and the UNLESS binary operator suppresses the alert if the metric is present on the instance:
- alert: MissingMetricInFooTarget
  expr: up{job="foo"} == 1 UNLESS ON(instance) some_metrics{job="foo"}
  for: 5m
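As a rough sketch, the rule above could sit in a complete rules file like the one below. The group name, the severity label, and the annotation text are illustrative assumptions, not part of the original answer; the for: 5m clause is taken from the 5m requirement in the question. Because the left-hand side of UNLESS is up, the alert carries the labels of up (such as instance and job), which can then be used in annotations.

```yaml
# Hypothetical rules file; group name, severity, and summary text are assumptions.
groups:
  - name: missing-metrics
    rules:
      - alert: MissingMetricInFooTarget
        # Targets of job "foo" that are up, minus those instances
        # that currently expose some_metrics.
        expr: up{job="foo"} == 1 UNLESS ON(instance) some_metrics{job="foo"}
        # Fire only once the condition has held for 5 minutes.
        for: 5m
        labels:
          severity: warning
        annotations:
          # The labels come from the up series, so the affected host is identified.
          summary: 'some_metrics is missing on {{ $labels.instance }}'
```

A file like this can be validated before loading with promtool check rules followed by the file name. One rule covers every host in the job, since the expression returns one series per instance that is up but lacks the metric.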