Prometheus Alertmanager not sending alerts (k8s)

Problem description

I'm using Prometheus Operator 0.34.0 and Alertmanager 0.20, and it doesn't work: I can see that the alert fires (in the Prometheus UI, on the Alerts tab), but no alert email ever arrives. Looking at the logs I see the following. Any idea? The "failed to join cluster" warning below may be the reason, but I'm not sure how to fix it.

This is the prometheus-operator Helm chart I use: github.com/helm/charts/tree/master/stable/prometheus-operator
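For context, values like the ones below are normally applied with helm upgrade --install. This is only a sketch: the release name "monitoring", the namespace "monitoring", and the local umbrella-chart path "./monitoring" are assumptions (the monitoring- prefix on the service names further down suggests the release is called monitoring).

# Hypothetical: apply the values file to the release. The release name, namespace,
# and local chart path (a chart that pulls in stable/prometheus-operator as a
# dependency) are assumptions, not taken from the question.
helm dependency update ./monitoring
helm upgrade --install monitoring ./monitoring \
  --namespace monitoring \
  -f values.yaml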

level=info ts=2019-12-23T15:42:28.039Z caller=main.go:231 msg="Starting Alertmanager" version="(version=0.20.0, branch=HEAD, revision=f74be0400a6243d10bb53812d6fa408ad71ff32d)"
level=info ts=2019-12-23T15:42:28.039Z caller=main.go:232 build_context="(go=go1.13.5, user=root@00c3106655f8, date=20191211-14:13:14)"
level=warn ts=2019-12-23T15:42:28.109Z caller=cluster.go:228 component=cluster msg="failed to join cluster" err="1 error occurred:\n\t* Failed to resolve alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc:9094: lookup alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc on 100.64.0.10:53: no such host\n\n"
level=info ts=2019-12-23T15:42:28.109Z caller=cluster.go:230 component=cluster msg="will retry joining cluster every 10s"
level=warn ts=2019-12-23T15:42:28.109Z caller=main.go:322 msg="unable to join gossip mesh" err="1 error occurred:\n\t* Failed to resolve alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc:9094: lookup alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc on 100.64.0.10:53: no such host\n\n"
level=info ts=2019-12-23T15:42:28.109Z caller=cluster.go:623 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2019-12-23T15:42:28.131Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/etc/alertmanager/config/alertmanager.yaml
level=info ts=2019-12-23T15:42:28.132Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config/alertmanager.yaml
level=info ts=2019-12-23T15:42:28.134Z caller=main.go:416 component=configuration msg="skipping creation of receiver not referenced by any route" receiver=AlertMail
level=info ts=2019-12-23T15:42:28.134Z caller=main.go:416 component=configuration msg="skipping creation of receiver not referenced by any route" receiver=AlertMail2
level=info ts=2019-12-23T15:42:28.135Z caller=main.go:497 msg=Listening address=:9093
level=info ts=2019-12-23T15:42:30.110Z caller=cluster.go:648 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.00011151s
level=info ts=2019-12-23T15:42:38.110Z caller=cluster.go:640 component=cluster msg="gossip settled; proceeding" elapsed=10.000659096s
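Note the two info lines saying "skipping creation of receiver not referenced by any route" for AlertMail and AlertMail2: those receivers are defined in the config but no route points at them, so they can never receive notifications. A minimal sketch of a route that does reference AlertMail, where the severity matcher is purely illustrative and not taken from the original config:

route:
  receiver: default-receiver
  routes:
    - receiver: AlertMail        # referenced here, so Alertmanager will create it
      match:
        severity: critical       # illustrative matcher, not from the original config
receivers:
  - name: default-receiver
  - name: AlertMail
    email_configs:
      - to: 'rayndoll007@gmail.com'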

This is my config YAML (the Helm values file):

global:
  imagePullSecrets: []
prometheus-operator:
  defaultRules:
  grafana:
    enabled: true
  prometheusOperator:
    tolerations:
      - key: "WorkGroup"
        operator: "Equal"
        value: "operator"
        effect: "NoSchedule"
      - key: "WorkGroup"
        operator: "Equal"
        value: "operator"
        effect: "NoExecute"
    tlsProxy:
      image:
        repository: squareup/ghostunnel
        tag: v1.4.1
        pullPolicy: IfNotPresent
      resources:
        limits:
          cpu: 8000m
          memory: 2000Mi
        requests:
          cpu: 2000m
          memory: 2000Mi
    admissionWebhooks:
      patch:
        priorityClassName: "operator-critical"
        image:
          repository: jettech/kube-webhook-certgen
          tag: v1.0.0
          pullPolicy: IfNotPresent
    serviceAccount:
      name: prometheus-operator
    image:
      repository: quay.io/coreos/prometheus-operator
      tag: v0.34.0
      pullPolicy: IfNotPresent
  prometheus:
    prometheusSpec:
      replicas: 1
      serviceMonitorSelector:
        role: observeable
      tolerations:
        - key: "WorkGroup"
          operator: "Equal"
          value: "operator"
          effect: "NoSchedule"
        - key: "WorkGroup"
          operator: "Equal"
          value: "operator"
          effect: "NoExecute"
      ruleSelector:
        matchLabels:
          role: alert-rules
          prometheus: prometheus
      image:
        repository: quay.io/prometheus/prometheus
        tag: v2.13.1
  alertmanager:
    alertmanagerSpec:
      image:
        repository: quay.io/prometheus/alertmanager
        tag: v0.20.0
      resources:
        limits:
          cpu: 500m
          memory: 1000Mi
        requests:
          cpu: 500m
          memory: 1000Mi
    serviceAccount:
      name: prometheus
    config:
      global:
        resolve_timeout: 1m
        smtp_smarthost: 'smtp.gmail.com:587'
        smtp_from: 'alertmanager@vsx'
        smtp_auth_username: 'ds.monitoring.grafana@gmail.com'
        smtp_auth_password: 'mypass'
        smtp_require_tls: false
      route:
        group_by: ['alertname', 'cluster']
        group_wait: 45s
        group_interval: 5m
        repeat_interval: 1h
        receiver: default-receiver
        routes:
          - receiver: str
            match_re:
              cluster: "canary|canary2"
      receivers:
        - name: default-receiver
        - name: str
          email_configs:
            - to: 'rayndoll007@gmail.com'
              from: alertmanager@vsx
              smarthost: smtp.gmail.com:587
              auth_identity: ds.monitoring.grafana@gmail.com
              auth_username: ds.monitoring.grafana@gmail.com
              auth_password: mypass
        - name: 'AlertMail'
          email_configs:
            - to: 'rayndoll007@gmail.com'

Full YAML: codebeautify.org/yaml-validator/cb6a2781
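One way to sanity-check the rendered config that the operator mounts into the pod (the file path is visible in the startup log above) is to run amtool inside the Alertmanager container. The namespace "monitoring" and the container name "alertmanager" are assumptions based on the operator's defaults:

# Validate the rendered Alertmanager config from inside the pod
# (namespace and container name are assumed; the file path comes from the log above).
kubectl exec -n monitoring alertmanager-monitoring-prometheus-oper-alertmanager-0 \
  -c alertmanager -- amtool check-config /etc/alertmanager/config/alertmanager.yaml

# Show the routing tree to confirm which receiver a given alert would reach.
kubectl exec -n monitoring alertmanager-monitoring-prometheus-oper-alertmanager-0 \
  -c alertmanager -- amtool config routes --config.file=/etc/alertmanager/config/alertmanager.yaml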

The error says the lookup failed. The pod named alertmanager-monitoring-prometheus-oper-alertmanager-0 is up and running, yet Alertmanager fails to resolve alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc, and I'm not sure why.
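The failing lookup can be reproduced from a throwaway pod to confirm whether it is a cluster DNS problem or a missing endpoint. This is a sketch: busybox:1.28 is chosen only because its nslookup output is readable, and the namespace comes from the FQDN in the error:

# Reproduce the failing lookup from inside the cluster.
kubectl run dns-test --rm -it --restart=Never -n monitoring --image=busybox:1.28 -- \
  nslookup alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc

# The per-pod record only resolves while the headless Service lists that pod as an endpoint.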


UPDATE: here are the warning logs

level=warn ts=2019-12-24T12:10:21.293Z caller=cluster.go:438 component=cluster msg=refresh result=failure addr=alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc:9094
level=warn ts=2019-12-24T12:10:21.323Z caller=cluster.go:438 component=cluster msg=refresh result=failure addr=alertmanager-monitoring-prometheus-oper-alertmanager-1.alertmanager-operated.monitoring.svc:9094
level=warn ts=2019-12-24T12:10:21.326Z caller=cluster.go:438 component=cluster msg=refresh result=failure addr=alertmanager-monitoring-prometheus-oper-alertmanager-2.alertmanager-operated.monitoring.svc:9094
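The three peer addresses above (-0, -1, -2) imply the gossip cluster expects three replicas, so it is worth confirming that all three StatefulSet pods exist and are Ready; if a pod does not exist, its per-pod DNS record never appears. A quick check, where the app=alertmanager label and the namespace are assumptions based on operator defaults:

# Confirm the Alertmanager StatefulSet has all replicas up and Ready
# (label selector and namespace are assumptions, not taken from the question).
kubectl get statefulset -n monitoring | grep alertmanager
kubectl get pods -n monitoring -l app=alertmanager -o wide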

This is the output of kubectl get svc -n mon:

NAME                                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
alertmanager-operated                     ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   6m4s
monitoring-grafana                        ClusterIP   100.11.215.226   <none>        80/TCP                       6m13s
monitoring-kube-state-metrics             ClusterIP   100.22.248.232   <none>        8080/TCP                     6m13s
monitoring-prometheus-node-exporter       ClusterIP   100.33.130.77    <none>        9100/TCP                     6m13s
monitoring-prometheus-oper-alertmanager   ClusterIP   100.33.228.217   <none>        9093/TCP                     6m13s
monitoring-prometheus-oper-operator       ClusterIP   100.21.229.204   <none>        8080/TCP,443/TCP             6m13s
monitoring-prometheus-oper-prometheus     ClusterIP   100.22.93.151    <none>        9090/TCP                     6m13s
prometheus-operated                       ClusterIP   None             <none>        9090/TCP                     5m54s
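alertmanager-operated is the headless Service (CLUSTER-IP None) that is supposed to publish those per-pod records on 9093/9094. Whether it actually selects the Alertmanager pods can be checked directly; the namespace is taken from the FQDN in the logs, and the grep is just for readability:

# Inspect the headless Service, its endpoints, and the pod labels it selects.
kubectl describe svc alertmanager-operated -n monitoring
kubectl get endpoints alertmanager-operated -n monitoring
kubectl get pods -n monitoring --show-labels | grep alertmanager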

Recommended answer

Proper debug steps to help with these kinds of scenarios:

  • Enable Alertmanager debug logging: add the argument --log.level=debug
  • Verify that the Alertmanager cluster forms correctly (check the /status endpoint and confirm that all peers are listed)
  • Verify that Prometheus is sending alerts to all Alertmanager peers (check Prometheus's /status endpoint and confirm that all Alertmanager peers are listed)
  • Test end to end: generate a test alert; you should see it in the Prometheus UI, then in the Alertmanager UI, and finally you should receive the alert notification (see the sketch after this list).
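For the last step, a hedged way to generate a test alert without waiting for a real rule to fire: port-forward the monitoring-prometheus-oper-alertmanager Service from the listing above and post a synthetic alert to Alertmanager's v2 API. The cluster: canary label is chosen so that, per the routing config in the question, it should land on the "str" email receiver; the other label values are illustrative.

# Reach Alertmanager locally via the ClusterIP service shown in the question.
kubectl port-forward -n monitoring svc/monitoring-prometheus-oper-alertmanager 9093:9093 &

# Post a synthetic alert; it should appear in the Alertmanager UI and, if routing
# and SMTP are correct, produce an email notification.
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
        "labels": {
          "alertname": "NotificationE2ETest",
          "cluster": "canary",
          "severity": "warning"
        },
        "annotations": { "summary": "end-to-end notification test" }
      }]'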