Alerting on JVM Anomalies with Prometheus + Alertmanager

lushan, published 2019-08-16 13:31 / 1971 reads


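The previous article showed how to monitor the JVM with Prometheus + Grafana. This article shows how to use Prometheus + Alertmanager to raise alerts on certain JVM conditions.

The scripts mentioned in this article can be downloaded here.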

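Summary

Tools used:

Docker: most of the applications in this article are started with Docker.

Prometheus: scrapes and stores metrics and provides a query interface; this article focuses on its alerting feature.

Grafana: visualizes the data (not the focus here; it only lets the reader see the abnormal metrics directly).

Alertmanager: delivers alert notifications to the relevant people.

JMX exporter: exposes the JVM-related metrics available through JMX.

Tomcat: stands in for a Java application.

The rough steps are:

1) Use JMX exporter to start a small HTTP server inside the Java process.

2) Configure Prometheus to scrape the metrics that HTTP server exposes.

3) Configure the Prometheus alert rules:

   heap usage above 50%, 80%, 90% of the maximum

   instance down for more than 30 seconds, 1 minute, 5 minutes

   Old GC time above 30%, 50%, 80% of the last 5 minutes

4) Configure Grafana to connect to Prometheus, and set up a dashboard.

5) Configure the Alertmanager notification rules.

The alerting flow is roughly:

1) Prometheus evaluates the alert rules to see whether an alert is triggered; if so, it sends the alert to Alertmanager.

2) Alertmanager receives the alert and decides whether to send a notification and, if so, to whom.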

Step 1: Start a few Java applications

1) Create a directory named prom-jvm-demo.

2) Download JMX exporter into this directory.

3) Create a file simple-config.yml with the following content:

---
lowercaseOutputLabelNames: true
lowercaseOutputName: true
whitelistObjectNames: ["java.lang:type=OperatingSystem"]
rules:
 - pattern: 'java.lang<type=OperatingSystem><>((?!process_cpu_time)\w+):'
   name: os_$1
   type: GAUGE
   attrNameSnakeCase: true

4) Run the following commands to start three Tomcat instances. Remember to replace <path-to-prom-jvm-demo> with the correct path (-Xmx and -Xms are deliberately set very small here so that the alert conditions are triggered):

docker run -d \
  --name tomcat-1 \
  -v <path-to-prom-jvm-demo>:/jmx-exporter \
  -e CATALINA_OPTS="-Xms32m -Xmx32m -javaagent:/jmx-exporter/jmx_prometheus_javaagent-0.3.1.jar=6060:/jmx-exporter/simple-config.yml" \
  -p 6060:6060 \
  -p 8080:8080 \
  tomcat:8.5-alpine

docker run -d \
  --name tomcat-2 \
  -v <path-to-prom-jvm-demo>:/jmx-exporter \
  -e CATALINA_OPTS="-Xms32m -Xmx32m -javaagent:/jmx-exporter/jmx_prometheus_javaagent-0.3.1.jar=6060:/jmx-exporter/simple-config.yml" \
  -p 6061:6060 \
  -p 8081:8080 \
  tomcat:8.5-alpine

docker run -d \
  --name tomcat-3 \
  -v <path-to-prom-jvm-demo>:/jmx-exporter \
  -e CATALINA_OPTS="-Xms32m -Xmx32m -javaagent:/jmx-exporter/jmx_prometheus_javaagent-0.3.1.jar=6060:/jmx-exporter/simple-config.yml" \
  -p 6062:6060 \
  -p 8082:8080 \
  tomcat:8.5-alpine

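5) Visit http://localhost:8080|8081|8082 to check that each Tomcat started successfully.

6) Visit the corresponding http://localhost:6060|6061|6062 to see the metrics exposed by JMX exporter.

Note: the simple-config.yml given here only exposes JVM information; for more complex configurations, see the JMX exporter documentation.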
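To make the exporter output concrete, here is a minimal Python sketch that parses a metrics page and computes heap usage the same way the alert rules later in this article do. The sample text and its values are invented for illustration, and the parser is a simplification that ignores optional timestamps; only the metric names jvm_memory_bytes_used and jvm_memory_bytes_max come from the article.

```python
import re

# Hypothetical excerpt of what http://localhost:6060 might return (values invented).
SAMPLE = """\
# TYPE jvm_memory_bytes_used gauge
jvm_memory_bytes_used{area="heap",} 2.5165824E7
jvm_memory_bytes_used{area="nonheap",} 3.1457280E7
# TYPE jvm_memory_bytes_max gauge
jvm_memory_bytes_max{area="heap",} 3.3554432E7
jvm_memory_bytes_max{area="nonheap",} -1.0
"""

SAMPLE_LINE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Map (metric name, frozenset of label pairs) to a float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # not a plain "name{labels} value" line
        name, labels, value = match.groups()
        samples[(name, frozenset(LABEL.findall(labels or "")))] = float(value)
    return samples


def heap_usage_percent(samples):
    """used / max * 100 for area="heap", as in the heap alert expression."""
    heap = frozenset({("area", "heap")})
    used = samples[("jvm_memory_bytes_used", heap)]
    maximum = samples[("jvm_memory_bytes_max", heap)]
    return used / maximum * 100


if __name__ == "__main__":
    print(f"heap usage: {heap_usage_percent(parse_metrics(SAMPLE)):.1f}%")
```

With -Xmx32m, 24 MiB of used heap corresponds to 75%, which would already exceed the 50% threshold.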

Step 2: Start Prometheus

1) In the prom-jvm-demo directory created earlier, create a file prom-jmx.yml with the following content:

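scrape_configs:
  - job_name: "java"
    static_configs:
    - targets:
      # replace <host-ip> with the IP address of the machine running the containers
      - "<host-ip>:6060"
      - "<host-ip>:6061"
      - "<host-ip>:6062"

# address of Alertmanager
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - "<host-ip>:9093"

# load the alert rules
rule_files:
  - "/prometheus-config/prom-alert-rules.yml"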

2) Create a file prom-alert-rules.yml, which holds the alert rules:

# severity levels, from most to least severe: red, orange, yellow, blue
groups:
  - name: jvm-alerting
    rules:

    # down for more than 30 seconds
    - alert: instance-down
      expr: up == 0
      for: 30s
      labels:
        severity: yellow
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 30 seconds."

    # down for more than 1 minute
    - alert: instance-down
      expr: up == 0
      for: 1m
      labels:
        severity: orange
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

    # down for more than 5 minutes
    - alert: instance-down
      expr: up == 0
      for: 5m
      labels:
        severity: red
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

    # heap usage above 50%
    - alert: heap-usage-too-much
      expr: jvm_memory_bytes_used{job="java", area="heap"} / jvm_memory_bytes_max * 100 > 50
      for: 1m
      labels:
        severity: yellow
      annotations:
        summary: "JVM Instance {{ $labels.instance }} memory usage > 50%"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [heap usage > 50%] for more than 1 minute. current usage ({{ $value }}%)"

    # heap usage above 80%
    - alert: heap-usage-too-much
      expr: jvm_memory_bytes_used{job="java", area="heap"} / jvm_memory_bytes_max * 100 > 80
      for: 1m
      labels:
        severity: orange
      annotations:
        summary: "JVM Instance {{ $labels.instance }} memory usage > 80%"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [heap usage > 80%] for more than 1 minute. current usage ({{ $value }}%)"

    # heap usage above 90%
    - alert: heap-usage-too-much
      expr: jvm_memory_bytes_used{job="java", area="heap"} / jvm_memory_bytes_max * 100 > 90
      for: 1m
      labels:
        severity: red
      annotations:
        summary: "JVM Instance {{ $labels.instance }} memory usage > 90%"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [heap usage > 90%] for more than 1 minute. current usage ({{ $value }}%)"

    # Old GC took more than 30% of the last 5 minutes
    - alert: old-gc-time-too-much
      expr: increase(jvm_gc_collection_seconds_sum{gc="PS MarkSweep"}[5m]) > 5 * 60 * 0.3
      for: 5m
      labels:
        severity: yellow
      annotations:
        summary: "JVM Instance {{ $labels.instance }} Old GC time > 30% running time"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [Old GC time > 30% running time] for more than 5 minutes. Old GC seconds in the last 5 minutes: {{ $value }}"

    # Old GC took more than 50% of the last 5 minutes
    - alert: old-gc-time-too-much
      expr: increase(jvm_gc_collection_seconds_sum{gc="PS MarkSweep"}[5m]) > 5 * 60 * 0.5
      for: 5m
      labels:
        severity: orange
      annotations:
        summary: "JVM Instance {{ $labels.instance }} Old GC time > 50% running time"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [Old GC time > 50% running time] for more than 5 minutes. Old GC seconds in the last 5 minutes: {{ $value }}"

    # Old GC took more than 80% of the last 5 minutes
    - alert: old-gc-time-too-much
      expr: increase(jvm_gc_collection_seconds_sum{gc="PS MarkSweep"}[5m]) > 5 * 60 * 0.8
      for: 5m
      labels:
        severity: red
      annotations:
        summary: "JVM Instance {{ $labels.instance }} Old GC time > 80% running time"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [Old GC time > 80% running time] for more than 5 minutes. Old GC seconds in the last 5 minutes: {{ $value }}"
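The Old GC thresholds in the rules above are written as seconds (for example 5 * 60 * 0.3 = 90 seconds), which is easy to misread. The Python sketch below restates them as a fraction of the 5-minute window. It is a simplified model: the real increase() function also extrapolates and handles counter resets, and the function names here are hypothetical. Note also that gc="PS MarkSweep" is the old-generation collector name under the Parallel collector; a JVM running another collector reports a different label value.

```python
WINDOW_SECONDS = 5 * 60  # the [5m] range used in the rules

# (severity, fraction of the window) taken from the three rules, highest first
TIERS = [("red", 0.8), ("orange", 0.5), ("yellow", 0.3)]


def old_gc_fraction(sum_start, sum_end, window_seconds=WINDOW_SECONDS):
    """Share of the window spent in Old GC, given two readings of
    jvm_gc_collection_seconds_sum taken window_seconds apart."""
    return (sum_end - sum_start) / window_seconds


def highest_severity(fraction):
    """Highest tier whose threshold is exceeded, or None if none is."""
    for severity, threshold in TIERS:
        if fraction > threshold:
            return severity
    return None


if __name__ == "__main__":
    # 120 seconds of Old GC within 300 seconds is 40% of the window.
    print(highest_severity(old_gc_fraction(10.0, 130.0)))
```

In Prometheus every tier whose threshold is exceeded matches, not only the highest one, and each must also hold for the full `for: 5m` before it fires.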

3) Start Prometheus (replace <path-to-prom-jvm-demo> with the correct path):

docker run -d \
  --name=prometheus \
  -p 9090:9090 \
  -v <path-to-prom-jvm-demo>:/prometheus-config \
  prom/prometheus --config.file=/prometheus-config/prom-jmx.yml

4) Visit http://localhost:9090/alerts; the alert rules configured earlier should be listed there.

If the three instances do not show up yet, wait a moment and try again.
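The same check can be scripted against the Prometheus HTTP API instead of the web page. The Python sketch below is one possible approach: it builds an instant query for up{job="java"} and lists the instances whose value is 0. The sample response is invented for illustration and the function names are hypothetical; fetch_down_instances assumes Prometheus is reachable at http://localhost:9090 as started above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical instant-query response for up{job="java"} (values invented).
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "instance": "host:6060", "job": "java"},
             "value": [1565933460.0, "1"]},
            {"metric": {"__name__": "up", "instance": "host:6061", "job": "java"},
             "value": [1565933460.0, "0"]},
        ],
    },
}


def query_url(base, promql):
    """URL of an instant query against the /api/v1/query endpoint."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


def down_instances(payload):
    """Instances whose `up` sample is 0 in an instant-query response."""
    return sorted(
        result["metric"]["instance"]
        for result in payload["data"]["result"]
        if float(result["value"][1]) == 0
    )


def fetch_down_instances(base="http://localhost:9090"):
    """Query a running Prometheus; requires network access to `base`."""
    with urlopen(query_url(base, 'up{job="java"}')) as response:
        return down_instances(json.load(response))


if __name__ == "__main__":
    print(down_instances(SAMPLE_RESPONSE))
```

An instance that appears in this list for longer than 30 seconds is what the first instance-down rule reacts to.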

Step 3: Configure Grafana

See "Monitoring the JVM with Prometheus + Grafana".

Step 4: Start Alertmanager

1) Create a file alertmanager-config.yml:

global:
  smtp_smarthost: ""
  smtp_from: ""
  smtp_auth_username: ""
  smtp_auth_password: ""

# The directory from which notification templates are read.
templates: 
- "/alertmanager-config/*.tmpl"

# The root route on which each incoming alert enters.
route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ["alertname", "instance"]

  # When a new group of alerts is created by an incoming alert, wait at
  # least "group_wait" to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first 
  # notification.
  group_wait: 30s

  # When the first notification was sent, wait "group_interval" to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m

  # If an alert has successfully been sent, wait "repeat_interval" to
  # resend them.
  repeat_interval: 3h 

  # A default receiver
  receiver: "user-a"

# Inhibition rules allow to mute a set of alerts given that another alert is
# firing.
# We use this to mute any warning-level notifications if the same alert is 
# already critical.
inhibit_rules:
- source_match:
    severity: "red"
  target_match_re:
    severity: ^(blue|yellow|orange)$
  # Apply inhibition if the alertname and instance is the same.
  equal: ["alertname", "instance"]
- source_match:
    severity: "orange"
  target_match_re:
    severity: ^(blue|yellow)$
  # Apply inhibition if the alertname and instance is the same.
  equal: ["alertname", "instance"]
- source_match:
    severity: "yellow"
  target_match_re:
    severity: ^(blue)$
  # Apply inhibition if the alertname and instance is the same.
  equal: ["alertname", "instance"]

receivers:
- name: "user-a"
  email_configs:
  - to: ""

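Fill in the smtp_* settings and the email address of user-a at the bottom.

~~Note: almost no mail providers in China support TLS, and Alertmanager does not currently support SSL, so please use Gmail or another mailbox that supports TLS to send alert emails; see this issue.~~ This problem has since been fixed. Below is a configuration example for Alibaba Cloud enterprise mail: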

smtp_smarthost: "smtp.qiye.aliyun.com:465"
smtp_hello: "company.com"
smtp_from: "username@company.com"
smtp_auth_username: "username@company.com"
smtp_auth_password: password
smtp_require_tls: false
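The inhibit_rules in alertmanager-config.yml are what keep the three severity tiers of one alert from producing three emails. The Python sketch below is a simplified model of that behaviour, not Alertmanager's actual implementation: an alert is muted when another firing alert with the same alertname and instance has a severity that inhibits it. The function name and the sample alerts are hypothetical.

```python
# Mirrors the three inhibit_rules: source severity -> severities it mutes
INHIBITS = {
    "red": {"blue", "yellow", "orange"},
    "orange": {"blue", "yellow"},
    "yellow": {"blue"},
}
EQUAL = ("alertname", "instance")  # labels that must match, as in `equal`


def notified(alerts):
    """Return the firing alerts that no other firing alert inhibits."""
    result = []
    for target in alerts:
        muted = any(
            target["severity"] in INHIBITS.get(source["severity"], set())
            and all(source.get(label) == target.get(label) for label in EQUAL)
            for source in alerts
        )
        if not muted:
            result.append(target)
    return result


if __name__ == "__main__":
    firing = [
        {"alertname": "instance-down", "instance": "host:6060", "severity": "yellow"},
        {"alertname": "instance-down", "instance": "host:6060", "severity": "orange"},
        {"alertname": "instance-down", "instance": "host:6060", "severity": "red"},
        {"alertname": "heap-usage-too-much", "instance": "host:6061", "severity": "yellow"},
    ]
    for alert in notified(firing):
        print(alert["alertname"], alert["instance"], alert["severity"])
```

Once an instance has been down for more than 5 minutes all three instance-down rules fire, but under this model only the red one leads to a notification; the yellow heap alert on a different instance is unaffected.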

2) Create a file alert-template.tmpl, which is the email body template:

{{ define "email.default.html" }}
Summary
  
{{ .CommonAnnotations.summary }}

Description

{{ .CommonAnnotations.description }}
{{ end}}

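3) Run the following command to start Alertmanager (replace <path-to-prom-jvm-demo> with the correct path):

docker run -d \
  --name=alertmanager \
  -v <path-to-prom-jvm-demo>:/alertmanager-config \
  -p 9093:9093 \
  prom/alertmanager:master --config.file=/alertmanager-config/alertmanager-config.yml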

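4) Visit http://localhost:9093 and check whether the alerts sent by Prometheus have arrived (if nothing shows up yet, wait a moment).

Step 5: Wait for the email

Wait a while (at most 5 minutes) and see whether the email arrives. If it does not, check that the configuration is correct, or run docker logs alertmanager to inspect the Alertmanager logs; in most cases the cause is a misconfigured mailbox.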

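Copyright of this article belongs to the author; do not repost without permission. If this article violates any rules, you can contact the administrator to have it removed.

When reposting, please cite this article's address: http://specialneedsforspecialkids.com/yun/71889.html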
