Monitoring and alerting in Kubernetes with prometheus + alertmanager

wupengyu, published 2019-07-01 16:50 / 3787 views

Monitoring and alerting prototype diagram

The diagram explained

prometheus and alertmanager run as containers in the same pod, which is managed by a Deployment controller. alertmanager listens on port 9093 by default, and because both containers share one pod, prometheus can reach alertmanager at localhost:9093 to send alert notifications. The alert rules file, rules.yml, is mounted into the prometheus container as a ConfigMap, and the notification receiver configuration is mounted into the alertmanager container the same way. Here we receive alert notifications by email; the details are in alertmanager.yml.

Test environment

Environment: Linux 3.10.0-693.el7.x86_64 x86_64 GNU/Linux
Platform: Kubernetes v1.10.5
Tips: the complete prometheus and alertmanager configuration is at the end of this document

Creating the alert rule

The prometheus configuration specifies the path of the alert rules file. rules.yml defines the alert rules; here we mount it under /etc/prometheus as a ConfigMap:

rule_files:
- /etc/prometheus/rules.yml

Here we define an InstanceDown alert: when a target's up metric has been 0 for 1 minute (that is, the host is down), prometheus raises the alert.

  rules.yml: |
    groups:
    - name: example
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
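
To make the timing of `for: 1m` concrete, here is a small Python sketch of the inactive / pending / firing transitions for a single target. It is a simplified model and not Prometheus code: the function name `alert_states` is made up for illustration, and the evaluation interval defaults to 1 minute on the assumption that Prometheus's default `evaluation_interval` applies, since the configuration in this article does not set one.

```python
from datetime import timedelta


def alert_states(samples, hold=timedelta(minutes=1), interval=timedelta(minutes=1)):
    """Return the alert state after each rule evaluation for one target.

    samples:  successive values of `up`, one per evaluation.
    hold:     the rule's `for` duration (1m in the rule above).
    interval: assumed time between evaluations (hypothetical parameter).
    """
    states = []
    active_for = None  # how long `up == 0` has been continuously true
    for value in samples:
        if value == 0:  # expr: up == 0
            if active_for is None:
                active_for = timedelta(0)  # first failing evaluation
            else:
                active_for += interval
            # The alert stays pending until the condition has held for `hold`.
            states.append("firing" if active_for >= hold else "pending")
        else:
            active_for = None  # target is back, the alert resets
            states.append("inactive")
    return states
```

In this model the alert is pending at the first failed evaluation and only fires once the condition has held for the full `for` duration; a recovered target resets it to inactive.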

Configuring prometheus to talk to alertmanager (so that prometheus can send alerts to alertmanager)

alertmanager listens on port 9093 by default, and because prometheus and alertmanager are in the same pod, prometheus can reach alertmanager directly at localhost:9093.

alerting:
  alertmanagers:
  - static_configs:
    - targets: ["localhost:9093"]
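
prometheus pushes alerts to alertmanager over HTTP, so a rough way to check that the address is reachable is to push a hand-made alert the same way. The Python sketch below is not part of the original setup. The helper names `build_alert` and `post_alerts` are invented here, and the `/api/v1/alerts` path is an assumption that fits the alertmanager v0.14.0 image used in this article (newer releases expose `/api/v2/alerts` instead).

```python
import json
import urllib.request
from datetime import datetime, timezone

# Same address prometheus uses inside the pod; adjust when calling from elsewhere.
ALERTMANAGER_URL = "http://localhost:9093"


def build_alert(alertname, instance, severity="page"):
    """Build one alert as a dict in the JSON shape the alerts endpoint accepts."""
    return {
        "labels": {"alertname": alertname, "instance": instance, "severity": severity},
        "annotations": {"summary": f"Instance {instance} down"},
        "startsAt": datetime.now(timezone.utc).isoformat(),  # RFC 3339 timestamp
    }


def post_alerts(alerts, base_url=ALERTMANAGER_URL):
    """POST a list of alerts and return the HTTP status code.

    The /api/v1/alerts path is assumed for alertmanager v0.14.0.
    """
    request = urllib.request.Request(
        f"{base_url}/api/v1/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

Calling `post_alerts([build_alert("InstanceDown", "10.0.0.1")])` from a machine that can reach alertmanager, with `base_url` pointed at a reachable address, should make a test alert appear in the alertmanager web UI. The instance value `10.0.0.1` is a placeholder, and the test alert is routed like a real one, so it may also trigger a notification email.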

Configuring the alertmanager notification receiver

This example uses email: when alertmanager receives an alert from prometheus, it sends an alert email to the specified mailbox. This configuration is also mounted as a ConfigMap into the alertmanager container.

  alertmanager.yml: |-
    global:
      smtp_smarthost: "smtp.exmail.qq.com:465"
      smtp_from: "xin.liu@woqutech.com"
      smtp_auth_username: "xin.liu@woqutech.com"
      smtp_auth_password: "xxxxxxxxxxxx"
      smtp_require_tls: false
    route:
      group_by: [alertname]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 10m
      receiver: default-receiver
    receivers:
    - name: "default-receiver"
      email_configs:
      - to: "1148576125@qq.com"
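
One detail in this file is easy to get wrong: every `receiver` named in the routing tree has to match an entry under `receivers`, otherwise alertmanager rejects the configuration at load time. The Python sketch below illustrates that check on the configuration written out as a plain dict. The function name `undefined_receivers` is hypothetical and this is not alertmanager's own validation code.

```python
def undefined_receivers(config):
    """Return receiver names used by the routing tree but missing from `receivers`."""
    defined = {receiver["name"] for receiver in config.get("receivers", [])}
    used = set()
    pending = [config.get("route", {})]
    while pending:  # walk the root route and any nested child routes
        route = pending.pop()
        if "receiver" in route:
            used.add(route["receiver"])
        pending.extend(route.get("routes", []))
    return used - defined


# The route and receivers from the alertmanager.yml above, as a Python dict
# (SMTP settings and the email address are left out for brevity).
config = {
    "route": {
        "group_by": ["alertname"],
        "group_wait": "30s",
        "group_interval": "5m",
        "repeat_interval": "10m",
        "receiver": "default-receiver",
    },
    "receivers": [{"name": "default-receiver"}],
}
```

For the configuration in this article the function returns an empty set, because the only receiver referenced, `default-receiver`, is defined.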

The prototype in action

The configured alert rules are visible in the prometheus web UI.

To test the setup, shut down one host node:
The prometheus web UI then shows an InstanceDown alert being triggered.

The alertmanager web UI shows that alertmanager has received the alert sent by prometheus.

The mailbox designated to receive alerts gets the alert email sent by alertmanager.

Full configuration

node_exporter_daemonset.yaml

apiVersion: extensions/v1beta1  # served by the v1.10.5 test cluster; Kubernetes 1.16+ needs apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: kube-system
  labels:
    app: node_exporter
spec:
  selector:
    matchLabels:
      name: node_exporter
  template:
    metadata:
      labels:
        name: node_exporter
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: node-exporter
        image: alery/node-exporter:1.0
        ports:
        - name: node-exporter
          containerPort: 9100
          hostPort: 9100
        volumeMounts:
        - name: localtime
          mountPath: /etc/localtime
        - name: host
          mountPath: /host
          readOnly: true
      volumes:
      - name: localtime
        hostPath:
          path: /usr/share/zoneinfo/Asia/Shanghai
      - name: host
        hostPath:
          path: /

alertmanager-cm.yaml

kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: kube-system
data:
  alertmanager.yml: |-
    global:
      smtp_smarthost: "smtp.exmail.qq.com:465"
      smtp_from: "xin.liu@woqutech.com"
      smtp_auth_username: "xin.liu@woqutech.com"
      smtp_auth_password: "xxxxxxxxxxxx"
      smtp_require_tls: false
    route:
      group_by: [alertname]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 10m
      receiver: default-receiver
    receivers:
    - name: "default-receiver"
      email_configs:
      - to: "1148576125@qq.com"

prometheus-rbac.yaml

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus  # ClusterRole is cluster-scoped, so it takes no namespace
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus  # ClusterRoleBinding is cluster-scoped, so it takes no namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: kube-system

prometheus-cm.yaml

kind: ConfigMap
apiVersion: v1
data:
  prometheus.yml: |
    rule_files:
    - /etc/prometheus/rules.yml
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["localhost:9093"]
    scrape_configs:
    - job_name: "node"
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_ip]
        action: replace
        target_label: __address__
        replacement: $1:9100
      - source_labels: [__meta_kubernetes_pod_host_ip]
        action: replace
        target_label: instance
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: node_name
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(name)
      - source_labels: [__meta_kubernetes_pod_label_name]
        regex: node_exporter
        action: keep

  rules.yml: |
    groups:
    - name: example
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
      - alert: APIHighRequestLatency
        expr: api_http_request_latencies_second{quantile="0.5"} > 1
        for: 10m
        annotations:
          summary: "High request latency on {{ $labels.instance }}"
          description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"

metadata:
  name: prometheus-config-v0.1.0
  namespace: kube-system
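
The relabel_configs of the node job in prometheus-cm.yaml are given without explanation, so here is a rough Python sketch of their effect on one discovered pod. It mirrors the rules by hand and is not a general relabeling engine. The function name `relabel_node_target` and the sample values in the usage note are hypothetical.

```python
def relabel_node_target(meta):
    """Apply the node job's relabel rules to one discovered pod.

    meta: dict of the pod's __meta_kubernetes_* discovery labels.
    Returns the resulting labels, or None if the `keep` rule drops the pod.
    """
    # action: keep -> only pods labelled name=node_exporter are scraped
    # (the regex must match the whole label value).
    if meta.get("__meta_kubernetes_pod_label_name", "") != "node_exporter":
        return None
    return {
        # Scrape address: pod IP plus the node-exporter port.
        # Used to reach the target; not stored as a label on the series.
        "__address__": f'{meta["__meta_kubernetes_pod_ip"]}:9100',
        # instance is overwritten with the IP of the host running the pod.
        "instance": meta["__meta_kubernetes_pod_host_ip"],
        "node_name": meta["__meta_kubernetes_pod_node_name"],
        # labelmap copies the pod label `name` to a target label `name`.
        "name": meta["__meta_kubernetes_pod_label_name"],
        "job": "node",
    }
```

For a node-exporter pod with pod IP 10.244.1.7 on host 192.168.0.11, the target would be scraped at 10.244.1.7:9100 while its series carry `instance="192.168.0.11"`, which is the value that appears in the InstanceDown alert text.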

prometheus.yaml

apiVersion: extensions/v1beta1  # served by the v1.10.5 test cluster; Kubernetes 1.16+ needs apps/v1
kind: Deployment
metadata:
  namespace: kube-system
  name: prometheus
  labels:
    name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      name: prometheus
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      nodeSelector:
        node-role.kubernetes.io/master: ""
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      securityContext:
        runAsUser: 0
        fsGroup: 0
      containers:
      - name: prometheus
        image: prom/prometheus:v2.4.0
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        ports:
        - name: web
          containerPort: 9090
        volumeMounts:
        - name: prometheus-config
          mountPath: /etc/prometheus
        - name: prometheus-storage
          mountPath: /prometheus
        - name: localtime
          mountPath: /etc/localtime
      - name: alertmanager
        image: prom/alertmanager:v0.14.0
        args:
        - "--config.file=/etc/alertmanager/alertmanager.yml"
        - "--log.level=debug"
        ports:
        - containerPort: 9093
          protocol: TCP
          name: alertmanager
        volumeMounts:
        - name: alertmanager-config
          mountPath: /etc/alertmanager
        - name: alertmanager-storage
          mountPath: /alertmanager
        - name: localtime
          mountPath: /etc/localtime
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config-v0.1.0
      - name: alertmanager-config
        configMap:
          name: alertmanager
      - name: localtime
        hostPath:
          path: /usr/share/zoneinfo/Asia/Shanghai
      - name: prometheus-storage
        hostPath:
          path: /gaea/prometheus
          type: DirectoryOrCreate
      - name: alertmanager-storage
        hostPath:
          path: /gaea/alertmanager
          type: DirectoryOrCreate
---
apiVersion: v1
kind: Service
metadata:
  labels:
    name: prometheus
    kubernetes.io/cluster-service: "true"
  name: prometheus
  namespace: kube-system
spec:
  ports:
  - name: prometheus
    nodePort: 30065
    port: 9090
    protocol: TCP
    targetPort: 9090
  selector:
    app: prometheus
  sessionAffinity: None
  type: NodePort
---
apiVersion: v1
kind: Service
metadata:
  labels:
    name: prometheus
    kubernetes.io/cluster-service: "true"
  name: alertmanager
  namespace: kube-system
spec:
  ports:
  - name: alertmanager
    nodePort: 30066
    port: 9093
    protocol: TCP
    targetPort: 9093
  selector:
    app: prometheus
  sessionAffinity: None
  type: NodePort

When reposting, please cite the address of this article: http://specialneedsforspecialkids.com/yun/32736.html
