Monitoring and alerting prototype diagram
Diagram explanation
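Prometheus and Alertmanager run as two containers in the same Pod, which is managed by a Deployment controller. Alertmanager listens on port 9093 by default. Because both containers share the Pod, Prometheus can reach Alertmanager directly at localhost:9093 to send it alert notifications. The alerting rules file, rules.yml, is mounted from a ConfigMap into the Prometheus container. The notification receiver configuration is likewise mounted from a ConfigMap into the Alertmanager container. Here alerts are delivered by email, as configured in alertmanager.yml.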
Test environment
OS: Linux 3.10.0-693.el7.x86_64 x86_64 GNU/Linux
Platform: Kubernetes v1.10.5
Tip: the complete Prometheus and Alertmanager configuration is given at the end of this article.
First, tell Prometheus where its alerting rules live. rules.yml holds the alerting rules; here it is mounted from a ConfigMap into the /etc/prometheus directory:
rule_files:
  - /etc/prometheus/rules.yml
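A ConfigMap volume turns each key under `data` into a file named after that key inside the `mountPath`. That is how the `rules.yml` key ends up at `/etc/prometheus/rules.yml`, the path listed in `rule_files`. The Python sketch below models that mapping under a simplifying assumption: it ignores the `items` and `subPath` options a real volume spec may use. The function names `mounted_files` and `missing_rule_files` are hypothetical helpers, not Kubernetes or Prometheus APIs.

```python
from pathlib import PurePosixPath


def mounted_files(mount_path: str, configmap_data: dict) -> dict:
    """Return {path inside the container: content} for a ConfigMap volume.

    Simplified model: every key in `data` is projected as a file named
    after the key, directly under `mount_path` (no `items` / `subPath`).
    """
    base = PurePosixPath(mount_path)
    return {str(base / key): value for key, value in configmap_data.items()}


def missing_rule_files(rule_files: list, files: dict) -> list:
    """List rule_files entries that no ConfigMap key would provide."""
    return [path for path in rule_files if path not in files]


if __name__ == "__main__":
    # Keys of the prometheus-config ConfigMap; file contents elided here.
    data = {"prometheus.yml": "...", "rules.yml": "..."}
    files = mounted_files("/etc/prometheus", data)
    print(sorted(files))
    print(missing_rule_files(["/etc/prometheus/rules.yml"], files))
```

An empty list from `missing_rule_files` means every rule file Prometheus expects is backed by a ConfigMap key.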
Here we define a single InstanceDown alert: when a host has been down for 1 minute, Prometheus fires an alert.
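The `for: 1m` clause is what turns "the host is down" into "the host has been down for 1 minute". While `up == 0` holds, Prometheus keeps the alert in the pending state. It only switches the alert to firing, and sends it to Alertmanager, once the condition has held continuously for the `for` duration. If the condition clears earlier, the alert resets. The Python sketch below is a simplified model of that state machine, assuming rule evaluations at given timestamps in seconds. The class name and sample timestamps are illustrative, not Prometheus internals.

```python
from typing import Optional


class ForClauseAlert:
    """Simplified model of a Prometheus alerting rule with a `for` duration."""

    def __init__(self, for_seconds: float):
        self.for_seconds = for_seconds
        self.active_at: Optional[float] = None  # when the condition first held
        self.state = "inactive"

    def evaluate(self, ts: float, condition_true: bool) -> str:
        if not condition_true:
            # Condition cleared: a pending alert is dropped, a firing one resolves.
            self.active_at = None
            self.state = "inactive"
        elif self.active_at is None:
            # Condition just started to hold: start the pending timer.
            self.active_at = ts
            self.state = "pending"
        elif ts - self.active_at >= self.for_seconds:
            # Held continuously for at least `for`: the alert fires.
            self.state = "firing"
        return self.state


if __name__ == "__main__":
    rule = ForClauseAlert(for_seconds=60)  # for: 1m
    # (timestamp in seconds, whether `up == 0` is true for the target)
    samples = [(0, False), (15, True), (30, True), (60, True), (75, True), (90, False)]
    for ts, down in samples:
        print(ts, rule.evaluate(ts, down))
```

With these samples the alert becomes pending at t=15 s and fires at t=75 s, once `up == 0` has held for 60 s. It returns to inactive when the host comes back at t=90 s.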
rules.yml: |
  groups:
  - name: example
    rules:
    - alert: InstanceDown
      expr: up == 0
      for: 1m
      labels:
        severity: page
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
Configure communication between Prometheus and Alertmanager (used by Prometheus to send alerts to Alertmanager)
Alertmanager listens on port 9093 by default. Since Prometheus and Alertmanager run in the same Pod, Prometheus can reach Alertmanager directly at localhost:9093:
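Prometheus uses `localhost:9093` to make an HTTP POST of a JSON list of alerts to Alertmanager's API. The same request can be built by hand to test the Alertmanager side on its own. The sketch below assumes the `/api/v1/alerts` endpoint of Alertmanager 0.14, the version used in this article; newer Alertmanager releases use `/api/v2/alerts` instead. It also assumes the script runs somewhere `localhost:9093` reaches Alertmanager, such as inside the Pod. The label values are illustrative.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_alert_payload(alertname: str, instance: str, summary: str) -> bytes:
    """Build the JSON body for Alertmanager's POST /api/v1/alerts.

    The body is a list of alerts. Labels drive routing and grouping
    (e.g. group_by: [alertname]); annotations carry the human-readable text.
    """
    alert = {
        "labels": {"alertname": alertname, "instance": instance, "severity": "page"},
        "annotations": {"summary": summary},
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps([alert]).encode("utf-8")


def send_test_alert(base_url: str, body: bytes) -> int:
    """POST the alerts to Alertmanager and return the HTTP status code."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/v1/alerts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


if __name__ == "__main__":
    body = build_alert_payload("InstanceDown", "192.0.2.10", "manual test alert")
    # Same address Prometheus uses inside the shared Pod.
    print(send_test_alert("http://localhost:9093", body))
```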
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["localhost:9093"]
Configure the Alertmanager notification receivers
Here we use email alerting as an example. When Alertmanager receives an alert from Prometheus, it sends an alert email to the configured mailbox. This configuration is also mounted from a ConfigMap into the Alertmanager container:
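It can help to verify the SMTP credentials on their own before putting them into alertmanager.yml. Port 465 is the implicit-TLS (SMTPS) port, so the sketch below connects with Python's `smtplib.SMTP_SSL` rather than upgrading a plain connection with STARTTLS. The host, port and addresses mirror the article's alertmanager.yml. The password placeholder and the subject and body text are assumptions to replace with real values.

```python
import smtplib
import ssl
from email.message import EmailMessage


def build_test_message(sender: str, recipient: str) -> EmailMessage:
    """Build a plain-text test mail standing in for an alert notification."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "[TEST] alertmanager SMTP check"
    msg.set_content("If you can read this, the SMTP settings work.")
    return msg


def send_via_smtps(host: str, port: int, user: str, password: str, msg: EmailMessage) -> None:
    """Connect over implicit TLS (port 465), log in, and submit the message."""
    context = ssl.create_default_context()
    with smtplib.SMTP_SSL(host, port, context=context, timeout=10) as server:
        server.login(user, password)
        server.send_message(msg)


if __name__ == "__main__":
    sender = "xin.liu@woqutech.com"  # smtp_from / smtp_auth_username
    message = build_test_message(sender, "1148576125@qq.com")
    send_via_smtps("smtp.exmail.qq.com", 465, sender, "xxxxxxxxxxxx", message)
```

If this mail arrives, any remaining delivery problem is more likely in the Alertmanager routing than in the SMTP account.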
alertmanager.yml: |-
  global:
    smtp_smarthost: "smtp.exmail.qq.com:465"
    smtp_from: "xin.liu@woqutech.com"
    smtp_auth_username: "xin.liu@woqutech.com"
    smtp_auth_password: "xxxxxxxxxxxx"
    smtp_require_tls: false
  route:
    group_by: [alertname]
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 10m
    receiver: default-receiver
  receivers:
  - name: "default-receiver"
    email_configs:
    - to: "1148576125@qq.com"
Prototype results
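The configured alerting rules are visible in the Prometheus web UI.
To test the setup, shut down one of the host nodes.
The Prometheus web UI then shows that an InstanceDown alert has fired.
The Alertmanager web UI shows the alert that Alertmanager received from Prometheus.
The designated mailbox receives the alert email sent by Alertmanager.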
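The web UI check can also be scripted. Prometheus exposes active alerts as the synthetic `ALERTS` series, whose `alertstate` label is `pending` or `firing`. Its `/api/v1/query` HTTP endpoint evaluates such an expression. The sketch below assumes Prometheus is reached through NodePort 30065 from prometheus.yaml. The node address `192.0.2.10` is a placeholder.

```python
import json
import urllib.parse
import urllib.request


def alerts_query_url(base_url: str, alertname: str) -> str:
    """Build an instant-query URL selecting the firing series of one alert."""
    expr = 'ALERTS{alertname="%s",alertstate="firing"}' % alertname
    return base_url.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def firing_instances(response: dict) -> list:
    """Extract the `instance` label of each firing series from a query response."""
    if response.get("status") != "success":
        raise RuntimeError("query failed: %r" % response.get("error"))
    return sorted(r["metric"].get("instance", "") for r in response["data"]["result"])


if __name__ == "__main__":
    url = alerts_query_url("http://192.0.2.10:30065", "InstanceDown")
    with urllib.request.urlopen(url, timeout=10) as resp:
        print(firing_instances(json.load(resp)))
```

After a node has been shut down for longer than the rule's `for` duration, its host IP should appear in the printed list.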
Complete configuration
node_exporter_daemonset.yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: kube-system
  labels:
    app: node_exporter
spec:
  selector:
    matchLabels:
      name: node_exporter
  template:
    metadata:
      labels:
        name: node_exporter
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: node-exporter
        image: alery/node-exporter:1.0
        ports:
        - name: node-exporter
          containerPort: 9100
          hostPort: 9100
        volumeMounts:
        - name: localtime
          mountPath: /etc/localtime
        - name: host
          mountPath: /host
          readOnly: true
      volumes:
      - name: localtime
        hostPath:
          path: /usr/share/zoneinfo/Asia/Shanghai
      - name: host
        hostPath:
          path: /
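The DaemonSet publishes containerPort 9100 as hostPort 9100, so every node should answer on `http://<node-ip>:9100/metrics`. Fetching that page and looking for `node_` series is a quick check before Prometheus scrapes it. The sketch below uses a deliberately minimal reader of the Prometheus text format. It skips comment lines and counts samples per metric name, but it does not handle every edge case of the format. `192.0.2.10` is a placeholder node IP.

```python
import re
import urllib.request
from collections import Counter

# A metric name as defined by the Prometheus data model.
METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def metric_names(exposition: str) -> Counter:
    """Count samples per metric name in Prometheus text exposition output."""
    counts = Counter()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = METRIC_NAME.match(line)
        if match:
            counts[match.group(0)] += 1
    return counts


def fetch_metrics(node_ip: str, port: int = 9100) -> str:
    """Download the /metrics page exposed through the DaemonSet's hostPort."""
    url = "http://%s:%d/metrics" % (node_ip, port)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    names = metric_names(fetch_metrics("192.0.2.10"))
    node_series = {n: c for n, c in names.items() if n.startswith("node_")}
    print(len(node_series), "node_* metric names exposed")
```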
alertmanager-cm.yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: kube-system
data:
  alertmanager.yml: |-
    global:
      smtp_smarthost: "smtp.exmail.qq.com:465"
      smtp_from: "xin.liu@woqutech.com"
      smtp_auth_username: "xin.liu@woqutech.com"
      smtp_auth_password: "xxxxxxxxxxxx"
      smtp_require_tls: false
    route:
      group_by: [alertname]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 10m
      receiver: default-receiver
    receivers:
    - name: "default-receiver"
      email_configs:
      - to: "1148576125@qq.com"
prometheus-rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
  namespace: kube-system
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: kube-system
prometheus-cm.yaml
kind: ConfigMap
apiVersion: v1
data:
  prometheus.yml: |
    rule_files:
    - /etc/prometheus/rules.yml
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["localhost:9093"]
    scrape_configs:
    - job_name: "node"
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_ip]
        action: replace
        target_label: __address__
        replacement: $1:9100
      - source_labels: [__meta_kubernetes_pod_host_ip]
        action: replace
        target_label: instance
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: node_name
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(name)
      - source_labels: [__meta_kubernetes_pod_label_name]
        regex: node_exporter
        action: keep
  rules.yml: |
    groups:
    - name: example
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
      - alert: APIHighRequestLatency
        expr: api_http_request_latencies_second{quantile="0.5"} > 1
        for: 10m
        annotations:
          summary: "High request latency on {{ $labels.instance }}"
          description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
metadata:
  name: prometheus-config-v0.1.0
  namespace: kube-system
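The `relabel_configs` of the `node` job narrow pod discovery down to one node_exporter target per node. They rewrite the scrape address to `<pod IP>:9100` and turn the node's host IP into the `instance` label. They copy the node name into `node_name` and the Pod's `name` label into `name`. Finally, they drop every Pod not labeled `name=node_exporter`. The Python sketch below replays those five rules on a discovered Pod's meta labels. It implements only the `replace`, `labelmap` and `keep` actions as they are used here: default `;` separator, fully anchored regexes and `$1` substitution. Treat it as an illustration rather than Prometheus's relabeling engine. The Pod IPs and node name in the example are made up.

```python
import re

# The five relabel rules of the "node" job, transcribed from prometheus.yml.
NODE_JOB_RELABEL = [
    {"source_labels": ["__meta_kubernetes_pod_ip"], "action": "replace",
     "target_label": "__address__", "replacement": "$1:9100"},
    {"source_labels": ["__meta_kubernetes_pod_host_ip"], "action": "replace",
     "target_label": "instance"},
    {"source_labels": ["__meta_kubernetes_pod_node_name"], "action": "replace",
     "target_label": "node_name"},
    {"action": "labelmap", "regex": "__meta_kubernetes_pod_label_(name)"},
    {"source_labels": ["__meta_kubernetes_pod_label_name"], "regex": "node_exporter",
     "action": "keep"},
]


def _expand(match, replacement):
    # Prometheus writes capture groups as $1; Python's Match.expand wants \1.
    return match.expand(replacement.replace("$", "\\"))


def relabel(labels, configs):
    """Apply replace/labelmap/keep rules; return None if the target is dropped."""
    labels = dict(labels)
    for cfg in configs:
        action = cfg.get("action", "replace")
        regex = re.compile(cfg.get("regex", "(.*)"))
        replacement = cfg.get("replacement", "$1")
        if action == "labelmap":
            # Copy every label whose *name* matches, renamed via the replacement.
            for name, value in list(labels.items()):
                m = regex.fullmatch(name)
                if m:
                    labels[_expand(m, replacement)] = value
            continue
        # Source label values are joined with the default separator ";".
        value = ";".join(labels.get(s, "") for s in cfg.get("source_labels", []))
        m = regex.fullmatch(value)  # relabel regexes are anchored at both ends
        if action == "keep" and not m:
            return None
        if action == "replace" and m:
            labels[cfg["target_label"]] = _expand(m, replacement)
    return labels


def scrape_target(labels):
    """Split the result into the scrape address and the labels kept on series."""
    address = labels["__address__"]
    public = {k: v for k, v in labels.items() if not k.startswith("__")}
    return address, public


if __name__ == "__main__":
    exporter_pod = {
        "__address__": "10.244.1.5:9100",
        "__meta_kubernetes_pod_ip": "10.244.1.5",
        "__meta_kubernetes_pod_host_ip": "192.0.2.11",
        "__meta_kubernetes_pod_node_name": "node-1",
        "__meta_kubernetes_pod_label_name": "node_exporter",
    }
    prometheus_pod = {
        "__address__": "10.244.1.6:9090",
        "__meta_kubernetes_pod_ip": "10.244.1.6",
        "__meta_kubernetes_pod_host_ip": "192.0.2.11",
        "__meta_kubernetes_pod_node_name": "node-1",
        "__meta_kubernetes_pod_label_app": "prometheus",
    }
    print(scrape_target(relabel(exporter_pod, NODE_JOB_RELABEL)))
    print(relabel(prometheus_pod, NODE_JOB_RELABEL))  # dropped by the keep rule
```

The exporter Pod comes out as target `10.244.1.5:9100` with `instance`, `node_name` and `name` labels. The Prometheus Pod is dropped because it has no `name=node_exporter` label. Because `instance` holds the node's host IP, the InstanceDown alert reports which node went down.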
prometheus.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  namespace: kube-system
  name: prometheus
  labels:
    name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      name: prometheus
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      nodeSelector:
        node-role.kubernetes.io/master: ""
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      securityContext:
        runAsUser: 0
        fsGroup: 0
      containers:
      - name: prometheus
        image: prom/prometheus:v2.4.0
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        ports:
        - name: web
          containerPort: 9090
        volumeMounts:
        - name: prometheus-config
          mountPath: /etc/prometheus
        - name: prometheus-storage
          mountPath: /prometheus
        - name: localtime
          mountPath: /etc/localtime
      - name: alertmanager
        image: prom/alertmanager:v0.14.0
        args:
        - "--config.file=/etc/alertmanager/alertmanager.yml"
        - "--log.level=debug"
        ports:
        - containerPort: 9093
          protocol: TCP
          name: alertmanager
        volumeMounts:
        - name: alertmanager-config
          mountPath: /etc/alertmanager
        - name: alertmanager-storage
          mountPath: /alertmanager
        - name: localtime
          mountPath: /etc/localtime
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config-v0.1.0
      - name: alertmanager-config
        configMap:
          name: alertmanager
      - name: localtime
        hostPath:
          path: /usr/share/zoneinfo/Asia/Shanghai
      - name: prometheus-storage
        hostPath:
          path: /gaea/prometheus
          type: DirectoryOrCreate
      - name: alertmanager-storage
        hostPath:
          path: /gaea/alertmanager
          type: DirectoryOrCreate
---
apiVersion: v1
kind: Service
metadata:
  labels:
    name: prometheus
    kubernetes.io/cluster-service: "true"
  name: prometheus
  namespace: kube-system
spec:
  ports:
  - name: prometheus
    nodePort: 30065
    port: 9090
    protocol: TCP
    targetPort: 9090
  selector:
    app: prometheus
  sessionAffinity: None
  type: NodePort
---
apiVersion: v1
kind: Service
metadata:
  labels:
    name: prometheus
    kubernetes.io/cluster-service: "true"
  name: alertmanager
  namespace: kube-system
spec:
  ports:
  - name: alertmanager
    nodePort: 30066
    port: 9093
    protocol: TCP
    targetPort: 9093
  selector:
    app: prometheus
  sessionAffinity: None
  type: NodePort