Container Monitoring in Practice: Prometheus Configuration and Service Discovery

longshengwang, published 2019-07-01 17:36

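This article walks through common Prometheus configuration and service discovery in four parts: overview, configuration in detail, service discovery, and common scenarios.

I. Overview

Prometheus can be configured through command-line flags or a configuration file. Inside a Kubernetes cluster, the configuration usually lives in a ConfigMap. (Everything below refers to Prometheus 2.7.)

To list the available command-line flags, run ./prometheus -h

The configuration file is specified with the --config.file flag and is usually named prometheus.yml

If the configuration changes, for example when a scrape job is added, Prometheus can reload it at runtime. Either send SIGHUP to the Prometheus process, or send an HTTP POST request to the /-/reload endpoint (the HTTP endpoint is only available when Prometheus is started with --web.enable-lifecycle). For example: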

curl -X POST http://localhost:9090/-/reload
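
The two reload paths can also be scripted. The following is a minimal Python sketch using only the standard library. The function names are made up for this illustration, and the base URL assumes a Prometheus server listening on localhost:9090 as in the curl command.

```python
import os
import signal
import urllib.request


def build_reload_request(base_url="http://localhost:9090"):
    """Return a POST request for the /-/reload endpoint (not sent yet)."""
    url = base_url.rstrip("/") + "/-/reload"
    # An empty body plus method="POST" mirrors `curl -X POST`.
    return urllib.request.Request(url, data=b"", method="POST")


def reload_over_http(base_url="http://localhost:9090", timeout=5.0):
    """Send the reload request and return the HTTP status code."""
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def reload_with_sighup(pid):
    """Ask a local Prometheus process to reload by sending SIGHUP (Unix only)."""
    os.kill(pid, signal.SIGHUP)
```

Building the request separately from sending it makes the helper easy to inspect without a running server.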

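II. Configuration in Detail

2.1 Command-line flags

Running ./prometheus -h shows what each flag means, for example:

--web.listen-address="0.0.0.0:9090"   Listen address; the default port is 9090. It can be restricted to local access only, or moved to a different port for safety (the built-in web server has no authentication)

--web.max-connections=512  Maximum number of simultaneous connections; default 512

--storage.tsdb.path="data/"  Storage path; defaults to the data/ directory

--storage.tsdb.retention.time=15d  Data retention time; default 15 days. The older storage.tsdb.retention flag is deprecated

--alertmanager.timeout=10s  Timeout for sending alerts to Alertmanager: 10s

--query.timeout=2m  Query timeout; default 2 minutes, after which the query is killed automatically. It can be aligned with Grafana's timeout setting, e.g. 60s

--query.max-concurrency=20  Number of concurrent queries. One of Prometheus's own default metrics, prometheus_engine_queries_concurrent_max, exposes the maximum query concurrency and helps when checking query load

--log.level=info  Log level, one of four: [debug, info, warn, error]. When debugging, switch to debug first

.....

On the Prometheus web page, Status > Command-Line Flags shows the flags currently in effect, for example those set by prometheus-operator.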

2.2 prometheus.yml

The Prometheus binary release from the official download page ships with a default configuration file, prometheus.yml:

-rw-r--r--@ LICENSE
-rw-r--r--@ NOTICE
drwxr-xr-x@ console_libraries
drwxr-xr-x@ consoles
-rwxr-xr-x@ prometheus
-rw-r--r--@ prometheus.yml
-rwxr-xr-x@ promtool

prometheus.yml covers many settings, including remote storage and alerting. The main ones are explained below:

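# Default global settings
global:
  scrape_interval:     15s # scrape every 15s; the default is 1m
  evaluation_interval: 15s # evaluate rules every 15s; the default is 1m
  scrape_timeout: 10s # scrape timeout; the default is 10s
  external_labels:  # labels attached when talking to external systems, such as remote storage or federation
    prometheus: monitoring/k8s  # e.g. the values set by prometheus-operator
    prometheus_replica: prometheus-k8s-1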

# Alertmanager settings
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 127.0.0.1:9093  # Alertmanager address, e.g. 127.0.0.1:9093
  alert_relabel_configs: # rewrite the labels of alerts before they are sent to Alertmanager
  - separator: ;
    regex: prometheus_replica
    replacement: $1
    action: labeldrop

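# Once rule files are loaded, they are evaluated every evaluation_interval (15s here); several rule files may be listed
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# scrape_configs holds the scrape settings and contains at least one job

scrape_configs:
  # Prometheus monitoring itself; the label job=<job_name> is attached to every scraped time series
  - job_name: "prometheus"
    # the default metrics path is /metrics, e.g. localhost:9090/metrics
    # the default scheme is http
    static_configs:
    - targets: ["localhost:9090"]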

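# Remote write (optional): send samples to a remote store such as InfluxDB; by default data is read and written locally
remote_write:
  - url: http://127.0.0.1:8090  # replace with the write endpoint URL of the remote store

# Remote read (optional)
remote_read:
  - url: http://127.0.0.1:8090  # replace with the read endpoint URL of the remote store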
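
One constraint hidden in the global block is that Prometheus rejects a configuration whose scrape_timeout is greater than its scrape_interval. The Python sketch below checks that relation for single-unit durations such as 15s, 1m, or 15d. The helper names are hypothetical and the parser is a simplification written for this article, not the one Prometheus uses.

```python
import re

# Seconds per unit for single-unit durations such as 15s, 1m, 15d.
_UNIT_SECONDS = {
    "ms": 0.001,
    "s": 1,
    "m": 60,
    "h": 3600,
    "d": 86400,
    "w": 7 * 86400,
    "y": 365 * 86400,
}
_DURATION = re.compile(r"(\d+)(ms|s|m|h|d|w|y)")


def parse_duration(text):
    """Convert a duration string like '15s' into seconds."""
    match = _DURATION.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def check_scrape_timing(scrape_interval="1m", scrape_timeout="10s"):
    """Raise ValueError when the timeout exceeds the interval."""
    if parse_duration(scrape_timeout) > parse_duration(scrape_interval):
        raise ValueError("scrape_timeout must not be greater than scrape_interval")
```

With the values from the example file, check_scrape_timing("15s", "10s") passes, while swapping the two values raises an error.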

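2.3 scrape_configs

In the Prometheus configuration, scrape_configs is the part used most often, for example to add new monitoring targets or to change the address or scrape frequency of existing ones.

The simplest configuration is: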

scrape_configs:
- job_name: prometheus
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - localhost:9090

The full set of options is as follows (with the values recommended by prometheus-operator):

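# The job name appears as a label on the metric data, e.g. data scraped from node-exporter carries job=node-exporter
job_name: node-exporter

# Scrape interval: 30s
scrape_interval: 30s

# Scrape timeout: 10s
scrape_timeout: 10s

# Metrics path on the target
metrics_path: /metrics

# Scheme: http or https
scheme: https

# Optional URL parameters for the scrape request (each value is a list)
params:
  name: [demo]

# How to handle conflicts between labels attached by Prometheus and labels already present in the scraped data; with false (the default), the conflicting scraped label is renamed to exported_xx
honor_labels: false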


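# Credentials for targets that require authentication
# (password and password_file are mutually exclusive, so only one is enabled here)
basic_auth:
  username: admin
  password: admin
  # password_file: /etc/pwd

# Bearer token, or a file that contains it
# (bearer_token and bearer_token_file are mutually exclusive; a real job uses either basic_auth or a bearer token, not both)
# bearer_token: kferkhjktdgjwkgkrwg
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

# TLS settings, e.g. skipping verification or supplying certificate files
tls_config:
  # insecure_skip_verify: true
  ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  server_name: kubernetes
  insecure_skip_verify: false

# Proxy URL
proxy_url: http://127.9.9.0:9999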

# Azure service discovery
azure_sd_configs:

# Consul service discovery
consul_sd_configs:

# DNS service discovery
dns_sd_configs:

# EC2 service discovery
ec2_sd_configs:

# OpenStack service discovery
openstack_sd_configs:

# File-based service discovery
file_sd_configs:

# GCE service discovery
gce_sd_configs:

# Marathon service discovery
marathon_sd_configs:

# Airbnb Nerve service discovery
nerve_sd_configs:

# Zookeeper Serverset service discovery
serverset_sd_configs:

# Triton service discovery
triton_sd_configs:

# Kubernetes service discovery
kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - monitoring

# Static settings for scrape targets, e.g. attaching specific labels
static_configs:
  - targets: ["localhost:9090", "localhost:9191"]
    labels:
      my:   label
      your: label

# Before Prometheus scrapes, rewrite label values dynamically from the target's metadata,
# e.g. copy the raw __meta_kubernetes_namespace into a plain namespace label

relabel_configs:
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: pod
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: ${1}
    action: replace
  - separator: ;
    regex: (.*)
    target_label: endpoint
    replacement: web
    action: replace

# Metric relabeling, e.g. dropping metrics that are not needed
metric_relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: etcd_(debugging|disk|request|server).*
    replacement: $1
    action: drop
   
# Upper limit on the number of samples per scrape; the scrape fails if it is exceeded. The default 0 means no limit
sample_limit: 0
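
To make the replace rules in relabel_configs above more concrete, here is a minimal Python sketch of how a single action: replace rule is evaluated. The source label values are joined with the separator, the regex must match the whole joined string, and the replacement with its $1 or ${1} references is written to target_label. The function names are hypothetical, Python's re module stands in for the regex engine Prometheus uses, and details such as $$ escaping are left out.

```python
import re

# Matches ${name} or $name references inside a replacement template.
_REFERENCE = re.compile(r"\$\{(\w+)\}|\$(\w+)")


def expand(template, match):
    """Expand $1 / ${1} style references using the groups of `match`."""
    def substitute(ref):
        name = ref.group(1) or ref.group(2)
        try:
            value = match.group(int(name) if name.isdigit() else name)
        except IndexError:
            return ""  # a reference to a missing group expands to nothing
        return value or ""
    return _REFERENCE.sub(substitute, template)


def relabel_replace(labels, source_labels, target_label,
                    regex="(.*)", replacement="$1", separator=";"):
    """Apply one `action: replace` rule and return a new label dict."""
    joined = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, joined)  # the regex is anchored at both ends
    result = dict(labels)
    if match is None:
        return result  # no match: labels stay unchanged
    value = expand(replacement, match)
    if value:
        result[target_label] = value
    else:
        result.pop(target_label, None)  # an empty result removes the label
    return result


# Sample metadata for one discovered target (values are made up).
target = {
    "__meta_kubernetes_namespace": "monitoring",
    "__meta_kubernetes_service_name": "node-exporter",
}
target = relabel_replace(target, ["__meta_kubernetes_namespace"], "namespace")
target = relabel_replace(target, ["__meta_kubernetes_service_name"], "job",
                         replacement="${1}")
target = relabel_replace(target, [], "endpoint", replacement="web")
```

The last call mirrors the rule without source_labels: the joined value is empty, (.*) still matches, and the fixed replacement web is written to the endpoint label.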

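III. Service Discovery

The configuration above contains many *_sd_configs entries, such as kubernetes_sd_configs. These are the scrape settings used for service discovery.

Supported service discovery types: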

// prometheus/discovery/config/config.go
type ServiceDiscoveryConfig struct {
    StaticConfigs []*targetgroup.Group `yaml:"static_configs,omitempty"`
    DNSSDConfigs []*dns.SDConfig `yaml:"dns_sd_configs,omitempty"`
    FileSDConfigs []*file.SDConfig `yaml:"file_sd_configs,omitempty"`
    ConsulSDConfigs []*consul.SDConfig `yaml:"consul_sd_configs,omitempty"`
    ServersetSDConfigs []*zookeeper.ServersetSDConfig `yaml:"serverset_sd_configs,omitempty"`
    NerveSDConfigs []*zookeeper.NerveSDConfig `yaml:"nerve_sd_configs,omitempty"`
    MarathonSDConfigs []*marathon.SDConfig `yaml:"marathon_sd_configs,omitempty"`
    KubernetesSDConfigs []*kubernetes.SDConfig `yaml:"kubernetes_sd_configs,omitempty"`
    GCESDConfigs []*gce.SDConfig `yaml:"gce_sd_configs,omitempty"`
    EC2SDConfigs []*ec2.SDConfig `yaml:"ec2_sd_configs,omitempty"`
    OpenstackSDConfigs []*openstack.SDConfig `yaml:"openstack_sd_configs,omitempty"`
    AzureSDConfigs []*azure.SDConfig `yaml:"azure_sd_configs,omitempty"`
    TritonSDConfigs []*triton.SDConfig `yaml:"triton_sd_configs,omitempty"`
}

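Because Prometheus pulls monitoring data, the server side has to decide which targets to scrape, namely the jobs configured under scrape_configs. The main drawback of the pull model is that it cannot notice newly added services on its own, so most pull-based monitoring systems support service discovery by default: new endpoints in the cluster are discovered automatically and added to the configuration.

Prometheus supports many service discovery mechanisms: file, DNS, Consul, Kubernetes, OpenStack, EC2, and more. The discovery process is not complicated: through an interface provided by a third party, Prometheus obtains the list of targets to monitor, then polls those targets for monitoring data.

For Kubernetes, Prometheus talks to the Kubernetes API and then polls the resource endpoints. Five discovery modes are currently supported: Node, Service, Pod, Endpoints, and Ingress. They correspond to the role field in the configuration file, e.g. role: node or role: service.

For example, to discover all nodes dynamically, add the following configuration: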

- job_name: kubernetes-nodes
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - api_server: null
    role: node
    namespaces:
      names: []
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  relabel_configs:
  - separator: ;
    regex: __meta_kubernetes_node_label_(.+)
    replacement: $1
    action: labelmap
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: kubernetes.default.svc:443
    action: replace
  - source_labels: [__meta_kubernetes_node_name]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics
    action: replace

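The discovered nodes then show up on the Targets page.

Services and pods are discovered in the same way.

Note that for Prometheus to be able to access the Kubernetes API, it must be granted access through a ServiceAccount. Otherwise, even with the configuration in place, it has no permission to fetch anything.

The Prometheus permission setup is a ClusterRole + ClusterRoleBinding + ServiceAccount set, with the ServiceAccount then referenced from the Deployment or StatefulSet.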

ClusterRole.yaml

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  namespace: kube-system
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - nodes/proxy
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs: ["get", "list", "watch"]
- apiGroups: ["extensions"]
  resources:
  - daemonsets
  - deployments
  - replicasets
  - ingresses
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources:
  - daemonsets
  - deployments
  - replicasets
  - statefulsets
  verbs: ["get", "list", "watch"]
- apiGroups: ["batch"]
  resources:
  - cronjobs
  - jobs
  verbs: ["get", "list", "watch"]
- apiGroups: ["autoscaling"]
  resources:
  - horizontalpodautoscalers
  verbs: ["get", "list", "watch"]
- apiGroups: ["policy"]
  resources:
  - poddisruptionbudgets
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]

ClusterRoleBinding.yaml

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  namespace: kube-system
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: kube-system

ServiceAccount.yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: kube-system
  name: prometheus

prometheus.yaml

....
spec:
  serviceAccountName: prometheus

....

The complete Kubernetes configuration is as follows:

- job_name: kubernetes-apiservers
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - api_server: null
    role: endpoints
    namespaces:
      names: []
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: default;kubernetes;https
    replacement: $1
    action: keep
- job_name: kubernetes-nodes
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - api_server: null
    role: node
    namespaces:
      names: []
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  relabel_configs:
  - separator: ;
    regex: __meta_kubernetes_node_label_(.+)
    replacement: $1
    action: labelmap
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: kubernetes.default.svc:443
    action: replace
  - source_labels: [__meta_kubernetes_node_name]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics
    action: replace
- job_name: kubernetes-cadvisor
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - api_server: null
    role: node
    namespaces:
      names: []
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: false
  relabel_configs:
  - separator: ;
    regex: __meta_kubernetes_node_label_(.+)
    replacement: $1
    action: labelmap
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: kubernetes.default.svc:443
    action: replace
  - source_labels: [__meta_kubernetes_node_name]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    action: replace
- job_name: kubernetes-service-endpoints
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: endpoints
    namespaces:
      names: []
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    separator: ;
    regex: "true"
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    separator: ;
    regex: (https?)
    target_label: __scheme__
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: $1
    action: replace
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    separator: ;
    regex: ([^:]+)(?::\d+)?;(\d+)
    target_label: __address__
    replacement: $1:$2
    action: replace
  - separator: ;
    regex: __meta_kubernetes_service_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: kubernetes_namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: kubernetes_name
    replacement: $1
    action: replace

Once the configuration is applied, the corresponding targets appear on the Targets page.
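
The __address__ rule in the kubernetes-service-endpoints job is the least obvious one, so here is a small Python sketch of what it does, with the digit class written as \d. The discovered address and the value of the prometheus.io/port annotation are joined with ";", any port already on the address is discarded, and the annotated port is appended. The function name and the sample addresses are assumptions for illustration, and Python's re module stands in for the regex engine Prometheus uses.

```python
import re

# Same pattern as the relabel rule: host, optional ":port", then ";" and the annotated port.
ADDRESS_RULE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address, port_annotation):
    """Return the address Prometheus would scrape after the replace rule."""
    joined = f"{address};{port_annotation}"
    match = ADDRESS_RULE.fullmatch(joined)
    if match is None:
        return address  # no match (e.g. no port annotation): keep the original
    return f"{match.group(1)}:{match.group(2)}"
```

For instance, rewrite_address("10.0.0.5:8080", "9100") yields 10.0.0.5:9100, while an empty annotation leaves the address untouched.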

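IV. Common Scenarios

1. Collect information from each node in the cluster and group it by availability zone or region

When node data is collected with the Kubernetes role: node, the node's zone is available through the "__meta_kubernetes_node_label_failure_domain_beta_kubernetes_io_zone" label. It comes from a label attached to the node when the cluster is created, which kubectl describe node will show.

A new label can then be defined with relabel_configs:

relabel_configs:
- source_labels:  ["__meta_kubernetes_node_label_failure_domain_beta_kubernetes_io_zone"]
  regex: "(.*)"
  replacement: $1
  action: replace
  target_label: "zone"

After that, a selector such as node{zone="XX"} filters by zone directly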
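
The long meta label name follows from how node labels are exposed: each Kubernetes node label becomes __meta_kubernetes_node_label_<labelname>, with characters that are not valid in a Prometheus label name replaced by underscores. The Python sketch below reproduces that mapping for the zone label failure-domain.beta.kubernetes.io/zone. The function name is hypothetical.

```python
import re

# Characters outside [a-zA-Z0-9_] are not allowed in Prometheus label names.
_INVALID_CHARS = re.compile(r"[^a-zA-Z0-9_]")


def node_label_to_meta(label_key):
    """Map a Kubernetes node label key to the meta label seen during relabeling."""
    return "__meta_kubernetes_node_label_" + _INVALID_CHARS.sub("_", label_key)


zone_meta_label = node_label_to_meta("failure-domain.beta.kubernetes.io/zone")
```

The same helper shows which source_labels entry to use for any other node label printed by kubectl describe node.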

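2. Filter data, or manage monitoring by role (R&D, operations)

People in different roles (development, testing, operations) may only care about part of the monitoring data, and may each deploy their own Prometheus server for the metrics they care about. Unneeded data should be filtered out to avoid wasting resources, with a configuration like this: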

metric_relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: etcd_(debugging|disk|request|server).*
    replacement: $1
    action: drop

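action: drop discards the metrics that match the condition. They are still returned by the target, but they are dropped before being stored.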
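
As a quick way to see which series such a rule removes, the Python sketch below applies the same pattern to a list of metric names. The regex has to match the entire name, as in relabeling. The function name and the sample metric names are only illustrative.

```python
import re

# Pattern from the metric_relabel_configs rule above.
DROP_RULE = re.compile(r"etcd_(debugging|disk|request|server).*")


def apply_drop(metric_names):
    """Return the metric names that survive the drop rule."""
    return [name for name in metric_names if DROP_RULE.fullmatch(name) is None]


kept = apply_drop([
    "etcd_disk_wal_fsync_duration_seconds_bucket",  # matches, dropped
    "etcd_network_peer_round_trip_time_seconds",    # no match, kept
    "up",                                           # no match, kept
])
```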

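3. Build a Prometheus federation to manage monitoring instances across IDCs (regions)

If there are multiple regions, each with many nodes or clusters, Prometheus's built-in federation can be used: each region deploys its own Prometheus server instance that scrapes its own region's data, and a central server then scrapes the data of all regions for unified display, grouped by region.

Configuration: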

scrape_configs:
  - job_name: "federate"
    scrape_interval: 15s
    honor_labels: true
    metrics_path: "/federate"
    params:
      "match[]":
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
        - '{__name__=~"node.*"}'
    static_configs:
      - targets:
        - "192.168.77.11:9090"
        - "192.168.77.12:9090"
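
With this job, the central server requests /federate on each regional server and passes every selector as a repeated match[] query parameter. The Python sketch below builds such a URL with the standard library, which can be handy for testing a selector by hand. The function names are hypothetical, and the target address is taken from the example above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(target, matchers, scheme="http"):
    """Build the /federate URL with one match[] parameter per selector."""
    query = urlencode([("match[]", matcher) for matcher in matchers])
    return f"{scheme}://{target}/federate?{query}"


def matchers_from_url(url):
    """Decode the match[] selectors from a federate URL."""
    return parse_qs(urlsplit(url).query).get("match[]", [])


selectors = ['{job="prometheus"}', '{__name__=~"job:.*"}', '{__name__=~"node.*"}']
url = federate_url("192.168.77.11:9090", selectors)
```

The selectors contain braces, quotes, and brackets, so letting urlencode handle the percent-encoding avoids quoting mistakes.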

This article is part of the Container Monitoring in Practice series; the full content is at: container-monitor-book
