AlertManager Deployment, Monitoring, and Alerting Rules

一、Deploying Alertmanager in K8S

1.1、First, let's look at the alerting workflow:

(figure: alerting workflow diagram)

It roughly includes the following steps:

  • 1、Deploy Alertmanager
  • 2、Configure Prometheus to communicate with Alertmanager (see the sketch after this list)
  • 3、Configure the alerts
  • 4、Point Prometheus at a rules directory
  • 5、Store the alerting rules in a ConfigMap
  • 6、Mount the ConfigMap into the container's rules directory
  • 7、Add the Alertmanager alerting configuration
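
Steps 2, 4 and 6 are changes on the Prometheus side rather than in the manifests below. As a rough sketch only (the /etc/config paths and the volume/mount names are assumptions that must match your own Prometheus deployment; the target matches the alertmanager Service created in step 5, with Prometheus running in the same kube-system namespace), the Prometheus configuration would gain an alerting block and a rule_files entry, and the pod spec would mount the rules ConfigMap at that directory:

# prometheus.yml fragment (sketch only)
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["alertmanager:80"]   # alertmanager Service in kube-system, created in step 5
rule_files:
- /etc/config/rules/*.rules          # directory where the prometheus-rules ConfigMap is mounted

# Prometheus pod spec fragment (sketch only) -- mount the rules ConfigMap at that path
    volumeMounts:
    - name: prometheus-rules
      mountPath: /etc/config/rules
  volumes:
  - name: prometheus-rules
    configMap:
      name: prometheus-rules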

First, prepare a sender mailbox and enable SMTP sending on it.

1、Store the alerting rules in a ConfigMap. Write a YAML file with the alerting rules, adjusting and adding rules to suit your own environment. This is where Prometheus is more work than Zabbix: every alerting rule has to be defined by hand:

[root@k8s-master prometheus-k8s]# vim prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: kube-system
data:
  general.rules: |
    groups:
    - name: general.rules
      rules:
      - alert: 采集器挂了
        expr: up == 0
        for: 1m
        labels:
          severity: error
        annotations:
          summary: "Instance {{ $labels.instance }} 采集器停止工作"
          description: "{{ $labels.instance }} job {{ $labels.job }} 已经停止1分钟以上."
  node.rules: |
    groups:
    - name: node.rules
      rules:
      - alert: 磁盘剩余空间过低
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} : {{ $labels.mountpoint }} 分区使用率过高"
          description: "{{ $labels.instance }}: {{ $labels.mountpoint }} 分区使用大于80% (当前值: {{ $value }})"

      - alert: 内存炸了
        expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} 内存使用率过高"
          description: "{{ $labels.instance }}内存使用大于80% (当前值: {{ $value }})"

      - alert: CPU炸了
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} CPU使用率过高"
          description: "{{ $labels.instance }}CPU使用大于60% (当前值: {{ $value }})"
[root@k8s-master prometheus-k8s]# kubectl apply -f prometheus-rules.yaml
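
The rule syntax can optionally be sanity-checked offline with promtool, the checking tool shipped with Prometheus. This assumes promtool is installed on the machine where you run it; the /tmp path is just an example:

[root@k8s-master prometheus-k8s]# kubectl get configmap prometheus-rules -n kube-system -o jsonpath='{.data.node\.rules}' > /tmp/node.rules
[root@k8s-master prometheus-k8s]# promtool check rules /tmp/node.rules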

2、Write and deploy the alerting ConfigMap YAML, adding the Alertmanager alerting configuration and setting the sender mailbox:

[root@k8s-master prometheus-k8s]# vim alertmanager-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 1m
      smtp_smarthost: 'xxx.com.cn:25'  # SMTP server address, check your mail provider's settings
      smtp_from: 'xxx@163.com' # set to the sender mailbox you prepared
      smtp_auth_username: 'xxx@163.com'
      smtp_auth_password: 'xxxxx'

    receivers:
    - name: default-receiver
      email_configs:
      - to: "XXX@XXX"   # recipient

    route:
      group_interval: 1m
      group_wait: 10s
      receiver: default-receiver
      repeat_interval: 1m

[root@k8s-master prometheus-k8s]# kubectl apply -f alertmanager-configmap.yaml
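
Similarly, the Alertmanager configuration can optionally be validated with amtool, which ships with Alertmanager; this assumes amtool is available locally:

[root@k8s-master prometheus-k8s]# kubectl get configmap alertmanager-config -n kube-system -o jsonpath='{.data.alertmanager\.yml}' > /tmp/alertmanager.yml
[root@k8s-master prometheus-k8s]# amtool check-config /tmp/alertmanager.yml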

3、Create a PVC for data persistence. My YAML uses the same storage class that was used for dynamic provisioning when installing Prometheus, managed-nfs-storage; change it to match your own environment:

[root@k8s-master prometheus-k8s]# vim alertmanager-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
spec:
  storageClassName: managed-nfs-storage
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: "2Gi"

[root@k8s-master prometheus-k8s]# kubectl apply -f alertmanager-pvc.yaml
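
If the managed-nfs-storage provisioner is working, the claim should become Bound within a few seconds, which can be confirmed with:

[root@k8s-master prometheus-k8s]# kubectl get pvc alertmanager -n kube-system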

4、Write the Deployment YAML to deploy the Alertmanager pod:

[root@k8s-master prometheus-k8s]# vim alertmanager-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    k8s-app: alertmanager
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v0.14.0
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: alertmanager
      version: v0.14.0
  template:
    metadata:
      labels:
        k8s-app: alertmanager
        version: v0.14.0
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      priorityClassName: system-cluster-critical
      containers:
        - name: prometheus-alertmanager
          image: "prom/alertmanager:v0.14.0"
          imagePullPolicy: "IfNotPresent"
          args:
            - --config.file=/etc/config/alertmanager.yml
            - --storage.path=/data
            - --web.external-url=/
          ports:
            - containerPort: 9093
          readinessProbe:
            httpGet:
              path: /#/status
              port: 9093
            initialDelaySeconds: 30
            timeoutSeconds: 30
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
            - name: storage-volume
              mountPath: "/data"
              subPath: ""
          resources:
            limits:
              cpu: 10m
              memory: 50Mi
            requests:
              cpu: 10m
              memory: 50Mi
        - name: prometheus-alertmanager-configmap-reload
          image: "zhdya/configmap-reload:v0.1"
          imagePullPolicy: "IfNotPresent"
          args:
            - --volume-dir=/etc/config
            - --webhook-url=http://localhost:9093/-/reload
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
              readOnly: true
          resources:
            limits:
              cpu: 10m
              memory: 10Mi
            requests:
              cpu: 10m
              memory: 10Mi
      volumes:
        - name: config-volume
          configMap:
            name: alertmanager-config
        - name: storage-volume
          persistentVolumeClaim:
            claimName: alertmanager

[root@k8s-master prometheus-k8s]# kubectl apply -f alertmanager-deployment.yaml
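
The rollout can be watched until both containers in the pod (alertmanager and the configmap-reload sidecar) are ready:

[root@k8s-master prometheus-k8s]# kubectl rollout status deployment/alertmanager -n kube-system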

5、Create the Alertmanager Service to expose its port:

[root@k8s-master prometheus-k8s]# vim alertmanager-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "Alertmanager"
spec:
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: 9093
  selector:
    k8s-app: alertmanager
  type: "ClusterIP"

[root@k8s-master prometheus-k8s]# kubectl apply -f alertmanager-service.yaml
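
Because the Service type is ClusterIP, it is only reachable from inside the cluster. For a quick look at the Alertmanager UI from the master, kubectl port-forward can be used as a temporary convenience (not part of the manifests above); then open http://localhost:9093 in a browser or curl it from another terminal:

[root@k8s-master prometheus-k8s]# kubectl port-forward -n kube-system svc/alertmanager 9093:80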

6、Check the deployment status; you can see that pod/alertmanager-5d75d5688f-fmlq6 and service/alertmanager are running normally:

[root@k8s-master prometheus-k8s]# kubectl get pod,svc -n kube-system -o wide
NAME                                        READY   STATUS    RESTARTS   AGE   IP            NODE             NOMINATED NODE   READINESS GATES
pod/alertmanager-5d75d5688f-fmlq6           2/2     Running   4          10d   172.17.15.2   192.168.73.140   <none>           <none>
pod/grafana-0                               1/1     Running   3          16d   172.17.31.2   192.168.73.139   <none>           <none>
pod/kube-state-metrics-7c76bdbf68-hv56m     2/2     Running   0          23h   172.17.15.3   192.168.73.140   <none>           <none>
pod/prometheus-0                            2/2     Running   8          16d   172.17.83.2   192.168.73.135   <none>           <none>

NAME                           TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)             AGE   SELECTOR
service/alertmanager           ClusterIP   10.0.0.207   <none>        80/TCP              14d   k8s-app=alertmanager
service/grafana                NodePort    10.0.0.74    <none>        80:30091/TCP        16d   app=grafana
service/kube-state-metrics     ClusterIP   10.0.0.194   <none>        8080/TCP,8081/TCP   15d   k8s-app=kube-state-metrics
service/prometheus             NodePort    10.0.0.33    <none>        9090:30090/TCP      15d   k8s-app=prometheus

二、Testing Alert Delivery

Log in to the default Prometheus web UI and open the Alerts menu; there you can see the four alerting rules we just defined in prometheus-rules.yaml:

(screenshot: Prometheus Alerts page showing the four rules)

2.1、Alert test:

Because the rules define a 采集器挂了 ("collector down") alert, we can stop the kubelet on one of the Node servers to test whether an alert email is received:

# systemctl stop kubelet
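
While waiting, the alert's state (pending, then firing) can also be watched through the Prometheus HTTP API; with the NodePort Service shown in the earlier output this would be something like the following, where <node-ip> is any node's IP:

# curl -s http://<node-ip>:30090/api/v1/alerts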

After a short wait, refresh the alert rules page in the web UI again: the 采集器挂了 alert turns pink and shows 2 active.

(screenshot: the 采集器挂了 alert shown in pink with 2 active)

After the 1-minute for duration in the rule has passed, refresh the inbox configured as the recipient: a 采集器挂了 alert email has arrived. The send interval of the emails can be adjusted in alertmanager-configmap.yaml. Once the stopped kubelet is restored, no further alert emails will be received.

(screenshot: alert email received in the inbox)
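
As an extra check during the test, Alertmanager's own API can be queried (for example through the port-forward from step 5) to confirm that it received the alert from Prometheus before the email went out; the v0.14.0 image exposes API v1:

# curl -s http://localhost:9093/api/v1/alerts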