AlertManager Deployment, Monitoring, and Alerting Rules
1. Deploying Alertmanager in K8S
1.1 First, a look at the alerting workflow:

It roughly involves the following steps:
- 1. Deploy Alertmanager
- 2. Configure Prometheus to communicate with Alertmanager
- 3. Configure alerting
- 4. Point Prometheus at the rules directory
- 5. Store the alerting rules in a ConfigMap
- 6. Mount the ConfigMap into the container's rules directory
- 7. Add the Alertmanager alerting configuration
First, prepare a sender mailbox and enable SMTP sending on it.
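The Prometheus side of steps 2, 4 and 6 is not shown again in this article. For reference only, here is a minimal sketch of the prometheus.yml fragment involved; the alertmanager target assumes Prometheus runs in the same kube-system namespace as the Service created in step 5 below, and /etc/config/rules is only an assumed mount path that must match wherever you mount the rules ConfigMap in your own Prometheus deployment:
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:80']   # Alertmanager Service (port 80 -> 9093), assumed same namespace
rule_files:
  - /etc/config/rules/*.rules            # assumed mount path of the prometheus-rules ConfigMap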
1. Use a ConfigMap to store the alerting rules. Write the YAML file containing the alert rules; you can modify it and add rules according to your own situation. This is where Prometheus is more cumbersome than Zabbix: all alerting rules have to be defined by yourself:
[root@k8s-master prometheus-k8s]# vim prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: kube-system
data:
  general.rules: |
    groups:
    - name: general.rules
      rules:
      - alert: 采集器挂了
        expr: up == 0
        for: 1m
        labels:
          severity: error
        annotations:
          summary: "Instance {{ $labels.instance }} 采集器停止工作"
          description: "{{ $labels.instance }} job {{ $labels.job }} 已经停止1分钟以上."
  node.rules: |
    groups:
    - name: node.rules
      rules:
      - alert: 磁盘剩余空间过低
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} : {{ $labels.mountpoint }} 分区使用率过高"
          description: "{{ $labels.instance }}: {{ $labels.mountpoint }} 分区使用大于80% (当前值: {{ $value }})"
      - alert: 内存炸了
        expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} 内存使用率过高"
          description: "{{ $labels.instance }}内存使用大于80% (当前值: {{ $value }})"
      - alert: CPU炸了
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} CPU使用率过高"
          description: "{{ $labels.instance }}CPU使用大于60% (当前值: {{ $value }})"
[root@k8s-master prometheus-k8s]# kubectl apply -f prometheus-rules.yaml
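Before wiring the rules into Prometheus, it can help to verify that the ConfigMap exists and that the rule syntax is valid. A minimal sketch, assuming promtool (shipped with Prometheus) is available on the machine you run kubectl from:
[root@k8s-master prometheus-k8s]# kubectl get configmap prometheus-rules -n kube-system
[root@k8s-master prometheus-k8s]# kubectl get configmap prometheus-rules -n kube-system -o jsonpath='{.data.node\.rules}' > /tmp/node.rules
[root@k8s-master prometheus-k8s]# promtool check rules /tmp/node.rules
If promtool reports a parse error, fix the ConfigMap before reloading Prometheus.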
2. Write and deploy the Alertmanager ConfigMap YAML, adding the Alertmanager alerting configuration and setting the email sender address:
[root@k8s-master prometheus-k8s]# vim alertmanager-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 1m
      smtp_smarthost: 'xxx.com.cn:25'   # SMTP server; check your mail provider for the address
      smtp_from: 'xxx@163.com'          # set to the sender mailbox you prepared
      smtp_auth_username: 'xxx@163.com'
      smtp_auth_password: 'xxxxx'
    receivers:
    - name: default-receiver
      email_configs:
      - to: "XXX@XXX"                   # recipient
    route:
      group_interval: 1m
      group_wait: 10s
      receiver: default-receiver
      repeat_interval: 1m
[root@k8s-master prometheus-k8s]# kubectl apply -f alertmanager-configmap.yaml
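The alertmanager.yml embedded in this ConfigMap can be syntax-checked before Alertmanager loads it. A minimal sketch, assuming amtool (distributed with Alertmanager) is installed locally:
[root@k8s-master prometheus-k8s]# kubectl get configmap alertmanager-config -n kube-system -o jsonpath='{.data.alertmanager\.yml}' > /tmp/alertmanager.yml
[root@k8s-master prometheus-k8s]# amtool check-config /tmp/alertmanager.yml
If the file is valid, amtool reports the global settings, route, and receivers it found.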
3. Create a PVC for data persistence. This YAML uses the same storage class that was used for automatic provisioning when installing Prometheus, managed-nfs-storage; adjust it to your own environment:
[root@k8s-master prometheus-k8s]# vim alertmanager-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
spec:
  storageClassName: managed-nfs-storage
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: "2Gi"
[root@k8s-master prometheus-k8s]# kubectl apply -f alertmanager-pvc.yaml
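You can confirm that the claim was bound by the managed-nfs-storage provisioner:
[root@k8s-master prometheus-k8s]# kubectl get pvc alertmanager -n kube-system
The STATUS column should show Bound; if it stays Pending, check the NFS provisioner before continuing.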
4. Write the Deployment YAML to deploy the Alertmanager Pod:
[root@k8s-master prometheus-k8s]# vim alertmanager-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    k8s-app: alertmanager
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v0.14.0
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: alertmanager
      version: v0.14.0
  template:
    metadata:
      labels:
        k8s-app: alertmanager
        version: v0.14.0
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      priorityClassName: system-cluster-critical
      containers:
        - name: prometheus-alertmanager
          image: "prom/alertmanager:v0.14.0"
          imagePullPolicy: "IfNotPresent"
          args:
            - --config.file=/etc/config/alertmanager.yml
            - --storage.path=/data
            - --web.external-url=/
          ports:
            - containerPort: 9093
          readinessProbe:
            httpGet:
              path: /#/status
              port: 9093
            initialDelaySeconds: 30
            timeoutSeconds: 30
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
            - name: storage-volume
              mountPath: "/data"
              subPath: ""
          resources:
            limits:
              cpu: 10m
              memory: 50Mi
            requests:
              cpu: 10m
              memory: 50Mi
        - name: prometheus-alertmanager-configmap-reload
          image: "zhdya/configmap-reload:v0.1"
          imagePullPolicy: "IfNotPresent"
          args:
            - --volume-dir=/etc/config
            - --webhook-url=http://localhost:9093/-/reload
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
              readOnly: true
          resources:
            limits:
              cpu: 10m
              memory: 10Mi
            requests:
              cpu: 10m
              memory: 10Mi
      volumes:
        - name: config-volume
          configMap:
            name: alertmanager-config
        - name: storage-volume
          persistentVolumeClaim:
            claimName: alertmanager
[root@k8s-master prometheus-k8s]# kubectl apply -f alertmanager-deployment.yaml
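Wait for the rollout to finish and make sure both containers in the Pod become ready:
[root@k8s-master prometheus-k8s]# kubectl rollout status deployment/alertmanager -n kube-system
[root@k8s-master prometheus-k8s]# kubectl get pods -n kube-system -l k8s-app=alertmanager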
5. Create the Alertmanager Service to expose its port:
[root@k8s-master prometheus-k8s]# vim alertmanager-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "Alertmanager"
spec:
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: 9093
  selector:
    k8s-app: alertmanager
  type: "ClusterIP"
[root@k8s-master prometheus-k8s]# kubectl apply -f alertmanager-service.yaml
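Since the Service is of type ClusterIP, it is not directly reachable from outside the cluster. For a quick look at the Alertmanager UI you can, for example, port-forward the Service to your local machine:
[root@k8s-master prometheus-k8s]# kubectl port-forward -n kube-system svc/alertmanager 9093:80
Then open http://localhost:9093 in a browser; the Status page should show the loaded configuration.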
6. Check the deployment status; pod/alertmanager-5d75d5688f-fmlq6 and service/alertmanager are running normally:
[root@k8s-master prometheus-k8s]# kubectl get pod,svc -n kube-system -o wide
NAME                                      READY   STATUS    RESTARTS   AGE   IP            NODE             NOMINATED NODE   READINESS GATES
pod/alertmanager-5d75d5688f-fmlq6         2/2     Running   4          10d   172.17.15.2   192.168.73.140   <none>           <none>
pod/grafana-0                             1/1     Running   3          16d   172.17.31.2   192.168.73.139   <none>           <none>
pod/kube-state-metrics-7c76bdbf68-hv56m   2/2     Running   0          23h   172.17.15.3   192.168.73.140   <none>           <none>
pod/prometheus-0                          2/2     Running   8          16d   172.17.83.2   192.168.73.135   <none>           <none>

NAME                         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)             AGE   SELECTOR
service/alertmanager         ClusterIP   10.0.0.207   <none>        80/TCP              14d   k8s-app=alertmanager
service/grafana              NodePort    10.0.0.74    <none>        80:30091/TCP        16d   app=grafana
service/kube-state-metrics   ClusterIP   10.0.0.194   <none>        8080/TCP,8081/TCP   15d   k8s-app=kube-state-metrics
service/prometheus           NodePort    10.0.0.33    <none>        9090:30090/TCP      15d   k8s-app=prometheus
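If the Alertmanager Pod is not 2/2 Running, or alerts later fail to be delivered, the container logs are the first place to look:
[root@k8s-master prometheus-k8s]# kubectl logs -n kube-system deployment/alertmanager -c prometheus-alertmanager
[root@k8s-master prometheus-k8s]# kubectl logs -n kube-system deployment/alertmanager -c prometheus-alertmanager-configmap-reload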
2. Testing alert delivery
Log in to the default Prometheus web UI and open the Alerts menu; you can see the four alerting rules we just defined in prometheus-rules.yaml:

2.1 Alert test:
Since the rules define a 采集器挂了 (collector down) alert, we can stop kubelet on one of the Node servers to test whether an alert email is actually sent:
# systemctl stop kubelet
Wait a little while and then refresh the alerting rules page in the web UI again; the 采集器挂了 rule turns pink and shows 2 active.

After waiting the 1 minute defined in the rule, refresh the inbox of the recipient configured earlier: a 采集器挂了 alert email has arrived. The sending interval of the emails can be adjusted in the alertmanager-configmap.yaml configuration file. Once the stopped kubelet is started again, no further alert emails will be received.

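Besides checking the mailbox, you can also confirm from the API side that the alert actually fired. A minimal sketch using the Prometheus NodePort shown above (30090 on any node, e.g. 192.168.73.135):
[root@k8s-master prometheus-k8s]# curl -s http://192.168.73.135:30090/api/v1/alerts
The JSON response lists the currently pending and firing alerts, including 采集器挂了 while kubelet is stopped.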