Monitoring Pods, Nodes, and Resource Objects in K8S
1. Monitoring K8S cluster nodes
1.1 Installing and configuring node_exporter
Pod
For Pods, the kubelet on each node uses the cAdvisor-backed metrics endpoint to expose performance data for every Pod and container running on that node. This is enabled by default when the kubelet is installed; the endpoints are:
http://NodeIP:10255/metrics/cadvisor   (kubelet read-only HTTP port)
https://NodeIP:10250/metrics/cadvisor  (kubelet authenticated HTTPS port)
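You can poke both endpoints directly to confirm they respond. The HTTPS one needs a bearer token bound to an account with node access; $TOKEN below is a placeholder for such a token, and on many clusters the read-only 10255 port is disabled, so don't be surprised if only the second command works:
# curl -s http://NodeIP:10255/metrics/cadvisor | head
# curl -sk -H "Authorization: Bearer $TOKEN" https://NodeIP:10250/metrics/cadvisor | head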
Node
For Nodes, the node_exporter collector is needed to gather host resource utilization.
https://github.com/prometheus/node_exporter
使用文档:https://prometheus.io/docs/guides/node-exporter/
Use the node_exporter.sh script below to deploy the node_exporter collector on all servers; it can be run as-is, no changes required.
[root@k8s-master prometheus-k8s]# cat node_exporter.sh
#!/bin/bash
wget https://github.com/prometheus/node_exporter/releases/download/v0.17.0/node_exporter-0.17.0.linux-amd64.tar.gz
## Extract and rename
tar zxf node_exporter-0.17.0.linux-amd64.tar.gz
mv node_exporter-0.17.0.linux-amd64 /usr/local/node_exporter
## Add a systemd service
cat <<EOF >/usr/lib/systemd/system/node_exporter.service
[Unit]
Description=https://prometheus.io
[Service]
Restart=on-failure
ExecStart=/usr/local/node_exporter/node_exporter --collector.systemd --collector.systemd.unit-whitelist=(docker|kubelet|kube-proxy|flanneld).service
[Install]
WantedBy=multi-user.target
EOF
## Start node_exporter
systemctl daemon-reload
systemctl enable node_exporter
systemctl restart node_exporter
[root@k8s-master prometheus-k8s]# ./node_exporter.sh
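Once the script finishes on a host, confirm the collector is up and answering on its default port 9100:
# systemctl status node_exporter
# curl -s localhost:9100/metrics | head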
Of course, if your company runs ansible or saltstack, this gets even easier (pre-copy the tarball instead of downloading on every host, which is much faster):
# ansible node -m copy -a "src=./node_exporter-0.18.1.linux-amd64.tar.gz dest=/tmp/"
# ansible node -m script -a "./node_exporter.sh"
In that case, change the extract step of the script above to:
## Extract and rename
tar zxf /tmp/node_exporter-0.18.1.linux-amd64.tar.gz -C /usr/local/
mv /usr/local/node_exporter-0.18.1.linux-amd64 /usr/local/node_exporter
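After the ansible run, a quick fleet-wide spot check (a sketch; "node" is the inventory group used above):
# ansible node -m shell -a "curl -s localhost:9100/metrics | head -n 3"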
1.2 Monitoring the status of node components
Next, tweak the ConfigMap. Auto-discovery can't find my nodes here, because I keep nodes and masters in separate groups, so both are listed statically (the master job doubles as a health check on the master components). The scrape interval also needs adjusting, from the default of one minute down to 30s. So the following goes into the ConfigMap (the jobs sit under scrape_configs):
global:
  scrape_interval: 30s

scrape_configs:
- job_name: k8s-nodes
  static_configs:
  - targets:
    - 192.168.1.202:9100
    - 192.168.1.203:9100
    - 192.168.1.204:9100
    - 192.168.1.165:9100
    - 192.168.1.169:9100
    - 192.168.1.172:9100
- job_name: k8s-master
  static_configs:
  - targets:
    - 192.168.1.200:9100
    - 192.168.1.201:9100
# kubectl apply -f prometheus-configmap.yaml
configmap/prometheus-config configured
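Applying the ConfigMap reloads Prometheus automatically (the prometheus-0 pod runs a config-reload sidecar next to Prometheus, which is why it shows 2/2 containers in the output further down); note the kubelet can take up to a minute to sync the mounted ConfigMap. If your deployment lacks such a sidecar, a manual reload also works, assuming Prometheus was started with --web.enable-lifecycle (a sketch, not part of the original manifests):
# kubectl -n kube-system port-forward prometheus-0 9090 &
# curl -X POST http://localhost:9090/-/reload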
There is one more problem to solve: the clock inside the Prometheus container is eight hours off, which skews query results. The simplest fix is to mount the host's localtime file into the container, so the StatefulSet gets a bit of extra configuration:
volumeMounts:
- name: localtime
  mountPath: /etc/localtime
volumes:
- name: localtime
  hostPath:
    path: /usr/share/zoneinfo/Asia/Shanghai
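After updating the StatefulSet, you can confirm the container clock is right (the container name "prometheus" is assumed to match the pod spec):
# kubectl -n kube-system exec prometheus-0 -c prometheus -- date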
Apply the updates and, once the pod has restarted, check the web UI: the new k8s-nodes and k8s-master targets should all show as UP.

Then import dashboard ID 9276 into Grafana; after a short wait the data shows up:

2. Monitoring K8S resource objects
This part monitors the state of the resource objects K8S creates, such as Services, Deployments, ReplicaSets, Endpoints, and so on. As mentioned earlier, this requires the kube-state-metrics component, which specializes in collecting state information for all kinds of K8S resource objects and covers them quite comprehensively.
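A few representative series it exposes, to give a feel for what you get (metric names from kube-state-metrics; v1.3.0 is what gets deployed below):
kube_pod_status_phase{phase="Running"}                        # pods per lifecycle phase
kube_deployment_status_replicas_unavailable                   # deployments with unavailable replicas
kube_node_status_condition{condition="Ready",status="true"}   # node readiness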
2.1 Creating the RBAC YAML to authorize kube-state-metrics
[root@k8s-master prometheus-k8s]# vim kube-state-metrics-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
rules:
- apiGroups: [""]
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs: ["list", "watch"]
- apiGroups: ["extensions"]
  resources:
  - daemonsets
  - deployments
  - replicasets
  verbs: ["list", "watch"]
- apiGroups: ["apps"]
  resources:
  - statefulsets
  verbs: ["list", "watch"]
- apiGroups: ["batch"]
  resources:
  - cronjobs
  - jobs
  verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
  resources:
  - horizontalpodautoscalers
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kube-state-metrics-resizer
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
rules:
- apiGroups: [""]
  resources:
  - pods
  verbs: ["get"]
- apiGroups: ["extensions"]
  resources:
  - deployments
  resourceNames: ["kube-state-metrics"]
  verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kube-state-metrics-resizer
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
[root@k8s-master prometheus-k8s]# kubectl apply -f kube-state-metrics-rbac.yaml
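A quick sanity check that the RBAC objects were created:
# kubectl -n kube-system get sa kube-state-metrics
# kubectl get clusterrole kube-state-metrics
# kubectl get clusterrolebinding kube-state-metrics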
Next, write the Deployment and ConfigMap YAML to deploy the kube-state-metrics pod; no modification is needed:
[root@k8s-master prometheus-k8s]# cat kube-state-metrics-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    k8s-app: kube-state-metrics
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v1.3.0
spec:
  selector:
    matchLabels:
      k8s-app: kube-state-metrics
      version: v1.3.0
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: kube-state-metrics
        version: v1.3.0
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      priorityClassName: system-cluster-critical
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: zhdya/kube-state-metrics:v1.3.0
        ports:
        - name: http-metrics
          containerPort: 8080
        - name: telemetry
          containerPort: 8081
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
      - name: addon-resizer
        image: zhdya/addon-resizer:1.8.3
        resources:
          limits:
            cpu: 100m
            memory: 30Mi
          requests:
            cpu: 100m
            memory: 30Mi
        env:
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
        command:
        - /pod_nanny
        - --config-dir=/etc/config
        - --container=kube-state-metrics
        - --cpu=100m
        - --extra-cpu=1m
        - --memory=100Mi
        - --extra-memory=2Mi
        - --threshold=5
        - --deployment=kube-state-metrics
      volumes:
      - name: config-volume
        configMap:
          name: kube-state-metrics-config
---
# ConfigMap for resource configuration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-state-metrics-config
  namespace: kube-system
  labels:
    k8s-app: kube-state-metrics
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
data:
  NannyConfiguration: |-
    apiVersion: nannyconfig/v1alpha1
    kind: NannyConfiguration
[root@k8s-master prometheus-k8s]# kubectl apply -f kube-state-metrics-deployment.yaml
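You can wait for the rollout to finish before exposing the service:
# kubectl -n kube-system rollout status deployment/kube-state-metrics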
Then write the Service YAML to expose the metrics ports:
[root@k8s-master prometheus-k8s]# cat kube-state-metrics-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "kube-state-metrics"
  annotations:
    prometheus.io/scrape: 'true'
spec:
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
    protocol: TCP
  - name: telemetry
    port: 8081
    targetPort: telemetry
    protocol: TCP
  selector:
    k8s-app: kube-state-metrics
[root@k8s-master prometheus-k8s]# kubectl apply -f kube-state-metrics-service.yaml
Check the pod and svc status: pod/kube-state-metrics-7c76bdbf68-kqqgd is running normally, and the service exposes ports 8080 and 8081.
[root@k8s-master prometheus-k8s]# kubectl get pod,svc -n kube-system
NAME                                        READY   STATUS    RESTARTS   AGE
pod/alertmanager-5d75d5688f-fmlq6           2/2     Running   0          9d
pod/coredns-5bd5f9dbd9-wv45t                1/1     Running   1          9d
pod/grafana-0                               1/1     Running   2          15d
pod/kube-state-metrics-7c76bdbf68-kqqgd     2/2     Running   6          14d
pod/kubernetes-dashboard-7d77666777-d5ng4   1/1     Running   5          16d
pod/prometheus-0                            2/2     Running   6          15d

NAME                           TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)             AGE
service/alertmanager           ClusterIP   10.0.0.207   <none>        80/TCP              13d
service/grafana                NodePort    10.0.0.74    <none>        80:30091/TCP        15d
service/kube-dns               ClusterIP   10.0.0.2     <none>        53/UDP,53/TCP       14d
service/kube-state-metrics     ClusterIP   10.0.0.194   <none>        8080/TCP,8081/TCP   14d
service/kubernetes-dashboard   NodePort    10.0.0.127   <none>        443:30001/TCP       17d
service/prometheus             NodePort    10.0.0.33    <none>        9090:30090/TCP      14d
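Using the ClusterIP from the output above (10.0.0.194 here), you can confirm from any cluster node that both ports serve data:
# curl -s http://10.0.0.194:8080/metrics | head     # resource-object metrics
# curl -s http://10.0.0.194:8081/metrics | head     # kube-state-metrics' own telemetry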