Fixing Abnormal Container Startup Order in a k8s Cluster
I. Background
After a server reboot, the business services can finish starting before the base component coredns is back up. At that point, Service name resolution fails for the business services and they start abnormally. Controlling the container startup order is therefore critical.
II. Solution
Before the application in a pod starts, curl the Kubernetes apiserver Service a fixed number of times. If it stays unreachable, exit immediately; once a curl succeeds, continue with the normal business startup. Because resolving the name kubernetes.default goes through the cluster DNS (coredns), a successful curl also proves that DNS is working.
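For reference, the same gate can also be expressed without rebuilding the image, as an initContainer that blocks the pod until the check passes. The sketch below is only illustrative (the container name and image tag are assumptions), not the approach taken in this article; unlike the script below, it retries indefinitely instead of giving up with an error code:

spec:
  initContainers:
  - name: wait-for-dns
    # any image that ships curl works; this image/tag is an assumption
    image: curlimages/curl:7.72.0
    command: ["sh", "-c", "until curl -s kubernetes.default:443 >/dev/null 2>&1; do sleep 2; done"]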
Implementation:
1. Add a check script for the k8s apiserver Service
a. Use curl to test whether the Service endpoint kubernetes.default:443 responds
b. Retry on failure (the loop below allows up to four attempts); if every attempt fails, exit with error code 4; on the first success, continue to the next step
#!/bin/bash
count=0
while [ $count -le 3 ]
do
    # NOTE: %s prints epoch seconds (visible in the test logs below); use %S for a seconds field
    date=$(date "+%Y-%m-%d-%H:%M:%s")
    sleep 2
    ### Check whether the kubernetes apiserver Service responds
    ### (a plain-HTTP request to the TLS port still gets a reply, so curl exits 0
    ### once DNS resolution and the TCP connection succeed)
    curl -s "kubernetes.default:443"
    if [ $? -eq 0 ]; then
        ### Log to stdout
        echo "$date Apiserver Service Is Ready" >> /dev/stdout
        break
    else
        count=$((count + 1))
        ### Log to stdout
        echo "$date Apiserver Service Is NotReady" >> /dev/stdout
    fi
    if [ $count -ge 4 ]; then
        ### After four failed attempts, exit with error code 4
        exit 4
    fi
done
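A quick way to exercise the script's exit-code contract from any shell inside the image:

sh /check-coredns.sh
echo "exit code: $?"   # 0 once the Service answered, 4 after four failed attempts

Note that the NotReady log lines in the test section below are roughly 22 seconds apart: with DNS down, each curl blocks on resolver timeouts, so a full round of four attempts takes well over the nominal 8 seconds of sleep.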
2. Bake the apiserver Service check script into the image when building the service image
1. Dockerfile configuration for building the image
Example:
FROM docker.XXX.com:15000/XXX-backend:2.3.0
ADD check-coredns.sh /
ADD start.sh /
CMD ["/start.sh"]
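CMD ["/start.sh"] requires /start.sh to be executable. If the scripts do not already carry the executable bit in the build context, a line like the following can be added before CMD (an assumption about the build context, not part of the original Dockerfile):

RUN chmod +x /check-coredns.sh /start.sh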
2. start.sh, the container startup script
Example:
#!/bin/bash
### Recreate the log directory
rm -rf /log
mkdir /log
#### Run the kubernetes Service check
sh /check-coredns.sh
if [ $? -eq 4 ]; then
    ### When the check script returns error code 4, exit so the container restarts and the check runs again
    exit 4
fi
mkdir -p /root/log/${HOSTNAME}_${PROJECT_NS}_${PROJECT_NAME}
ln -s /root/log/${HOSTNAME}_${PROJECT_NS}_${PROJECT_NAME} /log/user
### Generate config in the background, then run tomcat in the foreground
nohup /app_home/bin/config_generate &
/usr/local/tomcat/bin/catalina.sh run
check-coredns.sh (see the k8s apiserver Service check script under item 1 above)
3. Build and push the docker image
Example:
docker build -t="docker.XXX.com:15000/XXX-backend:2.3.3" .
docker push docker.XXX.com:15000/XXX-backend:2.3.3
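Besides the delete/create flow used in the test section below, a running Deployment can be rolled onto the new tag directly; the deployment and container names here are assumed for illustration:

kubectl -n XXX-project-namespace set image deployment/XXX-backend XXX-backend=docker.XXX.com:15000/XXX-backend:2.3.3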
3. coredns version upgrade
coredns was upgraded from version 1.2.6 to 1.6.2; the yaml files used for the upgrade are below.
coredns_deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
k8s-app: kube-dns
name: coredns
namespace: kube-system
spec:
progressDeadlineSeconds: 600
replicas: 2
revisionHistoryLimit: 10
selector:
matchLabels:
k8s-app: kube-dns
strategy:
rollingUpdate:
maxSurge: 25%
maxUnavailable: 1
type: RollingUpdate
template:
metadata:
labels:
k8s-app: kube-dns
spec:
containers:
- args:
- -conf
- /etc/coredns/Corefile
image: docker.XXX.com:15000/coredns:1.6.2
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 5
httpGet:
path: /health
port: 8080
scheme: HTTP
initialDelaySeconds: 60
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 5
name: coredns
ports:
- containerPort: 53
name: dns
protocol: UDP
- containerPort: 53
name: dns-tcp
protocol: TCP
- containerPort: 9153
name: metrics
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /ready
port: 8181
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
resources:
limits:
memory: 170Mi
requests:
cpu: 100m
memory: 70Mi
securityContext:
allowPrivilegeEscalation: false
capabilities:
add:
- NET_BIND_SERVICE
drop:
- all
readOnlyRootFilesystem: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /etc/coredns
name: config-volume
readOnly: true
dnsPolicy: Default
nodeSelector:
beta.kubernetes.io/os: linux
priorityClassName: system-cluster-critical
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: coredns
serviceAccountName: coredns
terminationGracePeriodSeconds: 30
tolerations:
- key: CriticalAddonsOnly
operator: Exists
- effect: NoSchedule
key: node-role.kubernetes.io/master
volumes:
- configMap:
defaultMode: 420
items:
- key: Corefile
path: Corefile
name: coredns
name: config-volume
coredns_svc.yaml
apiVersion: v1
kind: Service
metadata:
annotations:
prometheus.io/port: "9153"
prometheus.io/scrape: "true"
labels:
k8s-app: kube-dns
kubernetes.io/cluster-service: "true"
kubernetes.io/name: KubeDNS
name: kube-dns
namespace: kube-system
spec:
clusterIP: 10.96.0.10
ports:
- name: dns
port: 53
protocol: UDP
targetPort: 53
- name: dns-tcp
port: 53
protocol: TCP
targetPort: 53
- name: metrics
port: 9153
protocol: TCP
targetPort: 9153
selector:
k8s-app: kube-dns
sessionAffinity: None
type: ClusterIP
coredns_config.yaml
apiVersion: v1
data:
Corefile: |
.:53 {
errors
health
ready
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods insecure
fallthrough in-addr.arpa ip6.arpa
ttl 30
}
hosts {
10.68.7.129 mon01
10.68.7.128 mon02
10.68.7.127 mon03
ttl 60
fallthrough
}
prometheus :9153
forward . /etc/resolv.conf
cache 30
loop
reload
loadbalance
}
kind: ConfigMap
metadata:
name: coredns
namespace: kube-system
kubectl delete -f coredns_config.yaml; kubectl delete -f coredns_svc.yaml; kubectl delete -f coredns_deployment.yaml
kubectl create -f coredns_config.yaml; kubectl create -f coredns_svc.yaml; kubectl create -f coredns_deployment.yaml
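To confirm the upgraded coredns is serving before moving on, two quick checks (the busybox image and tag are the usual example from the Kubernetes documentation):

kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default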
Change:
cat coredns_deployment.yaml
..........
..........
readinessProbe:
failureThreshold: 3
httpGet:
path: /ready
port: 8181
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
...........
...........
Compared with the 1.2.6 deployment, a readiness probe was added. The kube-dns Service only sends traffic to coredns pods that pass this probe, so clients no longer hit a coredns instance before it can answer queries.
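The probe endpoint is served by the `ready` plugin enabled in the Corefile above, and can be hit by hand against a coredns pod IP (the address below is a placeholder):

curl http://<coredns-pod-ip>:8181/ready    # returns 200 OK once all plugins report ready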
4. Test procedure
1. Save the current coredns yaml
kubectl -n kube-system get deploy coredns -o yaml > coredns.yaml
2. Stop the coredns service
kubectl delete -f coredns.yaml
3. Recreate the business service with the new image
cat XXX-backend.yaml
......
image: docker.XXX.com:15000/XXX-backend:2.3.3
imagePullPolicy: IfNotPresent
......
kubectl delete -f XXX-backend.yaml
kubectl create -f XXX-backend.yaml
4. Watch the logs
[root@k8s-10 ops]# kubectl -n XXX-project-namespace logs -f XXX-backend-7957bd486d-k9mwt
2020-04-14-23:55:1586879711 Apiserver Service Is NotReady
2020-04-14-23:55:1586879733 Apiserver Service Is NotReady
2020-04-14-23:55:1586879756 Apiserver Service Is NotReady
2020-04-14-23:56:1586879778 Apiserver Service Is NotReady
2020-04-14-23:56:1586879801 Apiserver Service Is NotReady
2020-04-14-23:57:1586879824 Apiserver Service Is NotReady
2020-04-14-23:57:1586879846 Apiserver Service Is NotReady
2020-04-14-23:57:1586879869 Apiserver Service Is NotReady
2020-04-14-23:58:1586879891 Apiserver Service Is NotReady
2020-04-14-23:58:1586879914 Apiserver Service Is NotReady
Because coredns is not up, XXX-backend cannot start: the check script exits with code 4, start.sh propagates it, and since Deployment pods run with restartPolicy: Always, the kubelet restarts the container after a while.
[root@k8s-10 ops]# kubectl get pods --all-namespaces |grep XXX-backend
XXX-project-namespace XXX-backend-7957bd486d-k9mwt 0/1 Running 1 4m40s
XXX-project-namespace XXX-backend-7957bd486d-mwt9k 0/1 Running 1 4m40s
XXX-project-namespace XXX-backend-7957bd486d-zzq9b 0/1 Running 1 4m40s
Note that the RESTARTS column has changed to 1.
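The exit code behind that restart can be confirmed from the pod's last container state (pod name taken from the listing above):

kubectl -n XXX-project-namespace describe pod XXX-backend-7957bd486d-k9mwt | grep -A 3 "Last State"
# expected output includes: Last State: Terminated ... Exit Code: 4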
5. Start the coredns components
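Recreate coredns from the yaml saved in step 1:

kubectl create -f coredns.yaml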
[root@k8s-10 ops]# kubectl -n XXX-project-namespace logs -f XXX-backend-7957bd486d-k9mwt
2020-04-14-23:59:1586879945 Apiserver Service Is NotReady
2020-04-14-23:59:1586879967 Apiserver Service Is NotReady
2020-04-14-23:59:1586879990 Apiserver Service Is NotReady
2020-04-15-00:00:1586880012 Apiserver Service Is NotReady
2020-04-15-00:00:1586880035 Apiserver Service Is NotReady
2020-04-15-00:00:1586880057 Apiserver Service Is NotReady
2020-04-15-00:01:1586880080 Apiserver Service Is NotReady
2020-04-15-00:01:1586880102 Apiserver Service Is NotReady
#### At this point the Service resolves again and the application starts normally
2020-04-15-00:02:1586880120 Apiserver Service Is Ready
/usr/local/tomcat/webapps/dops/WEB-INF/classes/
Java HotSpot(TM) 64-Bit Server VM warning: Cannot open file /log/jvm/gcdetail.log due to No such file or directory
15-Apr-2020 00:02:03.228 WARNING [main] org.apache.catalina.startup.SetAllPropertiesRule.begin [SetAllPropertiesRule]{Server/Service/Connector} Setting property 'maxPostSize' to '10737418240' did not find a matching property.
15-Apr-2020 00:02:03.275 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Server version: Apache Tomcat/8.5.27
15-Apr-2020 00:02:03.275 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Server built: Jan 18 2018 20:12:40 UTC
15-Apr-2020 00:02:03.275 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Server number: 8.5.27.0
15-Apr-2020 00:02:03.276 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log OS Name: Linux
Conclusion
Building the image as above ensures the service starts only after it can resolve in-cluster Service names.