K8s Hands-On Series - 1.2 - Node Management

Intro: In the previous article we built a one-master, two-worker cluster from scratch. In real production work, node management is one of the tasks you face most often, so this article walks through the common node-management operations hands-on. It covers:
- node status
- node communication and node failure
- removing a node
- adding a node

Node Management

Node Status

Please refer to https://kubernetes.io/docs/concepts/architecture/nodes/#node-status for the full description of node status.
Here we focus on the Conditions field. As the documentation describes, conditions report the health of a running node: Ready, MemoryPressure, DiskPressure, PIDPressure, and NetworkUnavailable.
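
A node's conditions can also be read straight off the Node object; a quick sketch (any node name from the cluster works, and the jsonpath expression prints each condition as type=status):

  ~ kubectl get node worker01 -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'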

Node Unreachable or Node Down

List the nodes:

  ~ kubectl get node -o wide
NAME       STATUS   ROLES    AGE   VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
worker01   Ready    master   21h   v1.14.3   192.168.101.113   <none>        Ubuntu 16.04.6 LTS   4.4.0-150-generic   docker://18.9.5
worker02   Ready    <none>   21h   v1.14.3   192.168.100.117   <none>        Ubuntu 16.04.6 LTS   4.4.0-150-generic   docker://18.9.5
# List the pods and the nodes they are running on
  ~ kubectl get pods -o wide
NAME                    READY   STATUS    RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
nginx-6657c9ffc-6q2pt   1/1     Running   1          19h   10.244.0.11   worker01   <none>           <none>
nginx-6657c9ffc-msd6t   1/1     Running   0          19h   10.244.1.17   worker02   <none>           <none>

Now let's shut worker02 down.
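
Any way of taking the machine offline works; for example, assuming shell access to worker02:

  # Run on worker02 itself (pulling the power or cutting its network demonstrates the same thing)
  ~ sudo shutdown -h now

Immediately afterwards, the control plane still reports both the node and its pod as healthy: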

  ~ kubectl get pod -o wide
NAME                    READY   STATUS    RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
nginx-6657c9ffc-6q2pt   1/1     Running   1          20h   10.244.0.11   worker01   <none>           <none>
nginx-6657c9ffc-msd6t   1/1     Running   0          20h   10.244.1.17   worker02   <none>           <none>
  ~ kubectl get node -o wide
NAME       STATUS   ROLES    AGE   VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
worker01   Ready    master   22h   v1.14.3   192.168.101.113   <none>        Ubuntu 16.04.6 LTS   4.4.0-150-generic   docker://18.9.5
worker02   Ready    <none>   21h   v1.14.3   192.168.100.117   <none>        Ubuntu 16.04.6 LTS   4.4.0-150-generic   docker://18.9.5
  ~ ping 192.168.100.117
PING 192.168.100.117 (192.168.100.117) 56(84) bytes of data.
^C
--- 192.168.100.117 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1007ms

Although kubectl still reports the node as Ready, it can no longer be pinged. The node controller only marks a node NotReady after it has missed heartbeats for the node-monitor grace period (40s by default), so checking again a moment later, the node has moved to NotReady:

  ~ kubectl get node -o wide
NAME       STATUS     ROLES    AGE   VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
worker01   Ready      master   22h   v1.14.3   192.168.101.113   <none>        Ubuntu 16.04.6 LTS   4.4.0-150-generic   docker://18.9.5
worker02   NotReady   <none>   21h   v1.14.3   192.168.100.117   <none>        Ubuntu 16.04.6 LTS   4.4.0-150-generic   docker://18.9.5
  ~ kubectl get pod -o wide 
NAME                    READY   STATUS    RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
nginx-6657c9ffc-6q2pt   1/1     Running   1          20h   10.244.0.11   worker01   <none>           <none>
nginx-6657c9ffc-msd6t   1/1     Running   0          20h   10.244.1.17   worker02   <none>           <none>
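
Rather than polling, you can also watch the Node object and catch the transition the moment it happens:

  # -w streams updates until interrupted with Ctrl-C
  ~ kubectl get node worker02 -w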

Describing the node shows that it has been tainted with node.kubernetes.io/unreachable:NoExecute and node.kubernetes.io/unreachable:NoSchedule, and its conditions have changed accordingly:

  ~ kubectl describe node worker02
Name:               worker02
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
...
Annotations:        flannel.alpha.coreos.com/backend-data: {"VtepMAC":"4e:0a:b4:88:0d:83"}
...
CreationTimestamp:  Sat, 08 Jun 2019 12:42:00 +0800
Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule
Unschedulable:      false
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Sun, 09 Jun 2019 10:03:52 +0800   Sun, 09 Jun 2019 10:04:52 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Sun, 09 Jun 2019 10:03:52 +0800   Sun, 09 Jun 2019 10:04:52 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Sun, 09 Jun 2019 10:03:52 +0800   Sun, 09 Jun 2019 10:04:52 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Sun, 09 Jun 2019 10:03:52 +0800   Sun, 09 Jun 2019 10:04:52 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
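
The taints can also be read directly from the node spec; a one-liner sketch:

  ~ kubectl get node worker02 -o jsonpath='{.spec.taints}'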

Checking the pods again, a replacement pod has been scheduled onto the other node, while the original is stuck in Terminating:

  ~ kubectl get pods -o wide
NAME                    READY   STATUS        RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
nginx-6657c9ffc-6q2pt   1/1     Running       1          20h   10.244.0.11   worker01   <none>           <none>
nginx-6657c9ffc-msd6t   1/1     Terminating   0          20h   10.244.1.17   worker02   <none>           <none>
nginx-6657c9ffc-n9s4p   1/1     Running       0          71s   10.244.0.14   worker01   <none>           <none>

Describe the pod that is stuck in Terminating:

  ~ kubectl describe pod/nginx-6657c9ffc-msd6t
Name:            nginx-6657c9ffc-msd6t
Node:            worker02/192.168.100.117
Status:          Terminating (lasts 2m5s)
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>

Note the two tolerations: by default every pod tolerates the not-ready and unreachable NoExecute taints for 300s, which explains why the pod stayed Running for roughly five minutes after the node went NotReady before being terminated and recreated elsewhere.
For more details, please refer to https://kubernetes.io/docs/concepts/architecture/nodes/#node-controller
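
Those 300s defaults are injected into every pod by the DefaultTolerationSeconds admission plugin. If five minutes of limbo is too long for a workload, you can set tolerationSeconds explicitly in the pod template. A minimal sketch against our nginx deployment (the 60s value is purely illustrative):

  ~ kubectl patch deployment nginx -p '
spec:
  template:
    spec:
      tolerations:
      # Fail over after 60s instead of the default 300s when the node is unreachable or not ready
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 60
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 60
'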

Removing a Node

To make the effect easier to observe, first scale the deployment up to 10 replicas:

  ~ kubectl scale --replicas=10 deployment/nginx
deployment.extensions/nginx scaled

  ~ kubectl get pod -o wide                     
NAME                    READY   STATUS    RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
nginx-6657c9ffc-6lfbp   1/1     Running   0          8s    10.244.1.20   worker02   <none>           <none>
nginx-6657c9ffc-6q2pt   1/1     Running   1          20h   10.244.0.11   worker01   <none>           <none>
nginx-6657c9ffc-8cbpl   1/1     Running   0          8s    10.244.1.23   worker02   <none>           <none>
nginx-6657c9ffc-dbr7c   1/1     Running   0          8s    10.244.1.22   worker02   <none>           <none>
nginx-6657c9ffc-kk84d   1/1     Running   0          8s    10.244.1.21   worker02   <none>           <none>
nginx-6657c9ffc-kp64c   1/1     Running   0          8s    10.244.0.16   worker01   <none>           <none>
nginx-6657c9ffc-n9s4p   1/1     Running   0          40m   10.244.0.14   worker01   <none>           <none>
nginx-6657c9ffc-pxbs2   1/1     Running   0          8s    10.244.1.19   worker02   <none>           <none>
nginx-6657c9ffc-sb85v   1/1     Running   0          8s    10.244.1.18   worker02   <none>           <none>
nginx-6657c9ffc-sxb25   1/1     Running   0          8s    10.244.0.15   worker01   <none>           <none>

Drain the node, then delete it. --ignore-daemonsets is required because DaemonSet-managed pods cannot be evicted permanently (their controller would recreate them on the node immediately), --force also evicts pods not managed by a controller, and --grace-period=900 gives each pod up to 900 seconds to shut down cleanly:

  ~ kubectl drain worker02 --force --grace-period=900 --ignore-daemonsets
node/worker02 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/kube-flannel-ds-amd64-crq7k, kube-system/kube-proxy-x9h7m
evicting pod "nginx-6657c9ffc-sb85v"
evicting pod "nginx-6657c9ffc-dbr7c"
evicting pod "nginx-6657c9ffc-6lfbp"
evicting pod "nginx-6657c9ffc-8cbpl"
evicting pod "nginx-6657c9ffc-kk84d"
evicting pod "nginx-6657c9ffc-pxbs2"
pod/nginx-6657c9ffc-kk84d evicted
pod/nginx-6657c9ffc-dbr7c evicted
pod/nginx-6657c9ffc-8cbpl evicted
pod/nginx-6657c9ffc-pxbs2 evicted
pod/nginx-6657c9ffc-sb85v evicted
pod/nginx-6657c9ffc-6lfbp evicted
node/worker02 evicted

# After the drain completes, all pods are running on worker01
  ~ kubectl get pod -o wide   
NAME                    READY   STATUS    RESTARTS   AGE     IP            NODE       NOMINATED NODE   READINESS GATES
nginx-6657c9ffc-45svt   1/1     Running   0          63s     10.244.0.18   worker01   <none>           <none>
nginx-6657c9ffc-489zz   1/1     Running   0          63s     10.244.0.21   worker01   <none>           <none>
nginx-6657c9ffc-49vqv   1/1     Running   0          63s     10.244.0.20   worker01   <none>           <none>
nginx-6657c9ffc-6q2pt   1/1     Running   1          20h     10.244.0.11   worker01   <none>           <none>
nginx-6657c9ffc-fq7gg   1/1     Running   0          63s     10.244.0.22   worker01   <none>           <none>
nginx-6657c9ffc-kl6j9   1/1     Running   0          63s     10.244.0.19   worker01   <none>           <none>
nginx-6657c9ffc-kp64c   1/1     Running   0          4m31s   10.244.0.16   worker01   <none>           <none>
nginx-6657c9ffc-n9s4p   1/1     Running   0          44m     10.244.0.14   worker01   <none>           <none>
nginx-6657c9ffc-ssmph   1/1     Running   0          63s     10.244.0.17   worker01   <none>           <none>
nginx-6657c9ffc-sxb25   1/1     Running   0          4m31s   10.244.0.15   worker01   <none>           <none>
# Delete the node
  ~ kubectl delete node worker02
node "worker02" deleted
  ~ kubectl get node
NAME       STATUS   ROLES    AGE   VERSION
worker01   Ready    master   22h   v1.14.3
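
As an aside: if you only need to take a node out of scheduling temporarily (say, for maintenance) rather than remove it, cordoning is enough. Running pods are left untouched, and uncordon reverses it:

  ~ kubectl cordon worker02      # mark the node unschedulable; existing pods keep running
  ~ kubectl uncordon worker02    # make it schedulable again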

For more details, please refer to https://kubernetes.io/docs/concepts/workloads/pods/disruptions/

Adding a Node

After kubeadm init finishes on the master, it prints instructions for joining new nodes, including a ready-made kubeadm join command; be sure to save it.
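If you lose it, there is no need to reinitialize anything: the bootstrap token expires after 24 hours by default, and a fresh join command can be printed on the master at any time:

  ~ kubeadm token create --print-join-command
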
Next, let's reset the worker02 we just removed and then rejoin it to the cluster.

  ~ kubeadm reset
[reset] WARNING: Changes made to this host by 'kubeadm init' or 'kubeadm join' will be reverted.
[reset] Are you sure you want to proceed? [y/N]: y
[preflight] Running pre-flight checks
W0609 11:00:23.206928    7376 reset.go:234] [reset] No kubeadm config, using etcd pod spec to get data directory
[reset] No etcd config found. Assuming external etcd
[reset] Please manually reset etcd to prevent further issues
[reset] Stopping the kubelet service
[reset] unmounting mounted directories in "/var/lib/kubelet"
[reset] Deleting contents of stateful directories: [/var/lib/kubelet /etc/cni/net.d /var/lib/dockershim /var/run/kubernetes]
[reset] Deleting contents of config directories: [/etc/kubernetes/manifests /etc/kubernetes/pki]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]

The reset process does not reset or clean up iptables rules or IPVS tables.
If you wish to reset iptables, you must do so manually.
For example:
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X

If your cluster was setup to utilize IPVS, run ipvsadm --clear (or similar)
to reset your system's IPVS tables.

  ~ iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X

Run kubeadm join on worker02:

  ~ kubeadm join 192.168.101.113:6443 --token ss6flg.csw4u0ok134n2fy1 \
    --discovery-token-ca-cert-hash sha256:bac9a150228342b7cdedf39124ef2108653db1f083e9f547d251e08f03c41945
[preflight] Running pre-flight checks
    [WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[kubelet-start] Downloading configuration for the kubelet from the "kubelet-config-1.14" ConfigMap in the kube-system namespace
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Activating the kubelet service
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...

This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the control-plane to see this node join the cluster.

Checking the nodes, worker02 has been added back successfully:

  ~ kubectl get node -o wide
NAME       STATUS   ROLES    AGE   VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
worker01   Ready    master   23h   v1.14.3   192.168.101.113   <none>        Ubuntu 16.04.6 LTS   4.4.0-150-generic   docker://18.9.5
worker02   Ready    <none>   33s   v1.14.3   192.168.100.117   <none>        Ubuntu 16.04.6 LTS   4.4.0-150-generic   docker://18.9.5
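
One cosmetic note: a node joined this way shows ROLES as <none>, because that column is derived from node-role.kubernetes.io/* labels. If you'd like it to read worker, add the label yourself:

  ~ kubectl label node worker02 node-role.kubernetes.io/worker=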