Installing k8s 1.9.0 in Practice: A Collection of Problems



店家小二 2018-12-16 11:38:00

Differences between k8s 1.5 and k8s 1.9

An earlier attempt to install kubernetes 1.5.2 had failed because of a conflicting docker package. Looking into how newer versions are installed, it turns out that newer kubernetes releases no longer bundle docker; the user is expected to install the docker service first.

The machine already had Docker version 17.12.0-ce, build c97c6d6 installed.

Installing kubernetes (kubernetes.x86_64 1.5.2-0.7.git269f928.el7) after that failed:

Error: docker-ce conflicts with 2:docker-1.12.6-71.git3e8e77d.el7.centos.1.x86_64

You could try adding the --skip-broken option to work around the problem

You could try running: rpm -Va --nofiles --nodigest

Suspecting a version problem, I searched online for ways to install a newer version. Findings:

"Starting with kubernetes 1.6, however, installation gets much more involved: certificates and all kinds of authentication are required, which is quite unfriendly to people new to kubernetes. If you follow the official docs to set up a 'cluster' locally, I'd bet it won't come up, unless you can get past the GFW and also know how to keep tweaking parameters."

In other words, installing k8s 1.6+ differs substantially from earlier versions. Google is blocked, so many docker images have to be downloaded in advance.

The following three articles install k8s 1.7.5; those installs failed for lack of docker images.

https://www.cnblogs.com/liangDream/p/7358847.html

http://www.bubuko.com/infodetail-2375091.html

https://www.kubernetes.org.cn/3063.html

Docker installation problems

Choosing a docker version

kubernetes 1.9.0 supports docker up to 17.03; the installed 17.12 is too new and has to be downgraded.

Kubernetes' Docker version support matrix: http://blog.csdn.net/csdn_duomaomao/article/details/79171027

Removing docker

[root@tensorflow0 hdzhou]# yum remove docker \
    docker-common \
    docker-selinux \
    docker-engine

======================================================================================================================================================================================

Package Arch Version Repository Size

======================================================================================================================================================================================

Removing:

container-selinux noarch 2:2.36-1.gitff95335.el7 @extras 34 k

Removing for dependencies:

docker-ce x86_64 17.12.0.ce-1.el7.centos installed 123 M

nvidia-docker2 noarch 2.0.2-1.docker17.12.0.ce @nvidia-docker 2.3 k

Transaction Summary

======================================================================================================================================================================================

Remove 1 Package (+2 Dependent packages)

Docker fails to start

Feb 26 16:42:00 tensorflow0 dockerd[8717]: time="2018-02-26T16:42:00.315096986+08:00" level=info msg="libcontainerd: new containerd process, pid: 8725"

Feb 26 16:42:01 tensorflow0 dockerd[8717]: time="2018-02-26T16:42:01.319051277+08:00" level=error msg="[graphdriver] prior storage driver overlay2 failed: driver not supported"

Feb 26 16:42:01 tensorflow0 dockerd[8717]: Error starting daemon: error initializing graphdriver: driver not supported

Feb 26 16:42:01 tensorflow0 systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE

Feb 26 16:42:01 tensorflow0 systemd[1]: Failed to start Docker Application Container Engine.

Fix:

sudo mv /var/lib/docker /var/lib/docker.old

Reference: https://stackoverflow.com/questions/33357824/prior-storage-driver-aufs-failed-driver-not-supported-error-starting-daemon

k8s installation problems

Installing with rpm

rpm -ivh socat-1.7.3.2-2.el7.x86_64.rpm

rpm -ivh kubernetes-cni-0.6.0-0.x86_64.rpm kubelet-1.9.0-0.x86_64.rpm kubectl-1.9.0-0.x86_64.rpm

rpm -ivh kubectl-1.9.0-0.x86_64.rpm

rpm -ivh kubeadm-1.9.0-0.x86_64.rpm

Removing with rpm

rpm -e <package-name> --nodeps

e.g.:

rpm -e socat-1.7.3.2-2.el7.x86_64 --nodeps

rpm -e kubernetes-cni-0.6.0-0.x86_64 --nodeps

rpm -e kubelet-1.9.0-0.x86_64 --nodeps

rpm -e kubectl-1.9.0-0.x86_64 --nodeps

rpm -e kubeadm-1.9.0-0.x86_64 --nodeps

Viewing error messages

cat /var/log/messages

journalctl -xeu kubelet

It is normal for the ca files to be missing right after kubelet starts; they are generated later when kubeadm init runs.

It is also normal for kubelet to keep restarting at this point!

The kubelet is now restarting every few seconds, as it waits in a crashloop for kubeadm to tell it what to do. This crashloop is expected and normal, please proceed with the next step and the kubelet will start running normally.

Initializing the cluster

kubeadm init --kubernetes-version=v1.9.0 --pod-network-cidr=10.244.0.0/16

Be sure to record the following output; it is different each time it is generated.

e.g.:

kubeadm join --token 5ce44e.47b6dc4e4b66980f 192.168.1.138:6443 --discovery-token-ca-cert-hash sha256:9d7eac82d66744405c783de5403e1f2bb7191b4c1b350d721b7b8570c62ff83a

Regenerating the token

kubeadm token list

or

kubeadm token create

Tokens expire after 24 hours; past that, a new one has to be generated.

To get the sha256 hash, run on the master node:

openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'
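Putting the two together: the join command for a new node is just the master address, a valid token, and that hash. A minimal sketch, using the example values from this article (substitute your own master IP, token from `kubeadm token list` or `kubeadm token create`, and the hash from the openssl command above):

```shell
# Example values only, copied from the sample output above.
MASTER=192.168.1.138:6443
TOKEN=5ce44e.47b6dc4e4b66980f
HASH=9d7eac82d66744405c783de5403e1f2bb7191b4c1b350d721b7b8570c62ff83a

# Print the command to run on the joining node:
echo "kubeadm join --token ${TOKEN} ${MASTER} --discovery-token-ca-cert-hash sha256:${HASH}"
```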

kubeadm init

[root@tensorflow0 etc]# kubeadm init --kubernetes-version=v1.9.0 --pod-network-cidr=10.244.0.0/16

[init] Using Kubernetes version: v1.9.0

[init] Using Authorization modes: [Node RBAC]

[preflight] Running pre-flight checks.

[preflight] Some fatal errors occurred:

[preflight] If you know what you are doing, you can make a check non-fatal with --ignore-preflight-errors=...

[root@tensorflow0 etc]#

Append --ignore-preflight-errors 'Swap' to the command, or --ignore-preflight-errors all (the latter is a bad idea).

Port 2379 is in use: this happens when kubeadm reset was not run first.

To inspect the error:

kubectl get pod kube-proxy-d2p7p -o wide --namespace=kube-system

kubectl describe pod kube-proxy-d2p7p --namespace=kube-system

Modifying the kubelet config and starting kubelet (all nodes)

Note: keep watching the log output in /var/log/messages; you will see kubelet failing to start over and over.

cat /etc/systemd/system/kubelet.service.d/10-kubeadm.conf

Edit the 10-kubeadm.conf file and change the cgroup-driver setting:

[root@centos7-base-ok]# cat /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--kubeconfig=/etc/kubernetes/kubelet.conf --require-kubeconfig=true"
Environment="KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true"
Environment="KUBELET_NETWORK_ARGS=--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin"
Environment="KUBELET_DNS_ARGS=--cluster-dns=10.96.0.10 --cluster-domain=cluster.local"
Environment="KUBELET_AUTHZ_ARGS=--authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt"
Environment="KUBELET_CADVISOR_ARGS=--cadvisor-port=0"
Environment="KUBELET_CGROUP_ARGS=--cgroup-driver=cgroupfs"
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_SYSTEM_PODS_ARGS $KUBELET_NETWORK_ARGS $KUBELET_DNS_ARGS $KUBELET_AUTHZ_ARGS $KUBELET_CADVISOR_ARGS $KUBELET_CGROUP_ARGS $KUBELET_EXTRA_ARGS

Environment="KUBELET_SWAP_ARGS=--fail-swap-on=false" — starting with 1.8, kubelet refuses to start if the machine has swap enabled (the default for this flag is true). You can either set the flag to false in the kubelet config, or simply disable swap on the machine; see below for how.

Change "--cgroup-driver=systemd" to "--cgroup-driver=cgroupfs".

What matters here is that docker's cgroup driver and --cgroup-driver agree. Check with docker info | grep Cgroup; it may be either systemd or cgroupfs.
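A quick way to compare the two is sketched below; parse_cgroup_driver is a helper name made up here for illustration, not a docker or kubelet command:

```shell
# Hypothetical helper: pull the value out of a "Cgroup Driver: ..." line.
parse_cgroup_driver() {
  awk -F': ' '/Cgroup Driver/ {print $2}'
}

# On a live node (assumes docker is running and the kubeadm drop-in exists):
#   docker info 2>/dev/null | parse_cgroup_driver
#   grep -o 'cgroup-driver=[a-z]*' /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
# The two results must agree: both cgroupfs, or both systemd.
```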

Restart kubelet

[root@centos7-base-ok]# systemctl restart kubelet

[preflight] Running pre-flight checks.

[preflight] Some fatal errors occurred:

Disabling swap

swapoff -a

Disabling swap permanently

Edit /etc/fstab and comment out the swap line with #.

https://zhidao.baidu.com/question/2011273820596440908.html
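The fstab edit can also be scripted with sed. A sketch, demonstrated on a sample copy first (the UUIDs are made up for the demo); apply the same expression to /etc/fstab once the result looks right:

```shell
# Build a small sample fstab (hypothetical UUIDs, demo only).
printf '%s\n' \
  'UUID=abcd / ext4 defaults 0 1' \
  'UUID=ef01 swap swap defaults 0 0' > /tmp/fstab.demo

# Comment out any not-yet-commented line containing a " swap " field.
sed -i '/\sswap\s/ s/^[^#]/#&/' /tmp/fstab.demo
cat /tmp/fstab.demo

# For real use (keeps a .bak copy):
#   sed -i.bak '/\sswap\s/ s/^[^#]/#&/' /etc/fstab
```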

Removing etcd

yum erase etcd

Remove the etcd directory: mv /var/lib/etcd /var/lib/etcd.bak

The connection to the server localhost:8080 was refused - did you specify the right host or port?

export KUBECONFIG=/etc/kubernetes/admin.conf

The API server is defined on port 6443, not 8080.
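That export only lives for the current shell; if it disappears (new session, lost profile), the 8080 error comes back. A sketch to make it permanent, assuming bash and that admin.conf is at the default path:

```shell
# Append the export to the login profile so new shells pick it up.
echo 'export KUBECONFIG=/etc/kubernetes/admin.conf' >> "$HOME/.bash_profile"

# Take effect in the current shell too:
export KUBECONFIG=/etc/kubernetes/admin.conf
```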

runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

Reference: http://windgreen.me/

kube-dns fails to start

kube-system po/kube-dns-6f4fd4bdf-p5x4k 0/3 Pending 0 14m

Edit /etc/systemd/system/kubelet.service.d/10-kubeadm.conf

Deleting $KUBELET_NETWORK_ARGS — do not do this.

If dns misbehaves, kubeadm reset and start over: try initializing the master first, then configuring the flannel network, and only once that is OK, join the other machines.

Restart

systemctl daemon-reload && systemctl restart kubelet

kubeadm reset

kubeadm init --kubernetes-version=v1.9.0 --pod-network-cidr=10.244.0.0/16

kube-proxy fails to start

Same cause as "no IP addresses available" below:

after bringing the cluster up many times, the virtual IPs were exhausted.

no IP addresses available

E1216 23:50:16.116098 28152 pod_workers.go:186] Error syncing pod 6f5b9673-e2b5-11e7-a0f5-001e67d35991 ("kube-dns-6f4fd4bdf-xrj4w_kube-system(6f5b9673-e2b5-11e7-a0f5-001e67d35991)"), skipping: failed to "CreatePodSandbox" for "kube-dns-6f4fd4bdf-xrj4w_kube-system(6f5b9673-e2b5-11e7-a0f5-001e67d35991)" with CreatePodSandboxError: "CreatePodSandbox for pod "kube-dns-6f4fd4bdf-xrj4w_kube-system(6f5b9673-e2b5-11e7-a0f5-001e67d35991)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "kube-dns-6f4fd4bdf-xrj4w_kube-system" network: failed to allocate for range 0: no IP addresses available in range set: 10.244.0.1-10.244.0.254"

After bringing the cluster up many times, the virtual IPs were exhausted.

kubeadm reset

rm -rf /var/lib/cni/flannel/*

rm -rf /var/lib/cni/networks/cbr0/*

ip link delete cni0

ip link delete flannel.1

Reboot!!!! After too many kubeadm resets there may be residue from network setup; a reboot clears it.

https://github.com/kubernetes/kubernetes/issues/57280

Both of these problems are network-related: services fail to start properly because of the virtual network. The cause is running kubeadm reset and restarting flannel (or another network plugin) many times; reset may not clean up thoroughly, so after many resets you hit problems like running out of IPs. The fix is to reset first, then delete the directories and configuration, and reboot the machine (possibly unnecessary). Normally doing this on the failing machine is enough, though you can do it on every machine. Then re-initialize the k8s cluster.
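The steps above, collected in one place. The block below only writes the script to /tmp and syntax-checks it; it deliberately does not run it, since every step is destructive and belongs on the affected node:

```shell
# Write the cleanup sequence described above to a script for review.
cat <<'EOF' > /tmp/k8s-net-cleanup.sh
#!/bin/sh
kubeadm reset
rm -rf /var/lib/cni/flannel/*
rm -rf /var/lib/cni/networks/cbr0/*
ip link delete cni0        # one device per ip-link call
ip link delete flannel.1
reboot                     # clears any remaining virtual-network residue
EOF

# Syntax-check only; run it manually (as root) on the failing node.
sh -n /tmp/k8s-net-cleanup.sh && echo "syntax OK"
```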

pod ContainerCreating

Checking the pods shows one that will not come up:

default po/httpd-68f9d7648d-5f9gt 0/1 ContainerCreating 0 1m tensorflow0

A describe says the sandbox failed to be created:

Warning FailedCreatePodSandBox 20s (x12 over 54s) kubelet, tensorflow0 Failed create pod sandbox.

Normal SandboxChanged 20s (x12 over 53s) kubelet, tensorflow0 Pod sandbox changed, it will be killed and re-created.

Go to the machine where the pod will not come up and check kubelet's status.

We find:

Error while adding to cni network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.2.1/24.

Error while adding to cni network: failed to allocate for range 0: no IP addresses available in range set: 10.244.2.1-10.244.2.254

[root@tensorflow0 ~]# systemctl status kubelet

● kubelet.service - kubelet: The Kubernetes Node Agent

Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)

Drop-In: /etc/systemd/system/kubelet.service.d

└─10-kubeadm.conf

Active: active (running) since Thu 2018-03-22 14:49:29 CST; 4min 12s ago

Docs: http://kubernetes.io/docs/

Main PID: 3873 (kubelet)

Memory: 45.0M

CGroup: /system.slice/kubelet.service

├─ 3873 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --feature-gates=DevicePlugins=true --pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true --network-plugin=cni -...

├─11665 /opt/cni/bin/flannel

└─11670 /opt/cni/bin/bridge

Mar 22 14:53:35 tensorflow0 kubelet[3873]: E0322 14:53:35.990200 3873 kuberuntime_manager.go:647] createPodSandbox for pod "httpd-68f9d7648d-5f9gt_default(39f66066-2d9d-11e8-bf17-98eecb73f4db)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to ...

Mar 22 14:53:35 tensorflow0 kubelet[3873]: E0322 14:53:35.990287 3873 pod_workers.go:186] Error syncing pod 39f66066-2d9d-11e8-bf17-98eecb73f4db ("httpd-68f9d7648d-5f9gt_default(39f66066-2d9d-11e8-bf17-98eecb73f4db)"), skipping: failed to "Cre...bf17-98eecb73f4db)"

Mar 22 14:53:37 tensorflow0 kubelet[3873]: W0322 14:53:37.041536 3873 pod_container_deletor.go:77] Container "73c43b8766686c64d31bdd0533604d1d349ebe08f95d7463d23ebdffe377113e" not found in pod's containers

Mar 22 14:53:39 tensorflow0 kubelet[3873]: E0322 14:53:39.621047 3873 cni.go:259] Error adding network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.2.1/24

Mar 22 14:53:39 tensorflow0 kubelet[3873]: E0322 14:53:39.621083 3873 cni.go:227] Error while adding to cni network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.2.1/24

Mar 22 14:53:39 tensorflow0 kubelet[3873]: E0322 14:53:39.809286 3873 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "httpd-68f9d7648d-5f9gt_default" net...t from 10.244.2.1/24

Mar 22 14:53:39 tensorflow0 kubelet[3873]: E0322 14:53:39.809337 3873 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "httpd-68f9d7648d-5f9gt_default(39f66066-2d9d-11e8-bf17-98eecb73f4db)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to s...

Mar 22 14:53:39 tensorflow0 kubelet[3873]: E0322 14:53:39.809360 3873 kuberuntime_manager.go:647] createPodSandbox for pod "httpd-68f9d7648d-5f9gt_default(39f66066-2d9d-11e8-bf17-98eecb73f4db)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to ...

Mar 22 14:53:39 tensorflow0 kubelet[3873]: E0322 14:53:39.809424 3873 pod_workers.go:186] Error syncing pod 39f66066-2d9d-11e8-bf17-98eecb73f4db ("httpd-68f9d7648d-5f9gt_default(39f66066-2d9d-11e8-bf17-98eecb73f4db)"), skipping: failed to "Cre...bf17-98eecb73f4db)"

Mar 22 14:53:40 tensorflow0 kubelet[3873]: W0322 14:53:40.063548 3873 pod_container_deletor.go:77] Container "f1b063e5245c7a5c8527d1426858781c6554bcb06d987c7f472cfd0c41290110" not found in pod's containers

Hint: Some lines were ellipsized, use -l to show in full.

Fix:

Take down cni-flannel, stop the cluster, and clean up the environment:

rm -rf /var/lib/cni/flannel/ && rm -rf /var/lib/cni/networks/cbr0/ && ip link delete cni0

rm -rf /var/lib/cni/networks/cni0/*

Cleaning up just the machine that reports the error is enough.

Joining nodes

The node joins without errors but is invisible from the master, because kubelet failed to start; the cgroup-driver must be changed here as well.

Restart kubelet

Run kubeadm join xxx again

Error:

[preflight] Running pre-flight checks.

[preflight] Some fatal errors occurred:

Deleting the existing files fixes it.

Run kubeadm reset before kubeadm join.

Removing a node

On the master run kubectl delete node {nodename}

e.g.:

kubectl delete node tensorflow0

On the node run kubeadm reset

(once the master has performed the delete-node operation)
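For completeness, the same sequence wrapped in a small function, with a kubectl drain first so pods are evicted cleanly before the delete (tensorflow0 is the example node name from above). The function is only defined here as a sketch, not invoked:

```shell
# Sketch: full node-removal sequence. Defined, not executed.
remove_node() {
  node="$1"
  kubectl drain "$node" --ignore-daemonsets --delete-local-data   # on the master
  kubectl delete node "$node"                                     # on the master
  # then, on the removed node itself:
  #   kubeadm reset
}

# Usage (on the master): remove_node tensorflow0
```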

k8s + gpu

https://github.com/NVIDIA/k8s-device-plugin#preparing-your-gpu-nodes

Note that default-runtime must be set.
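Concretely, the NVIDIA k8s-device-plugin README has you make nvidia the default docker runtime via /etc/docker/daemon.json. A sketch of that file, staged in /tmp here; install it on the GPU node and restart docker (the runtime path assumes the standard nvidia-container-runtime install location):

```shell
# Stage the daemon.json described in the NVIDIA README.
cat <<'EOF' > /tmp/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF

# On the GPU node (back up any existing file first):
#   cp /tmp/daemon.json /etc/docker/daemon.json && systemctl restart docker
```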

Containers not started on the virtual network

The virtual subnet 10.244.0.0/16 was configured.

Containers should come up on that virtual subnet, one IP per container; environment 2 is currently not behaving this way. As a result IPs cannot be pinned down precisely, and distributed tf jobs cannot run.

Same fix as "kube-dns fails to start" above.

It is recommended to enable the following setting; with it off I did not notice problems, but sometimes the value inexplicably becomes 0 and errors follow, so it is better to configure it explicitly.

echo 'net.bridge.bridge-nf-call-iptables=1' >> /etc/sysctl.conf

sysctl -p

net.ipv4.ip_forward = 1

net.bridge.bridge-nf-call-iptables = 1

net.bridge.bridge-nf-call-ip6tables = 1

Cannot connect to the cluster

[root@tensorflow1 influxdb]# kubectl get all -o wide -n kube-system

error: {batch cronjobs} matches multiple kinds [batch/v1beta1, Kind=CronJob batch/v2alpha1, Kind=CronJob]

The cause: the k8s settings configured in ~/.bash_profile had been lost.

nvidia-device-plugin-daemonset fails to start

Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused "process_linux.go:337: running prestart hook 1 caused "error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --utility --pid=12545 /data1/docker/overlay/10be1d599f91da020b7bfced8058533bb6129b637871ea61e0547ecb8758b3a2/merged]\n Error in `/usr/bin/nvidia-container-cli': double free or corruption (!prev): 0x000055c6961daa10 \n======= Backtrace: =========\n/lib64/libc.so.6(+0x7c619)[0x7f5aa0af0619]\n/usr/lib64/nvidia/libcuda.so.1(+0x2edd7c)[0x7f5a9fb77d7c]\n/usr/lib64/nvidia/libcuda.so.1(+0x2eddc3)[0x7f5a9fb77dc3]\n/usr/lib64/nvidia/libcuda.so.1

The GPU turned out to be already occupied; after cleaning that up, the daemonset started without problems.

Reposted from CSDN: Installing k8s 1.9.0 in Practice: A Collection of Problems
