# Kubernetes Scheduling
# Workflow for Creating a Pod
Kubernetes uses a controller architecture built on the list-watch mechanism, which decouples the interactions between components.
Workflow for creating a Pod:
kubectl run > apiserver > etcd > scheduler > kubelet > docker > container
- kubectl submits a request to create a Pod.
- The apiserver receives the request and writes the Pod configuration to etcd.
- The scheduler obtains the Pod configuration via list/watch, selects a suitable node for it, and returns the choice to the apiserver.
- The kubelet picks up the Pods bound to its own node.
- The kubelet calls the container runtime API to create the containers and reports the result back to the apiserver.
controller-manager and kube-proxy
- Consider: when you create a Deployment, what does the controller-manager do? (It watches the Deployment, creates the corresponding ReplicaSet, which in turn creates the Pods.)
- kube-proxy is the component that actually implements Services.
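As a quick illustration of that chain (the deployment name demo is arbitrary), you can create a Deployment and list the ReplicaSet and Pods that the controller-manager derives from it:

[root@master ~]# kubectl create deployment demo --image=nginx
[root@master ~]# kubectl get deployment,replicaset,pod -l app=demo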
# Main Pod Attributes That Affect Scheduling
- Look at an example YAML:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: default
spec:
  ...
  containers:
  - image: lizhenliang/java-demo
    name: java-demo
    imagePullPolicy: Always
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 20
      tcpSocket:
        port: 8080
    resources: {}                      # basis for resource-aware scheduling
  restartPolicy: Always
  schedulerName: default-scheduler     # which scheduler handles this Pod
  nodeName: ""
  nodeSelector: {}
  affinity: {}
  tolerations: []
# How Resource Limits Affect Pod Scheduling
resources: container resource limits
Limit keys (the upper bound a container may use):
- resources.limits.cpu
- resources.limits.memory
Request keys (the minimum resources the container needs; the scheduler allocates nodes based on these):
- resources.requests.cpu
- resources.requests.memory
Note: CPU can be written in millicores (m) or as a decimal, e.g. 0.5 = 500m, 1 = 1000m; the two notations are shown side by side in the snippet after this list.
m = millicore
- 0.5 CPU = 500m
- 1 CPU = 1000m
- 2 CPU = 2000m
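A minimal sketch of the two equivalent notations (not tied to any Pod in this document):

# decimal notation
resources:
  requests:
    cpu: "0.5"      # half a core
# millicore notation, same amount
resources:
  requests:
    cpu: 500m       # 500 millicores = 0.5 core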
View the resource allocation of all containers on a node:
[root@master ~]# kubectl describe node k8s-node1
- Kubernetes uses the request values to find a Node with enough allocatable resources for the Pod:
[root@master ~]# kubectl run web --image=nginx --limits memory=128Mi,cpu=500m --requests memory=64Mi,cpu=250m -o yaml --dry-run > pod.yaml
[root@master ~]# cat pod.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: web
  name: web
spec:
  containers:
  - image: nginx
    name: web
    resources:
      limits:                 # resource limits for the container
        cpu: 500m
        memory: 128Mi
      requests:               # resource requests (requests must not exceed limits)
        cpu: 250m
        memory: 64Mi
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
[root@master ~]# kubectl apply -f pod.yaml
pod/web configured
[root@master ~]# kubectl get pods
NAME READY STATUS RESTARTS AGE
web 1/1 Running 0 17m
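As a counter-example sketch (the Pod name big-request is hypothetical, and it assumes no node has 100 free CPUs), a request that no node can satisfy leaves the Pod Pending with an Insufficient cpu event, as covered in the failure section at the end:

apiVersion: v1
kind: Pod
metadata:
  name: big-request
spec:
  containers:
  - image: nginx
    name: nginx
    resources:
      requests:
        cpu: "100"        # far more than any node can offer, so scheduling fails
        memory: 64Mi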
# nodeSelector && nodeAffinity
nodeSelector: node selector; schedules the Pod onto Nodes whose labels match. If no node carries the required labels, scheduling fails and the Pod stays Pending.
Purpose:
- Exact (equality) match on node labels
- Pin a Pod to particular nodes
Label a node:
[root@master ~]# kubectl label nodes node1 type=os
node/node1 labeled
- View node labels:
[root@master ~]# kubectl get node --show-labels
[root@master ~]# kubectl get node -l type=os
NAME STATUS ROLES AGE VERSION
node1 Ready <none> 5h36m v1.20.0
- Create a Pod that uses a nodeSelector:
[root@master ~]# kubectl run nginx --image=nginx:1.19 -o yaml --dry-run > pod2.yaml
[root@master ~]# cat pod2.yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: nginx
  name: nginx
spec:
  nodeSelector:
    disktype: "ssd"
  containers:
  - image: nginx:1.19
    name: nginx
  restartPolicy: Always
- Check the Pod status (Pending, because no node has the disktype=ssd label yet):
[root@master ~]# kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx 0/1 Pending 0 2m38s
web 1/1 Running 0 77m
- Add the label to node1:
[root@master ~]# kubectl label nodes node1 disktype=ssd
node/node1 labeled
- Check the Pod status again:
[root@master ~]# kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx 1/1 Running 0 6m8s
web 1/1 Running 0 80m
- To remove a label from a node:
[root@master ~]# kubectl label nodes node1 disktype-
node/node1 labeled
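Note that when a nodeSelector lists several labels, a node must carry all of them (logical AND). A minimal sketch with a hypothetical second label region=beijing:

apiVersion: v1
kind: Pod
metadata:
  name: nginx-two-labels        # hypothetical name
spec:
  nodeSelector:
    disktype: "ssd"
    region: "beijing"           # the node must have both labels to be eligible
  containers:
  - image: nginx:1.19
    name: nginx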
nodeAffinity: node affinity. Like nodeSelector, it uses node labels to constrain which nodes a Pod can be scheduled onto.
Compared with nodeSelector:
- Richer matching logic, not just exact string equality
- Rules come in hard and soft variants instead of only a hard requirement:
  - Hard (required): must be satisfied
  - Soft (preferred): best effort, not guaranteed
Operators: In, NotIn, Exists, DoesNotExist, Gt, Lt
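The examples below only use In; as a sketch, Exists and Gt can appear in the same matchExpressions (the cpu-cores label is hypothetical):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: gpu              # Exists: the node only needs to carry the gpu label, any value
          operator: Exists
        - key: cpu-cores        # Gt: the label value is compared as an integer
          operator: Gt
          values:
          - "8"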
Hard affinity test
[root@master ~]# cat nodeaffinity.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  affinity:
    nodeAffinity:                                        # use node affinity
      requiredDuringSchedulingIgnoredDuringExecution:    # hard requirement
        nodeSelectorTerms:
        - matchExpressions:                              # match expressions
          - key: gpu                                     # label key
            operator: In                                 # the node must have a gpu label with one of the values below
            values:
            - nvidia-testla
  containers:
  - image: nginx:1.19
    name: nginx
# At this point the Pod is not scheduled, because no node has a matching label
[root@master ~]# kubectl apply -f nodeaffinity.yaml
pod/nginx created
[root@master ~]# kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx 0/1 Pending 0 3s
web 1/1 Running 0 128m
Try labeling a node: once node1 gets the gpu=nvidia-testla label, the hard affinity is satisfied and the Pod goes from Pending to Running.
[root@master ~]# kubectl label nodes node1 gpu=nvidia-testla
node/node1 labeled
[root@master ~]# kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx 1/1 Running 0 4m49s
web 1/1 Running 0 133m
Soft affinity test
[root@master ~]# cat nodeaffinity.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1                      # weight (1-100): nodes matching higher-weight preferences score higher
        preference:
          matchExpressions:            # match expressions
          - key: gpu
            operator: In
            values:
            - "yes"
  containers:
  - image: nginx:1.19
    name: nginx
- Set the label on node1:
[root@master ~]# kubectl label nodes node1 gpu=nvidia-testla
node/node1 labeled
- Apply the YAML: the node label (gpu=nvidia-testla) does not match the preferred value (gpu=yes), but the Pod is still scheduled because the rule is only a preference:
[root@master ~]# kubectl apply -f nodeaffinity.yaml
pod/nginx created
[root@master ~]# kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx 1/1 Running 0 3s
web 1/1 Running 0 156m
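Hard and soft rules can also be combined in one Pod: the required block filters candidate nodes first, and the preferred block ranks whatever remains. A sketch reusing the gpu and disktype labels from above (the Pod name is arbitrary):

apiVersion: v1
kind: Pod
metadata:
  name: nginx-affinity-mixed
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:    # filter: only nodes carrying a gpu label
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu
            operator: Exists
      preferredDuringSchedulingIgnoredDuringExecution:   # rank: among them, prefer ssd nodes
      - weight: 10
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
  containers:
  - image: nginx:1.19
    name: nginx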
# Taints & Tolerations
Taints: keep Pods from being scheduled onto specific Nodes.
Use cases:
- Dedicated nodes, e.g. nodes with special hardware
- Taint-based eviction
Add a taint:
[root@master ~]# kubectl taint node [node] key=value:[effect]
where [effect] can be one of the following (a command-line sketch of the three follows the remove command below):
- NoSchedule: Pods that do not tolerate the taint will not be scheduled onto the node
- PreferNoSchedule: the scheduler tries to avoid the node, but it is not a hard guarantee
- NoExecute: new Pods are not scheduled, and existing Pods on the node that do not tolerate the taint are evicted
Remove a taint:
[root@master ~]# kubectl taint node [node] key:[effect]-
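A sketch of the three effects on the command line (the key dedicated=gpu is hypothetical; each can be undone with the remove form above):

[root@master ~]# kubectl taint node node1 dedicated=gpu:NoSchedule         # new Pods without a toleration are rejected
[root@master ~]# kubectl taint node node1 dedicated=gpu:PreferNoSchedule   # the scheduler avoids node1 when it can
[root@master ~]# kubectl taint node node1 dedicated=gpu:NoExecute          # also evicts running Pods that do not tolerate it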
- Add a taint with effect NoSchedule:
# The current Pods are running on node2, so we taint node2
[root@master ~]# kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx 1/1 Running 0 11h 10.244.104.5 node2 <none> <none>
web 1/1 Running 0 14h 10.244.104.2 node2 <none> <none>
# Taint node2
[root@master ~]# kubectl taint node node2 gpu=yes:NoSchedule
node/node2 tainted
# List the taints on every node
[root@master ~]# kubectl describe node | grep Taint
Taints: node-role.kubernetes.io/master:NoSchedule
Taints: <none>
Taints: gpu=yes:NoSchedule
- Run the Pod again and check where it lands:
[root@master ~]# kubectl apply -f nodeaffinity.yaml
pod/nginx created
[root@master ~]# kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx 1/1 Running 0 9s
web 1/1 Running 0 14h
[root@master ~]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx 1/1 Running 0 16s 10.244.166.131 node1 <none> <none>
web 1/1 Running 0 14h 10.244.104.2 node2 <none> <none>
- Remove the taint:
[root@master ~]# kubectl taint node node2 gpu=yes:NoSchedule-
node/node2 untainted
Tolerations: allow a Pod to be scheduled onto Nodes that carry matching Taints.
- Taint node2 again; do not add the disktype label yet:
[root@master ~]# kubectl taint node node2 gpu=yes:NoSchedule
node/node2 tainted
- Run a Pod that has both a nodeSelector and a toleration:
[root@master ~]# cat pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-tolerations
spec:
  nodeSelector:              # node selector: require disktype=ssd
    disktype: "ssd"
  containers:
  - image: nginx
    name: nginx
  tolerations:               # toleration
  - key: "gpu"               # taint key to tolerate
    operator: "Equal"        # operator
    value: "yes"             # taint value; together this tolerates node2's taint gpu=yes:NoSchedule
    effect: "NoSchedule"     # the effect must match the taint's effect
[root@master ~]# kubectl apply -f pod.yaml
pod/pod-tolerations configured
[root@master ~]# kubectl describe pod pod-tolerations
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 7m30s default-scheduler 0/3 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) didn't match Pod's node affinity. # no node can accept the Pod yet
- Label node2 with disktype=ssd:
[root@master ~]# kubectl label nodes node2 disktype=ssd
node/node2 labeled
- Check the Pod status again:
[root@master ~]# kubectl get pods
NAME READY STATUS RESTARTS AGE
pod-tolerations 1/1 Running 0 8m34s
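Two more toleration patterns worth knowing (a sketch, not tied to the Pod above): an empty Exists toleration matches every taint, and tolerationSeconds bounds how long a Pod may stay once a matching NoExecute taint appears:

# Tolerate every taint on any node:
tolerations:
- operator: "Exists"

# Tolerate gpu=yes:NoExecute, but only for one hour after the taint is added:
tolerations:
- key: "gpu"
  operator: "Equal"
  value: "yes"
  effect: "NoExecute"
  tolerationSeconds: 3600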
# nodeName
nodeName: names a node explicitly; the Pod is placed on that Node directly, bypassing the scheduler.
- Write a Pod that specifies nodeName:
[root@master ~]# cat pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-example
spec:
  nodeName: node1
  containers:
  - image: nginx
    name: nginx
- Check the Pod's running status:
[root@master ~]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-example 1/1 Running 0 37s 10.244.166.132 node1 <none> <none>
# DaemonSet Controller
DaemonSet: per-node replica controller
- Features:
  - Runs one Pod on every Node
  - A newly added node automatically gets a Pod as well
Use cases: network plugins, monitoring agents, log agents
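In a kubeadm cluster, kube-proxy itself runs as a DaemonSet, and most CNI plugins (e.g. calico-node, kube-flannel-ds) do as well; you can check with:

[root@master ~]# kubectl get daemonset -n kube-system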
Quickly generate a DaemonSet YAML:
First generate a Deployment YAML, then edit it:
[root@master ~]# kubectl create deployment nginx --image=nginx:1.19 --dry-run -o yaml > Daemonset.yaml
[root@master ~]# cat Daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet            # change the kind to DaemonSet
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  # replicas: 1            # remove the replica count
  selector:
    matchLabels:
      app: nginx
  # strategy: {}           # remove the rolling-update strategy
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx:1.19
        name: nginx
- Check how many nodes there are now:
[root@master ~]# kubectl get node
NAME STATUS ROLES AGE VERSION
master Ready control-plane,master 24h v1.20.0
node1 Ready <none> 19h v1.20.0
node2 Ready <none> 2m23s v1.20.0
- Apply the manifest and check the DaemonSet:
[root@master ~]# kubectl apply -f Daemonset.yaml
daemonset.apps/nginx unchanged
# Only two Pods here, because there are two worker nodes; the DaemonSet runs one Pod on each
[root@master ~]# kubectl get daemonSet
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
nginx 2 2 2 2 2 <none> 10m
- Try adding a toleration for the master's taint so the DaemonSet runs there too:
[root@master ~]# cat Daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      tolerations:           # add a toleration
      - effect: NoSchedule   # tolerate the NoSchedule effect (the master's taint)
        operator: Exists     # Exists matches any taint key with this effect
      containers:
      - image: nginx:1.19
        name: nginx
- Apply the YAML again and check the Pod count in the DaemonSet:
[root@master ~]# kubectl apply -f Daemonset.yaml
daemonset.apps/nginx configured
[root@master ~]# kubectl get daemonsets
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
nginx 3 3 3 3 3 <none> 18m
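A DaemonSet does not have to cover every node: adding a nodeSelector to its Pod template (a sketch reusing the disktype label from earlier) limits it to matching nodes:

spec:
  template:
    spec:
      nodeSelector:
        disktype: ssd       # only nodes labeled disktype=ssd run a copy of this Pod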
# Analyzing Scheduling Failures
View where Pods were scheduled:
[root@master ~]# kubectl get pod -o wide
View the reason a Pod failed to schedule:
[root@master ~]# kubectl describe pod <NAME>
- Not enough CPU/memory on any node:
0/3 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 Insufficient cpu.
- A taint with no matching toleration:
0/3 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) didn't match Pod's node affinity.
- No node matches the Pod's labels/affinity:
0/3 nodes are available: 1 node(s) didn't match Pod's node affinity, 1 node(s) had taint {gpu: yes}, that the pod didn't tolerate, 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
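To pull just the scheduling failures out of the event stream instead of describing each Pod, a field selector on the event reason works:

[root@master ~]# kubectl get events --field-selector reason=FailedScheduling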