Why kubectl top, cAdvisor metrics, and docker stats report different values

kubectl top, the k8s dashboard, and scheduling components such as HPA all consume the same data. The data path is traced below, starting from its source.

cgroup

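The values in the cgroup files are the ultimate source of the monitoring data. For example:

Memory usage comes from /sys/fs/cgroup/memory/docker/[containerId]/memory.usage_in_bytes
If no memory limit is set, Limit = machine_mem; otherwise it comes from /sys/fs/cgroup/memory/docker/[id]/memory.limit_in_bytes
Memory utilization = memory.usage_in_bytes / memory.limit_in_bytes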
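
The utilization formula above can be sketched in Python. This is a minimal illustration rather than what any Kubernetes component actually runs: the function names are hypothetical, and it assumes that a cgroup with no limit reports a value larger than physical memory, so any limit above machine_mem is treated as "no limit".

```python
from pathlib import Path


def read_cgroup_value(cgroup_dir: str, name: str) -> int:
    """Read one integer control file, e.g. memory.usage_in_bytes."""
    return int((Path(cgroup_dir) / name).read_text().strip())


def effective_limit(limit_in_bytes: int, machine_mem_bytes: int) -> int:
    """Fall back to machine memory when the cgroup is effectively unlimited.

    Assumption: an unlimited cgroup reports a limit far above physical
    memory, so a limit larger than the machine means "no limit".
    """
    return min(limit_in_bytes, machine_mem_bytes)


def memory_utilization(usage_in_bytes: int, limit_in_bytes: int,
                       machine_mem_bytes: int) -> float:
    """Memory utilization = memory.usage_in_bytes / effective limit."""
    return usage_in_bytes / effective_limit(limit_in_bytes, machine_mem_bytes)
```

On a node, usage and limit would come from read_cgroup_value(dir, "memory.usage_in_bytes") and read_cgroup_value(dir, "memory.limit_in_bytes"), where dir is the container's cgroup directory shown above.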

In general, the cgroup directory holds information on CPU, memory, disk, network, and so on.

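How is kubectl top pod memory calculated, and does it include the pause container?

Every pod starts with a pause container, and since it is a container it consumes resources (usually 2-3 MB of memory). In the cgroup files, the business containers and the pause container all sit under the same pod directory.

However, when cAdvisor queries a pod's memory usage, it first fetches the list of containers under the pod and then reads each container's memory usage one by one. That container list does not include pause, so the final kubectl top pod result does not include the pause container either.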
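
As a rough sketch of the aggregation described above: sum the per-container values and skip pause. The dictionary layout and the is_pause field are hypothetical, chosen only for illustration; they are not cAdvisor's data model.

```python
def pod_memory_bytes(containers: list) -> int:
    """Sum memory over a pod's containers, leaving the pause container out.

    Each entry is a dict with hypothetical keys:
    "name", "is_pause" and "working_set_bytes".
    """
    return sum(c["working_set_bytes"] for c in containers if not c["is_pause"])


MIB = 1024 * 1024
example_pod = [
    {"name": "app", "is_pause": False, "working_set_bytes": 100 * MIB},
    {"name": "sidecar", "is_pause": False, "working_set_bytes": 20 * MIB},
    {"name": "pause", "is_pause": True, "working_set_bytes": 2 * MIB},
]
```

With this sample data, pod_memory_bytes(example_pod) counts only app and sidecar, so the 2 MiB used by pause does not appear in the total.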

Calculating a pod's memory usage

The memory usage reported by kubectl top pod is not cAdvisor's container_memory_usage_bytes but container_memory_working_set_bytes. The calculation is:

container_memory_usage_bytes = container_memory_rss + container_memory_cache + kernel memory
container_memory_working_set_bytes = container_memory_usage_bytes - total_inactive_file (inactive file-backed cache pages)

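container_memory_working_set_bytes is the amount of memory the container is really using, and it is also the basis for the OOM decision when a limit is set.

In cAdvisor, container_memory_usage_bytes corresponds to the cgroup file memory.usage_in_bytes, but container_memory_working_set_bytes has no file of its own. It is computed in the cAdvisor code.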
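
The calculation can be paraphrased in Python as follows. This is a sketch based on the formula given earlier, not a copy of the cAdvisor source: the function names are hypothetical, and clamping the result at zero is an assumption about how the case where the inactive file cache exceeds usage should be handled.

```python
def parse_memory_stat(text: str) -> dict:
    """Parse the "key value" lines of a cgroup memory.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def working_set_bytes(usage_in_bytes: int, memory_stat: dict) -> int:
    """working set = usage - total_inactive_file, never below zero."""
    inactive_file = memory_stat.get("total_inactive_file", 0)
    if usage_in_bytes < inactive_file:
        return 0
    return usage_in_bytes - inactive_file
```

Here usage_in_bytes is the content of memory.usage_in_bytes, and memory_stat is the parsed content of the memory.stat file in the same cgroup directory.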

Likewise, a node's memory usage is also reported as container_memory_working_set_bytes.

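4.3 How is kubectl top node calculated, and how does it differ from running top directly on the node?

The CPU and memory values from kubectl top node are not the sum of all pods on the node, so do not simply add pod values together. kubectl top node reports the aggregate statistics at the root of the machine's cgroup hierarchy.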

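The values shown by running top directly on the machine cannot be compared directly with kubectl top node, because they are calculated differently. For memory, the rough correspondence is (the former is top on the machine, the latter is kubectl top):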

rss + cache = (in)active_anon + (in)active_file

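4.4 kubectl top pod differs from the top output seen after exec-ing into the pod

The differences in the top command are the same as above, so the two cannot be compared directly. In addition, even if you set a limit on the pod, the memory and CPU totals that top shows inside the pod are still the machine totals, not the amount allocatable to the pod.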
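
To see this from inside a container, one can compare the MemTotal line of /proc/meminfo, which reflects the machine, with the cgroup limit. The sketch below only parses text passed to it. The function names are hypothetical, and it assumes the usual "MemTotal: <n> kB" layout of /proc/meminfo.

```python
def mem_total_bytes(meminfo_text: str) -> int:
    """Extract MemTotal from /proc/meminfo content and convert kB to bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found")


def top_overstates_capacity(meminfo_text: str, limit_in_bytes: int) -> bool:
    """True when the total visible to top is larger than the pod's limit."""
    return mem_total_bytes(meminfo_text) > limit_in_bytes
```

Inside a limited pod, feeding in the content of /proc/meminfo and the value of memory.limit_in_bytes would be expected to return True, since top reads the machine-wide figure.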

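A process's RSS is all the physical memory the process uses (file_rss + anon_rss), that is, anonymous pages + mapped pages (including shared memory).
The cgroup RSS is anonymous and swap cache memory, and does not include shared memory. Neither includes the file cache.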

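4.5 Why do kubectl top pod and docker stats report different values?

docker stats dockerID shows the container's current usage.

If your pod has only one container, you will find that the docker stats value does not equal the kubectl top value: it matches neither container_memory_usage_bytes nor container_memory_working_set_bytes.

This is because docker stats and cAdvisor calculate differently, and the docker stats value generally comes out lower than kubectl top. The calculation is: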

docker stats = container_memory_usage_bytes - container_memory_cache
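
A small worked comparison of the two formulas, using made-up numbers. The function names are hypothetical; the arithmetic follows the formulas stated in this article.

```python
def kubectl_top_memory(usage: int, inactive_file: int) -> int:
    """container_memory_working_set_bytes = usage - total_inactive_file."""
    return max(usage - inactive_file, 0)


def docker_stats_memory(usage: int, cache: int) -> int:
    """docker stats = container_memory_usage_bytes - container_memory_cache."""
    return usage - cache


# Made-up sample: 1000 MiB used, of which 300 MiB is page cache,
# and 200 MiB of that cache is inactive file pages.
MIB = 1024 * 1024
usage, cache, inactive_file = 1000 * MIB, 300 * MIB, 200 * MIB
top_value = kubectl_top_memory(usage, inactive_file)
stats_value = docker_stats_memory(usage, cache)
```

With these numbers kubectl top reports 800 MiB and docker stats reports 700 MiB. The gap is the active part of the file cache, which docker stats subtracts and the working set keeps.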

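5. Afterword

In general, we do not need to watch node or pod usage all the time, because the cluster autoscaler (cluster-autoscaler) and horizontal pod autoscaling (HPA) handle changes in these two kinds of resources. Resource metrics are better put to use by persisting cAdvisor data with Prometheus, for reviewing history or sending alerts.

For more on Prometheus, see the container monitoring series.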

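Other notes:

Although kubectl top help shows Storage as supported, it was still unsupported as of version 1.16.
Before 1.13 heapster is required, and from 1.13 onward metrics-server is required. The kubectl top help output is wrong on this point, since it mentions only heapster.
The monitoring graphs in the k8s dashboard use heapster by default. After switching to metrics-server the data becomes abnormal, and an additional metrics scraper pod (dashboard-metrics-scraper) has to be deployed to translate the API. For details see the PR: https://github.com/kubernetes/dashboard/pull/3504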

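6. References

Author: 二二向箔. Link: https://www.jianshu.com/p/64230e3b6e6c. Source: Jianshu (简书). Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, credit the source.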

