
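# Troubleshooting Tools

Essential tools:

* `kubectl`: inspects the state of the Kubernetes cluster and its containers, e.g. `kubectl describe pod <pod-name>`
* `journalctl`: reads the logs of Kubernetes components, e.g. `journalctl -u kubelet -l`
* `iptables` and `ebtables`: check whether a Service is working, e.g. `iptables -t nat -nL` shows whether the iptables rules configured by kube-proxy are correct
* `tcpdump`: diagnoses container network problems, e.g. `tcpdump -nn host 10.240.0.8`
* `perf`: the performance analysis tool that ships with the Linux kernel, commonly used to track down performance problems such as the one described in [Container Isolation Gone Wrong](https://dzone.com/articles/container-isolation-gone-wrong)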
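Before starting a debugging session on a node, it can help to confirm that these tools are actually installed. The following is a minimal sketch rather than part of the original guide: the function name `missing_tools` is made up for illustration, and it relies only on the POSIX `command -v` builtin.

```shell
#!/bin/sh
# missing_tools: print the name of every argument that is not found in PATH.
# The function name is hypothetical; `command -v` is the POSIX way to test
# whether a command exists.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Check the tools listed above; prints nothing when all of them are present.
missing_tools kubectl journalctl iptables ebtables tcpdump perf
```

Each missing tool is printed on its own line, so the output can be passed directly to a package manager or a provisioning script.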

## sysdig <a href="#sysdig" id="sysdig"></a>

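sysdig is a container troubleshooting tool available in both open-source and commercial editions. The open-source edition is sufficient for routine troubleshooting.

Two companion tools are available alongside sysdig:

* csysdig: installed automatically with sysdig; provides an interactive terminal interface
* [sysdig-inspect](https://github.com/draios/sysdig-inspect): provides a graphical (non-real-time) interface for trace files saved by sysdig (e.g. with `sudo sysdig -w filename.scap`)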

### Installation <a href="#an-zhuang" id="an-zhuang"></a>

```
# on Ubuntu
curl -s https://s3.amazonaws.com/download.draios.com/DRAIOS-GPG-KEY.public | apt-key add -
curl -s -o /etc/apt/sources.list.d/draios.list http://download.draios.com/stable/deb/draios.list
apt-get update
apt-get -y install linux-headers-$(uname -r)
apt-get -y install sysdig

# on RHEL
rpm --import https://s3.amazonaws.com/download.draios.com/DRAIOS-GPG-KEY.public
curl -s -o /etc/yum.repos.d/draios.repo http://download.draios.com/stable/rpm/draios.repo
rpm -i http://mirror.us.leaseweb.net/epel/6/i386/epel-release-6-8.noarch.rpm
yum -y install kernel-devel-$(uname -r)
yum -y install sysdig

# on MacOS
brew install sysdig
```

### Examples <a href="#shi-li" id="shi-li"></a>

```
# Refer https://www.sysdig.org/wiki/sysdig-examples/.
# View the top network connections
sudo sysdig -pc -c topconns
# View the top network connections inside the wordpress1 container
sudo sysdig -pc -c topconns container.name=wordpress1

# Show the network data exchanged with the host 192.168.0.1
sudo sysdig fd.ip=192.168.0.1
sudo sysdig -s2000 -A -c echo_fds fd.cip=192.168.0.1

# List all the incoming connections that are not served by apache.
sudo sysdig -p"%proc.name %fd.name" "evt.type=accept and proc.name!=httpd"

# View the CPU/Network/IO usage of the processes running inside the container.
sudo sysdig -pc -c topprocs_cpu container.id=2e854c4525b8
sudo sysdig -pc -c topprocs_net container.id=2e854c4525b8
sudo sysdig -pc -c topfiles_bytes container.id=2e854c4525b8

# See the files where apache spends the most time doing I/O
sudo sysdig -c topfiles_time proc.name=httpd

# Show all the interactive commands executed inside a given container.
sudo sysdig -pc -c spy_users

# Show every time a file is opened under /etc.
sudo sysdig evt.type=open and fd.name contains /etc

# View the list of processes with container context
sudo csysdig -pc
```

For more examples and usage details, see the [Sysdig User Guide](https://github.com/draios/sysdig/wiki/Sysdig-User-Guide).

## Weave Scope <a href="#weave-scope" id="weave-scope"></a>

Weave Scope is another visual tool for container monitoring and troubleshooting. Unlike sysdig, it has no powerful command-line tooling, but it offers a simple interactive UI that automatically maps the topology of the whole cluster and can be extended through plugins. According to its official site, its features include:

* [Interactive topology view](https://www.weave.works/docs/scope/latest/features/#topology-mapping)
* [Graph and table modes](https://www.weave.works/docs/scope/latest/features/#mode)
* [Filtering](https://www.weave.works/docs/scope/latest/features/#flexible-filtering)
* [Search](https://www.weave.works/docs/scope/latest/features/#powerful-search)
* [Real-time metrics](https://www.weave.works/docs/scope/latest/features/#real-time-app-and-container-metrics)
* [Container troubleshooting](https://www.weave.works/docs/scope/latest/features/#interact-with-and-manage-containers)
* [Plugin extensions](https://www.weave.works/docs/scope/latest/features/#custom-plugins)

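Weave Scope consists of [two components, App and Probe](https://www.weave.works/docs/scope/latest/how-it-works):

* Probe collects information about containers and hosts and sends it to the App
* App processes this information, generates the corresponding reports, and presents them in an interactive UI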

```
                    +--Docker host----------+      +--Docker host----------+
.---------------.   |  +--Container------+  |      |  +--Container------+  |
| Browser       |   |  |                 |  |      |  |                 |  |
|---------------|   |  |  +-----------+  |  |      |  |  +-----------+  |  |
|               |----->|  | scope-app |<-----.    .----->| scope-app |  |  |
|               |   |  |  +-----------+  |  | \  / |  |  +-----------+  |  |
|               |   |  |        ^        |  |  \/  |  |        ^        |  |
'---------------'   |  |        |        |  |  /\  |  |        |        |  |
                    |  | +-------------+ |  | /  \ |  | +-------------+ |  |
                    |  | | scope-probe |-----'    '-----| scope-probe | |  |
                    |  | +-------------+ |  |      |  | +-------------+ |  |
                    |  |                 |  |      |  |                 |  |
                    |  +-----------------+  |      |  +-----------------+  |
                    +-----------------------+      +-----------------------+
```

### Installation <a href="#an-zhuang-1" id="an-zhuang-1"></a>

```
kubectl apply -f "https://cloud.weave.works/k8s/scope.yaml?k8s-version=$(kubectl version | base64 | tr -d '\n')&k8s-service-type=LoadBalancer"
```

### Accessing the UI <a href="#cha-kan-jie-mian" id="cha-kan-jie-mian"></a>

Once installation completes, the interactive UI is reachable through the `weave-scope-app` Service:

```
kubectl -n weave get service weave-scope-app
kubectl -n weave port-forward service/weave-scope-app :80
```

![](https://blobscdn.gitbook.com/v0/b/gitbook-28427.appspot.com/o/assets%2F-LDAOok5ngY4pc1lEDes%2F-LM_rqip-tinVoiFZE0I%2F-LM_s4D0LrMIzahbE7N5%2Fweave-scope.png?generation=1537160010349344\&alt=media)

Clicking a Pod also shows the real-time status and metrics of all containers in that Pod:

![](https://blobscdn.gitbook.com/v0/b/gitbook-28427.appspot.com/o/assets%2F-LDAOok5ngY4pc1lEDes%2F-LM_rqip-tinVoiFZE0I%2F-LM_s4D2u6giEHDhgIP5%2Fscope-pod.png?generation=1537160010963692\&alt=media)

### Known Issues <a href="#yi-zhi-wen-ti" id="yi-zhi-wen-ti"></a>

On Ubuntu with kernel 4.4.0, when `--probe.ebpf.connections` is enabled (the default), a Node may [reboot continuously because of a kernel bug](https://github.com/weaveworks/scope/issues/3131):

```
[ 263.736006] CPU: 0 PID: 6309 Comm: scope Not tainted 4.4.0-119-generic #143-Ubuntu
[ 263.736006] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 06/02/2017
[ 263.736006] task: ffff88011cef5400 ti: ffff88000a0e4000 task.ti: ffff88000a0e4000
[ 263.736006] RIP: 0010:[] [] bpf_map_lookup_elem+0x6/0x20
[ 263.736006] RSP: 0018:ffff88000a0e7a70 EFLAGS: 00010082
[ 263.736006] RAX: ffffffff8117cd70 RBX: ffffc90000762068 RCX: 0000000000000000
[ 263.736006] RDX: 0000000000000000 RSI: ffff88000a0e7cd8 RDI: 000000001cdee380
[ 263.736006] RBP: ffff88000a0e7cf8 R08: 0000000005080021 R09: 0000000000000000
[ 263.736006] R10: 0000000000000020 R11: ffff880159e1c700 R12: 0000000000000000
[ 263.736006] R13: ffff88011cfaf400 R14: ffff88000a0e7e38 R15: ffff88000a0f8800
[ 263.736006] FS: 00007f5b0cd79700(0000) GS:ffff88015b600000(0000) knlGS:0000000000000000
[ 263.736006] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 263.736006] CR2: 000000001cdee3a8 CR3: 000000011ce04000 CR4: 0000000000040670
[ 263.736006] Stack:
[ 263.736006] ffff88000a0e7cf8 ffffffff81177411 0000000000000000 00001887000018a5
[ 263.736006] 000000001cdee380 ffff88000a0e7cd8 0000000000000000 0000000000000000
[ 263.736006] 0000000005080021 ffff88000a0e7e38 0000000000000000 0000000000000046
[ 263.736006] Call Trace:
[ 263.736006] [] ? __bpf_prog_run+0x7a1/0x1360
[ 263.736006] [] ? update_curr+0x79/0x170
[ 263.736006] [] ? update_cfs_shares+0xbc/0x100
[ 263.736006] [] ? update_curr+0x79/0x170
[ 263.736006] [] ? dput+0xb8/0x230
[ 263.736006] [] ? follow_managed+0x265/0x300
[ 263.736006] [] ? kmem_cache_alloc_trace+0x1d4/0x1f0
[ 263.736006] [] ? seq_open+0x5a/0xa0
[ 263.736006] [] ? probes_open+0x33/0x100
[ 263.736006] [] ? dput+0x34/0x230
[ 263.736006] [] ? mntput+0x24/0x40
[ 263.736006] [] trace_call_bpf+0x37/0x50
[ 263.736006] [] kretprobe_perf_func+0x3d/0x250
[ 263.736006] [] ? pre_handler_kretprobe+0x135/0x1b0
[ 263.736006] [] kretprobe_dispatcher+0x3d/0x60
[ 263.736006] [] ? do_sys_open+0x1b2/0x2a0
[ 263.736006] [] ? kretprobe_trampoline_holder+0x9/0x9
[ 263.736006] [] trampoline_handler+0x133/0x210
[ 263.736006] [] ? do_sys_open+0x1b2/0x2a0
[ 263.736006] [] kretprobe_trampoline+0x25/0x57
[ 263.736006] [] ? kretprobe_trampoline_holder+0x9/0x9
[ 263.736006] [] SyS_openat+0x14/0x20
[ 263.736006] [] entry_SYSCALL_64_fastpath+0x1c/0xbb
```
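To check whether a node has hit this particular crash, the kernel log can be searched for the faulting symbol shown on the `RIP:` line of the trace. This is only a heuristic sketch: the helper name `has_bpf_oops` is hypothetical, and a match on `bpf_map_lookup_elem` alone does not prove that Weave Scope was the cause.

```shell
#!/bin/sh
# has_bpf_oops: read kernel log text on stdin and succeed when it contains a
# RIP line that points at bpf_map_lookup_elem, as in the trace above.
# Hypothetical helper; matching on this one symbol is a heuristic.
has_bpf_oops() {
    grep -q 'RIP:.*bpf_map_lookup_elem'
}

# Example (left commented out because it needs a persistent systemd journal):
# scan the kernel messages of the previous boot.
# journalctl -k -b -1 | has_bpf_oops && echo "eBPF crash signature found"
```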

There are two workarounds:

* Disable eBPF connection tracking, e.g. `--probe.ebpf.connections=false`
* Upgrade the kernel, e.g. to 4.13.0
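To decide which workaround applies to a node, the running kernel release can be compared against the upgrade target. The sketch below rests on stated assumptions: the helper `version_lt` is hypothetical, and 4.13.0 is used as the threshold only because the text names it as an example, not because it is a verified cut-off for the bug.

```shell
#!/bin/sh
# version_lt A B: succeed when kernel release A is strictly older than B.
# Hypothetical helper: compares only the numeric major.minor.patch prefix,
# so a suffix such as "-119-generic" is ignored.
version_lt() {
    awk -v a="${1%%-*}" -v b="${2%%-*}" 'BEGIN {
        split(a, x, "."); split(b, y, ".")
        for (i = 1; i <= 3; i++) {
            if (x[i] + 0 < y[i] + 0) exit 0   # A is older
            if (x[i] + 0 > y[i] + 0) exit 1   # A is newer
        }
        exit 1                                # equal, so not older
    }'
}

# 4.13.0 is the example target from the text, assumed here as the threshold.
if version_lt "$(uname -r)" 4.13.0; then
    echo "kernel $(uname -r): consider --probe.ebpf.connections=false or a kernel upgrade"
else
    echo "kernel $(uname -r): at or above 4.13.0"
fi
```

The comparison is done numerically per component, so `4.4.0` correctly sorts before `4.13.0`, which a plain string comparison would get wrong.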

## References <a href="#can-kao-wen-dang" id="can-kao-wen-dang"></a>

* [Overview of kubectl](https://kubernetes.io/docs/reference/kubectl/overview/)
* [Monitoring Kubernetes with sysdig](https://sysdig.com/blog/kubernetes-service-discovery-docker/)
