4. NFS volume mount failures
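When I took over the container project, persistent storage in the container clusters was in chaos:
- The console allowed creating three types of storage class (Local-PV, NFS, and Ceph), but there was no documentation on how to use them
- Customers created Local-PV PVCs, found them unusable, and filed tickets: the cluster had no driver that could automatically consume PVs backed by local disks
- Customers who wanted to keep data on NFS created file storage in the storage product, added a storage class through the container cluster console, and then found that NFS PVs could not be mounted
In short, storage in the container clusters was almost unusable (I can hardly imagine how the previous on-site colleagues dealt with customers). After I put a great deal of effort into writing documentation and improving the console workflow, cluster storage finally reached a usable state: NFS as the main option, Local-PV as an optional one, and later a new CSI driver for the block storage used by ECS.
I thought the storage features had finally stabilized, but then they tripped badly on ARM64. This section records the troubleshooting process.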
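A customer reported that a Deployment could not finish updating. Initial investigation showed:
- The Deployment's update strategy was RollingUpdate, and its Pod template mounted an NFS volume
- Old Pods were stuck in Terminating and could not be deleted
- New Pods were stuck in ContainerCreating, and describe pod showed that mounting the NFS volume had timed out
- Changing the resources of csi-nfs-node or restarting the csi-nfs-node Pod did not help (the fix used in the past also required rebooting the node, which the customer would not accept)
The plan from there was to:
- Provide a temporary workaround first
- Work with colleagues from the storage department to find the root cause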
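In the lifecycle of a volume, a failed unmount blocks Pod teardown and leaves Pods in Terminating, while a failed mount blocks Pod creation and leaves Pods in ContainerCreating.
For NFS volumes, mount and umount are executed by csi-nfs-node, a Pod running on the node. It performs the mount/unmount inside its own container, whose mount directory is bound to the host directory with bidirectional propagation. Once the share is mounted inside the container, the host can mount that directory into the target path of the workload container that uses the NFS volume.
I asked the on-site colleague to set up remote access. While checking the kernel log of the node hosting the broken Pods and the previous logs of the csi-nfs-node container, I found many records of mount.nfs being killed by the system, all of them OOM kills.
The customer's NFS storage class only set the nolock and vers=3 mount options. The options actually negotiated were: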
rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.255.176.233,mountvers=3,mountport=8752,mountproto=tcp,local_lock=all,addr=10.255.176.233
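To make it easier to see which options came from defaults or negotiation rather than from the storage class, the option string can be split into a dictionary. This is only a small sketch: the helper name parse_mount_options is my own, and the set of configured keys is taken from the two options mentioned above.

```python
def parse_mount_options(options):
    """Split a comma-separated mount option string into a dict.

    Flags without a value (for example "hard") are stored as True.
    """
    parsed = {}
    for item in options.split(","):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


NEGOTIATED = (
    "rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,"
    "proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.255.176.233,"
    "mountvers=3,mountport=8752,mountproto=tcp,local_lock=all,"
    "addr=10.255.176.233"
)

# Only these two keys were set explicitly in the storage class.
CONFIGURED_KEYS = {"nolock", "vers"}

negotiated = parse_mount_options(NEGOTIATED)
implicit = sorted(key for key in negotiated if key not in CONFIGURED_KEYS)

if __name__ == "__main__":
    print("options not set by the storage class:", implicit)
    print("transport:", negotiated["proto"], "rsize:", negotiated["rsize"])
```

Everything in the implicit list, including the transport protocol and the 1 MiB rsize/wsize, was chosen by the client and server rather than by the storage class.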
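Each NFS volume is mounted at the path of the corresponding volume inside the Pod.
When N Pods use the same NFS volume, N mounts and unmounts take place, but the kernel carries the data over a single connection: with dd writing to the NFS share from two Pods at the same time, only one TCP connection showed up on the node.
Because of network restrictions, I could only collect data and take it back to the company for analysis. To reduce the impact of OOM on the customer's workloads, we first removed the CPU and memory limits from the csi-nfs-node component in the customer's cluster so that it could use more memory.
The next step was to reproduce the problem in-house. The customer site runs Phytium processors and the Kylin operating system, and the test also needed file storage provided by the storage team; after some struggle I finally found a suitable private cloud environment.
I created an nginx Deployment that mounts one NFS volume, set the replica count to 6 and the update strategy to Recreate, and triggered a restart each time the Pods had settled to force them to be recreated.
Pod teardown got stuck in Terminating, and checking the csi-nfs-node Pod confirmed that an OOM had occurred. The log on the node: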
hmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:193152KB inactive_file:0KB active_file:0KB unevictable:0KB
[Fri Aug 18 15:46:16 2023] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[Fri Aug 18 15:46:16 2023] [197236] 0 197236 84 41 393216 0 -998 umount
[Fri Aug 18 15:46:16 2023] [197237] 0 197237 84 41 393216 0 -998 umount
[Fri Aug 18 15:46:16 2023] [197239] 0 197239 84 41 327680 0 -998 umount
[Fri Aug 18 15:46:16 2023] [197240] 0 197240 1017 834 327680 0 -998 umount.nfs
[Fri Aug 18 15:46:16 2023] [197242] 0 197242 1017 702 393216 0 -998 umount.nfs
[Fri Aug 18 15:46:16 2023] [197243] 0 197243 1017 702 393216 0 -998 umount.nfs
[Fri Aug 18 15:46:16 2023] [197245] 0 197245 1018 885 393216 0 -998 umount.nfs
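The rss column in the kernel's OOM task dump is counted in pages, not kilobytes, so the numbers above look deceptively small. The sketch below converts them, assuming the 64 KB page size that time -v reports on the ARM64 host later in this post; the regular expression and function name are my own.

```python
import re

PAGE_SIZE_KB = 64  # assumption: 64 KB pages, as reported by time -v on ARM64

OOM_ROWS = """\
[Fri Aug 18 15:46:16 2023] [197236] 0 197236 84 41 393216 0 -998 umount
[Fri Aug 18 15:46:16 2023] [197237] 0 197237 84 41 393216 0 -998 umount
[Fri Aug 18 15:46:16 2023] [197239] 0 197239 84 41 327680 0 -998 umount
[Fri Aug 18 15:46:16 2023] [197240] 0 197240 1017 834 327680 0 -998 umount.nfs
[Fri Aug 18 15:46:16 2023] [197242] 0 197242 1017 702 393216 0 -998 umount.nfs
[Fri Aug 18 15:46:16 2023] [197243] 0 197243 1017 702 393216 0 -998 umount.nfs
[Fri Aug 18 15:46:16 2023] [197245] 0 197245 1018 885 393216 0 -998 umount.nfs
"""

# Columns: [pid] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
ROW = re.compile(
    r"\[\s*(\d+)\]\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(-?\d+)\s+(\S+)$"
)


def rss_kb_by_process(text, page_size_kb):
    """Return (pid, name, rss in KB) for every task row in an OOM dump."""
    rows = []
    for line in text.splitlines():
        match = ROW.search(line)
        if match:
            pid = int(match.group(1))
            rss_pages = int(match.group(5))
            rows.append((pid, match.group(9), rss_pages * page_size_kb))
    return rows


rows = rss_kb_by_process(OOM_ROWS, PAGE_SIZE_KB)
total_kb = sum(rss for _, _, rss in rows)

if __name__ == "__main__":
    for pid, name, rss in rows:
        print(f"{pid} {name:<11} {rss / 1024:.1f} MiB")
    print(f"total {total_kb / 1024:.1f} MiB")
```

With 64 KB pages, the 834 pages of the first umount.nfs process are about 52 MiB, and the seven processes together hold roughly 203 MiB.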
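I then exec'ed into the nfs container of csi-nfs-node to watch memory usage. The tests showed that memory usage of the programs inside the nfs container spiked, the programs were killed, and the in-flight mount and umount operations failed along with them. Watching top during the test, one umount took about 53 MB of memory and one mount about 100 MB.
With the nfs container's limit set to 100MiB and two NFS volumes mounted into one Pod, the problem reproduced reliably.
Deleting a Pod requires a umount. When the container is killed halfway through the umount, the Pod stays in Terminating and blocks the Deployment's rolling update; both the nfs container log and the node's kubelet log show the volume unmount failing.
The quick fix is to force-delete the Pod so that the rolling update can continue: kubectl delete pod --force xxx.
Creating a Pod requires a mount. When the container is killed halfway through the mount, the Pod stays in ContainerCreating.
Repeated mount failures appear to trigger the delayed-retry mechanism, so the Pod cannot recover from ContainerCreating quickly. Deleting the csi-nfs-node Pod does not help here, because the retry is tracked for the workload Pod rather than for the CSI driver; the log contains durationBeforeRetry entries.
The quick fix is to delete the Pod so that a new one is created: kubectl delete pod xxx.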
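To get a feel for why a Pod stays stuck after several failed mounts, here is a minimal sketch of an exponential backoff schedule. The initial delay of 0.5 s and the cap of 2 min 2 s are what I recall of kubelet's volume operation backoff and should be treated as assumptions; check the kubelet source for your version.

```python
def backoff_schedule(failures, initial=0.5, cap=122.0):
    """Return the wait (in seconds) imposed after each consecutive failure.

    initial and cap are assumed defaults, not values read from the cluster.
    """
    delays = []
    delay = initial
    for _ in range(failures):
        delays.append(delay)
        delay = min(delay * 2, cap)  # double the wait until the cap is reached
    return delays


if __name__ == "__main__":
    schedule = backoff_schedule(10)
    print(schedule)
    print("total wait:", sum(schedule), "seconds")
```

Under these assumptions the delay hits the cap after eight failures, so a Pod that has already failed many mounts waits about two minutes between attempts, whereas a freshly created Pod starts again from the shortest delay.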
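However, an interrupted unmount leaves Pod remnants behind. When kubelet cleans up such an orphaned Pod, it reports that a volume under the Pod's volumes directory is still mounted, so the directory cannot be removed.
Judging from the logs, the NFS CSI plugin does not seem to handle this failure case correctly: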
Aug 19 23:51:30 prhxhu5im6b4he kubelet[1596349]: E0819 23:51:30.898516 1596349 kubelet_volumes.go:179] orphaned pod "0dc214d7-f193-4ad9-ba9f-3eb8959d595c" found, but failed to rmdir() volume at path /var/lib/kubelet/pods/0dc214d7-f193-4ad9-ba9f-3eb8959d595c/volumes/kubernetes.io~csi/pvc-aee1cadb-8a63-4152-9dc7-f2c840019851: directory not empty : There were a total of 12 errors similar to this. Turn up verbosity to see them.
Aug 19 23:51:31 prhxhu5im6b4he kubelet[1596349]: I0819 23:51:31.107416 1596349 operation_generator.go:657] MountVolume.SetUp succeeded for volume "pvc-af7c2117-da24-4388-aec3-6a9a7cdc765f" (UniqueName: "kubernetes.io/csi/nfs.csi.k8s.io^10.255.176.233/share_72adb8c2b/pvc-af7c2117-da24-4388-aec3-6a9a7cdc765f") pod "nginx-01-68467b8cc5-qsp5k" (UID: "8054e2c8-1a06-4683-8f69-23a22729fd71")
Aug 19 23:51:32 prhxhu5im6b4he kubelet[1596349]: E0819 23:51:32.887278 1596349 kubelet_volumes.go:179] orphaned pod "0dc214d7-f193-4ad9-ba9f-3eb8959d595c" found, but failed to rmdir() volume at path /var/lib/kubelet/pods/0dc214d7-f193-4ad9-ba9f-3eb8959d595c/volumes/kubernetes.io~csi/pvc-aee1cadb-8a63-4152-9dc7-f2c840019851: directory not empty : There were a total of 12 errors similar to this. Turn up verbosity to see them.
Aug 19 23:51:34 prhxhu5im6b4he kubelet[1596349]: E0819 23:51:34.898730 1596349 kubelet_volumes.go:179] orphaned pod "0dc214d7-f193-4ad9-ba9f-3eb8959d595c" found, but failed to rmdir() volume at path /var/lib/kubelet/pods/0dc214d7-f193-4ad9-ba9f-3eb8959d595c/volumes/kubernetes.io~csi/pvc-aee1cadb-8a63-4152-9dc7-f2c840019851: directory not empty : There were a total of 12 errors similar to this. Turn up verbosity to see them.
Aug 19 23:51:36 prhxhu5im6b4he kubelet[1596349]: E0819 23:51:36.899249 1596349 kubelet_volumes.go:179] orphaned pod "0dc214d7-f193-4ad9-ba9f-3eb8959d595c" found, but failed to rmdir() volume at path /var/lib/kubelet/pods/0dc214d7-f193-4ad9-ba9f-3eb8959d595c/volumes/kubernetes.io~csi/pvc-aee1cadb-8a63-4152-9dc7-f2c840019851: directory not empty : There were a total of 12 errors similar to this. Turn up verbosity to see them.
Aug 19 23:51:38 prhxhu5im6b4he kubelet[1596349]: E0819 23:51:38.899951 1596349 kubelet_volumes.go:179] orphaned pod "0dc214d7-f193-4ad9-ba9f-3eb8959d595c" found, but failed to rmdir() volume at path /var/lib/kubelet/pods/0dc214d7-f193-4ad9-ba9f-3eb8959d595c/volumes/kubernetes.io~csi/pvc-aee1cadb-8a63-4152-9dc7-f2c840019851: directory not empty : There were a total of 12 errors similar to this. Turn up verbosity to see them.
To clean up, export the kubelet log, grep the rmdir errors, run umount on each /var/lib/kubelet/pods/xxx/volumes/kubernetes.io~csi/pvc-xxx/mount path they point to, and finally remove the directory with rmdir.
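The cleanup can be scripted roughly as follows. This sketch only extracts the paths from an exported kubelet log and prints the commands, it does not run them; the function names are my own, and appending /mount to the logged directory follows the path layout described above.

```python
import re
import shlex

# Matches kubelet's "orphaned pod" error and captures the volume directory.
RMDIR_ERROR = re.compile(
    r"failed to rmdir\(\) volume at path (\S+?): directory not empty"
)


def orphaned_volume_dirs(log_text):
    """Return the unique volume directories named in rmdir errors, in log order."""
    dirs = []
    for match in RMDIR_ERROR.finditer(log_text):
        path = match.group(1)
        if path not in dirs:
            dirs.append(path)
    return dirs


def cleanup_commands(volume_dirs):
    """Build (but do not execute) umount and rmdir commands for each leftover mount."""
    commands = []
    for volume_dir in volume_dirs:
        mount_point = shlex.quote(volume_dir + "/mount")
        commands.append(f"umount {mount_point}")
        commands.append(f"rmdir {mount_point}")
    return commands


SAMPLE_LOG = """\
Aug 19 23:51:30 prhxhu5im6b4he kubelet[1596349]: E0819 23:51:30.898516 1596349 kubelet_volumes.go:179] orphaned pod "0dc214d7-f193-4ad9-ba9f-3eb8959d595c" found, but failed to rmdir() volume at path /var/lib/kubelet/pods/0dc214d7-f193-4ad9-ba9f-3eb8959d595c/volumes/kubernetes.io~csi/pvc-aee1cadb-8a63-4152-9dc7-f2c840019851: directory not empty : There were a total of 12 errors similar to this. Turn up verbosity to see them.
Aug 19 23:51:32 prhxhu5im6b4he kubelet[1596349]: E0819 23:51:32.887278 1596349 kubelet_volumes.go:179] orphaned pod "0dc214d7-f193-4ad9-ba9f-3eb8959d595c" found, but failed to rmdir() volume at path /var/lib/kubelet/pods/0dc214d7-f193-4ad9-ba9f-3eb8959d595c/volumes/kubernetes.io~csi/pvc-aee1cadb-8a63-4152-9dc7-f2c840019851: directory not empty : There were a total of 12 errors similar to this. Turn up verbosity to see them.
"""

if __name__ == "__main__":
    for command in cleanup_commands(orphaned_volume_dirs(SAMPLE_LOG)):
        print(command)
```

Note that kubelet only prints one representative path per sync ("a total of 12 errors similar to this"), so the log may need to be collected over several rounds, or with higher verbosity, to see every leftover directory.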
Next, I continued the analysis with time and strace.
Mounting the NFS volume
[root@prj4diysa6nosv ~]# /usr/bin/time -v mount -t nfs -o vers=3,nolock 10.255.176.233:/share_72adb8c2b /mnt
Command being timed: "mount -t nfs -o vers=3,nolock 10.255.176.233:/share_72adb8c2b /mnt"
User time (seconds): 0.04
System time (seconds): 0.07
Percent of CPU this job got: 3%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.14
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 204608
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 3358
Voluntary context switches: 25
Involuntary context switches: 1
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 65536
Exit status: 0
Unmounting the NFS volume
[root@prj4diysa6nosv ~]# /usr/bin/time -v umount /mnt
Command being timed: "umount /mnt"
User time (seconds): 0.03
System time (seconds): 0.04
Percent of CPU this job got: 2%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.19
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 204608
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 3370
Voluntary context switches: 17
Involuntary context switches: 1
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 65536
Exit status: 0
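A single operation really does use about 200 MB of memory (204608 KB), and the output shows a PAGESIZE of 64KB.
I switched to an AMD64 system and repeated the mount test: memory usage was only 3984KB, with a PAGESIZE of 4KB.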
➜ ~ /usr/bin/time -v mount -t nfs -o vers=3,nolock,proto=tcp 192.168.123.2:/local-zfs/data /tmp/nfs
Command being timed: "mount -t nfs -o vers=3,nolock,proto=tcp 192.168.123.2:/local-zfs/data /tmp/nfs"
User time (seconds): 0.00
System time (seconds): 0.00
Percent of CPU this job got: 22%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3984
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 416
Voluntary context switches: 26
Involuntary context switches: 3
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
➜ ~ /usr/bin/time -v umount /tmp/nfs
Command being timed: "umount /tmp/nfs"
User time (seconds): 0.00
System time (seconds): 0.00
Percent of CPU this job got: 36%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.01
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3880
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 2
Minor (reclaiming a frame) page faults: 427
Voluntary context switches: 18
Involuntary context switches: 1
Swaps: 0
File system inputs: 80
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
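To compare the two runs on equal terms, the peak resident set size can be converted into a page count. The sketch below parses only the two relevant lines of the time -v output; the helper name peak_usage is my own.

```python
import re


def peak_usage(time_v_output):
    """Return (max RSS in KB, page size in bytes, max RSS in pages)."""
    rss_kb = int(
        re.search(r"Maximum resident set size \(kbytes\): (\d+)", time_v_output).group(1)
    )
    page_bytes = int(
        re.search(r"Page size \(bytes\): (\d+)", time_v_output).group(1)
    )
    return rss_kb, page_bytes, rss_kb * 1024 // page_bytes


# Relevant lines copied from the two mount runs above.
ARM64_MOUNT = """\
Maximum resident set size (kbytes): 204608
Page size (bytes): 65536
"""
AMD64_MOUNT = """\
Maximum resident set size (kbytes): 3984
Page size (bytes): 4096
"""

arm64 = peak_usage(ARM64_MOUNT)
amd64 = peak_usage(AMD64_MOUNT)

if __name__ == "__main__":
    print("ARM64 pages:", arm64[2], "AMD64 pages:", amd64[2])
    print("RSS ratio: %.1f" % (arm64[0] / amd64[0]))
    print("page size ratio:", arm64[1] // amd64[1])
```

The ARM64 run peaks at 3197 pages and the AMD64 run at 996 pages, so the roughly 51-fold difference in memory comes from pages that are 16 times larger combined with about three times as many resident pages. The two hosts and NFS servers differ, so this is an indication rather than a controlled comparison.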
Then I used strace to print the system calls:
[root@prj4diysa6nosv ~]# strace mount -t nfs -o vers=3,nolock 10.255.176.233:/share_72adb8c2b /mnt
execve("/usr/bin/mount", ["mount", "-t", "nfs", "-o", "vers=3,nolock", "10.255.176.233:/share_72adb8c2b", "/mnt"], 0xffffcc280630 /* 28 vars */) = 0
brk(NULL) = 0xaaaaf2d40000
faccessat(AT_FDCWD, "/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=20812, ...}) = 0
mmap(NULL, 20812, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffffb73f0000
close(3) = 0
openat(AT_FDCWD, "/lib64/libmount.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\267\0\1\0\0\0\260\317\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=403992, ...}) = 0
mmap(NULL, 464112, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xffffb7330000
mmap(0xffffb7390000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x50000) = 0xffffb7390000
close(3) = 0
openat(AT_FDCWD, "/lib64/libblkid.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\267\0\1\0\0\0\0\250\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=339288, ...}) = 0
mmap(NULL, 399088, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xffffb72c0000
mmap(0xffffb7310000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x40000) = 0xffffb7310000
close(3) = 0
openat(AT_FDCWD, "/lib64/libuuid.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\267\0\1\0\0\0\260\27\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=70080, ...}) = 0
mmap(NULL, 131096, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xffffb7290000
mmap(0xffffb72a0000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0) = 0xffffb72a0000
mmap(0xffffb72b0000, 24, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xffffb72b0000
close(3) = 0
openat(AT_FDCWD, "/lib64/librt.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\267\0\1\0\0\0\220\36\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=136656, ...}) = 0
mmap(NULL, 131872, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xffffb7260000
mmap(0xffffb7270000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0) = 0xffffb7270000
close(3) = 0
openat(AT_FDCWD, "/lib64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\267\0\1\0\0\0\20o\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=344872, ...}) = 0
mmap(NULL, 271744, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xffffb7210000
mmap(0xffffb7240000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x20000) = 0xffffb7240000
close(3) = 0
openat(AT_FDCWD, "/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0\267\0\1\0\0\0000\r\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=4097512, ...}) = 0
mmap(NULL, 1527680, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xffffb7090000
mmap(0xffffb71f0000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x150000) = 0xffffb71f0000
close(3) = 0
openat(AT_FDCWD, "/lib64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0\267\0\1\0\0\0\0d\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=531384, ...}) = 0
mmap(NULL, 214016, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xffffb7050000
mmap(0xffffb7070000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x10000) = 0xffffb7070000
close(3) = 0
openat(AT_FDCWD, "/lib64/libpcre2-8.so.0", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\267\0\1\0\0\0 $\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=540032, ...}) = 0
mmap(NULL, 590272, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xffffb6fb0000
mmap(0xffffb7030000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x70000) = 0xffffb7030000
close(3) = 0
openat(AT_FDCWD, "/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\267\0\1\0\0\0P\17\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=93528, ...}) = 0
mmap(NULL, 131320, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xffffb6f80000
mmap(0xffffb6f90000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0) = 0xffffb6f90000
close(3) = 0
mprotect(0xffffb71f0000, 65536, PROT_READ) = 0
mprotect(0xffffb6f90000, 65536, PROT_READ) = 0
mprotect(0xffffb7070000, 65536, PROT_READ) = 0
mprotect(0xffffb7030000, 65536, PROT_READ) = 0
mprotect(0xffffb7240000, 65536, PROT_READ) = 0
mprotect(0xffffb7270000, 65536, PROT_READ) = 0
mprotect(0xffffb72a0000, 65536, PROT_READ) = 0
mprotect(0xffffb7310000, 65536, PROT_READ) = 0
mprotect(0xffffb7390000, 65536, PROT_READ) = 0
mprotect(0xaaaadce70000, 65536, PROT_READ) = 0
mprotect(0xffffb7400000, 65536, PROT_READ) = 0
munmap(0xffffb73f0000, 20812) = 0
set_tid_address(0xffffb7415f50) = 1137508
set_robust_list(0xffffb7415f60, 24) = 0
rt_sigaction(SIGRTMIN, {sa_handler=0xffffb7055e80, sa_mask=[], sa_flags=SA_SIGINFO}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {sa_handler=0xffffb7055f40, sa_mask=[], sa_flags=SA_RESTART|SA_SIGINFO}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
statfs("/sys/fs/selinux", {f_type=SELINUX_MAGIC, f_bsize=65536, f_blocks=0, f_bfree=0, f_bavail=0, f_files=0, f_ffree=0, f_fsid={val=[0, 0]}, f_namelen=255, f_frsize=65536, f_flags=ST_VALID|ST_RELATIME}) = 0
statfs("/sys/fs/selinux", {f_type=SELINUX_MAGIC, f_bsize=65536, f_blocks=0, f_bfree=0, f_bavail=0, f_files=0, f_ffree=0, f_fsid={val=[0, 0]}, f_namelen=255, f_frsize=65536, f_flags=ST_VALID|ST_RELATIME}) = 0
brk(NULL) = 0xaaaaf2d40000
brk(0xaaaaf2d70000) = 0xaaaaf2d70000
faccessat(AT_FDCWD, "/etc/selinux/config", F_OK) = 0
openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=2997, ...}) = 0
read(3, "# Locale name alias data base.\n#"..., 8192) = 2997
read(3, "", 8192) = 0
close(3) = 0
openat(AT_FDCWD, "/usr/lib/locale/en_US.UTF-8/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/locale/en_US.utf8/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=368, ...}) = 0
mmap(NULL, 368, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffffb73f0000
close(3) = 0
openat(AT_FDCWD, "/usr/lib64/gconv/gconv-modules.cache", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=26998, ...}) = 0
mmap(NULL, 26998, PROT_READ, MAP_SHARED, 3, 0) = 0xffffb6f70000
close(3) = 0
futex(0xffffb7201768, FUTEX_WAKE_PRIVATE, 2147483647) = 0
openat(AT_FDCWD, "/usr/lib/locale/en_US.UTF-8/LC_MEASUREMENT", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/locale/en_US.utf8/LC_MEASUREMENT", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=23, ...}) = 0
mmap(NULL, 23, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffffb6f60000
close(3) = 0
openat(AT_FDCWD, "/usr/lib/locale/en_US.UTF-8/LC_TELEPHONE", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/locale/en_US.utf8/LC_TELEPHONE", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=59, ...}) = 0
mmap(NULL, 59, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffffb6f50000
close(3) = 0
openat(AT_FDCWD, "/usr/lib/locale/en_US.UTF-8/LC_ADDRESS", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/locale/en_US.utf8/LC_ADDRESS", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=167, ...}) = 0
mmap(NULL, 167, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffffb6f40000
close(3) = 0
openat(AT_FDCWD, "/usr/lib/locale/en_US.UTF-8/LC_NAME", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/locale/en_US.utf8/LC_NAME", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=77, ...}) = 0
mmap(NULL, 77, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffffb6f30000
close(3) = 0
openat(AT_FDCWD, "/usr/lib/locale/en_US.UTF-8/LC_PAPER", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/locale/en_US.utf8/LC_PAPER", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=34, ...}) = 0
mmap(NULL, 34, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffffb6f20000
close(3) = 0
openat(AT_FDCWD, "/usr/lib/locale/en_US.UTF-8/LC_MESSAGES", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/locale/en_US.utf8/LC_MESSAGES", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFDIR|0755, st_size=29, ...}) = 0
close(3) = 0
openat(AT_FDCWD, "/usr/lib/locale/en_US.utf8/LC_MESSAGES/SYS_LC_MESSAGES", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=57, ...}) = 0
mmap(NULL, 57, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffffb6f10000
close(3) = 0
openat(AT_FDCWD, "/usr/lib/locale/en_US.UTF-8/LC_MONETARY", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/locale/en_US.utf8/LC_MONETARY", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=286, ...}) = 0
mmap(NULL, 286, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffffb6f00000
close(3) = 0
openat(AT_FDCWD, "/usr/lib/locale/en_US.UTF-8/LC_COLLATE", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/locale/en_US.utf8/LC_COLLATE", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=2586930, ...}) = 0
mmap(NULL, 2586930, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffffb6c80000
close(3) = 0
openat(AT_FDCWD, "/usr/lib/locale/en_US.UTF-8/LC_TIME", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/locale/en_US.utf8/LC_TIME", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=3316, ...}) = 0
mmap(NULL, 3316, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffffb6c70000
close(3) = 0
openat(AT_FDCWD, "/usr/lib/locale/en_US.UTF-8/LC_NUMERIC", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/locale/en_US.utf8/LC_NUMERIC", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=54, ...}) = 0
mmap(NULL, 54, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffffb6c60000
close(3) = 0
openat(AT_FDCWD, "/usr/lib/locale/en_US.UTF-8/LC_CTYPE", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/locale/en_US.utf8/LC_CTYPE", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=337024, ...}) = 0
mmap(NULL, 337024, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffffb6c00000
close(3) = 0
getuid() = 0
geteuid() = 0
newfstatat(AT_FDCWD, "/mnt", {st_mode=S_IFDIR|0755, st_size=6, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sbin/mount.nfs", {st_mode=S_IFREG|S_ISUID|0755, st_size=208136, ...}, 0) = 0
getuid() = 0
geteuid() = 0
getgid() = 0
getegid() = 0
prctl(PR_GET_DUMPABLE) = 1 (SUID_DUMP_USER)
newfstatat(AT_FDCWD, "/run", {st_mode=S_IFDIR|0755, st_size=900, ...}, 0) = 0
newfstatat(AT_FDCWD, "/run/mount/utab", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/run/mount/utab", {st_mode=S_IFREG|0644, st_size=0, ...}, 0) = 0
geteuid() = 0
getegid() = 0
getuid() = 0
getgid() = 0
faccessat(AT_FDCWD, "/run/mount/utab", R_OK|W_OK) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xffffb7415f50) = 1137509
wait4(1137509, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 1137509
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1137509, si_uid=0, si_status=0, si_utime=0, si_stime=13} ---
close(1) = 0
close(2) = 0
exit_group(0) = ?
+++ exited with 0 +++
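The output showed two attempts to connect to the NFS server, apparently probing which transport protocol the server supports, and the unmount produced similar output. So I tried adding proto=tcp to the mount options, and memory usage dropped by half.
It looks as if some buffers allocated during the handshake take up too much space. The NFS module is built into the kernel, though, the in-house ARM64 and AMD64 kernels are the same version, and the container product uses the same kernel parameters on both architectures. The most likely cause is therefore the oversized PAGESIZE of the ARM64 kernel, which inflates every memory-related operation (the NFS kernel code allocates memory in units of PAGESIZE in many places).
PAGESIZE is fixed when the kernel is compiled, and the kernel is customized by the OS team. My team lead discussed it with the head of the OS team, who said they would keep things as they are and gave no clear reason...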
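To sum up:
- Identified why volume mounts fail on ARM64: the customer's operating system sets PAGESIZE to 64KB, far larger than the usual 4KB, hence the high memory usage
- Adjusted the mount options used when creating NFS storage classes by adding proto=tcp, which halves the memory usage of mount.nfs on ARM64
- Removed the memory limit of csi-nfs-node so that it can use more memory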
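As an illustration of the second item, a storage class with the extra mount option could be generated like this. It is a sketch, not the manifest our console produces: the class name nfs-tcp is hypothetical, the provisioner name is the one visible in the kubelet log above, and the server/share parameter names are those of the upstream NFS CSI driver, which I assume apply here.

```python
import json


def nfs_storage_class(name, server, share, mount_options):
    """Build a StorageClass manifest (as a dict) for an NFS CSI provisioner."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "nfs.csi.k8s.io",
        # Assumed parameter names, following the upstream NFS CSI driver.
        "parameters": {"server": server, "share": share},
        "mountOptions": list(mount_options),
    }


manifest = nfs_storage_class(
    name="nfs-tcp",  # hypothetical name
    server="10.255.176.233",
    share="/share_72adb8c2b",
    mount_options=["vers=3", "nolock", "proto=tcp"],
)

if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(manifest, indent=2))
```

Mount options in a storage class only affect volumes provisioned after the change, so existing PVs keep whatever options they were created with.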
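After I had tracked down and mitigated the ARM64 memory problem, NFS volumes still failed to mount, and the errors again showed up in the log of the nfs container in csi-nfs-node.
Logging in to the affected ECS node and running dmesg -T showed a large number of kernel: nfs: server xxxxx not responding, still retrying messages.
After a long investigation, we finally confirmed that two of the four storage servers in the customer environment were faulty: their IPs responded to ping, but mounting an NFS share from them with the mount command would hang. The problem was resolved by restarting the ganesha service on the faulty nodes.
Before this was confirmed, I wrestled with the storage team's SRE for two full weeks. They refused technical support with countless "improper usage" excuses (kernel parameters, mount options, usage patterns, and so on), and throughout the whole process never brought in a single storage developer familiar with the NFS protocol. It was quite an ordeal.
PAGESIZE is generally determined by the processor architecture and the operating system. For example, an M1 Pro running macOS 14.5 uses 16KB: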
➜ ~ uname -sr
Darwin 23.5.0
➜ ~ getconf PAGE_SIZE
16384
Linux on AMD64 generally uses 4KB, and an Oracle ARM64 server I have on hand also uses 4KB:
➜ ~ uname -sr
Linux 6.5.0-1021-oracle
➜ ~ getconf PAGE_SIZE
4096
➜ ~ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: ARM
Model name: Neoverse-N1
......
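The same check can be done from a program, which is handy when collecting facts from many nodes at once. A minimal sketch using only the Python standard library (Unix-like systems only):

```python
import os
import platform
import resource


def page_size_info():
    """Return the OS name, machine architecture, and page size in bytes."""
    return {
        "system": platform.system(),
        "machine": platform.machine(),
        "page_size": os.sysconf("SC_PAGE_SIZE"),
    }


if __name__ == "__main__":
    info = page_size_info()
    print(f"{info['system']} {info['machine']}: {info['page_size']} bytes per page")
    # resource.getpagesize() should report the same value.
    print("resource.getpagesize():", resource.getpagesize())
```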
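Meanwhile, the company's OS kernel went from 4.18 to 5.15 and PAGE_SIZE is still 64KB. I can only guess that it is a forced choice to accommodate domestic processors, or that some early stopgap has simply been carried along to this day. It seems the mount.nfs problem will never be solved.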