Atlantis

4. NFS Volume Mount Failures

1. Preface

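When I took over the container project, persistent storage in the container clusters was in chaos:

  1. The console allowed creating three kinds of storage classes (Local-PV, NFS, and Ceph), but there was no documentation for any of them
  2. Customers who created a Local-PV PVC found it unusable and filed incident tickets: the cluster had no driver to automatically consume local-disk PVs
  3. Customers who wanted to store data on NFS created file storage in the storage product, but after adding the storage class through the container cluster console they found that NFS PVs could not be mounted

In short, cluster storage was almost unusable (it is hard to imagine how the previous on-site colleagues dealt with customers). After I put a great deal of effort into writing documentation and improving the console workflow, cluster storage finally reached a usable state: NFS as the primary option, Local-PV as an optional one, and later a newly developed CSI driver for the block storage used by ECS.

I thought the storage features had stabilized, but then we stumbled badly on ARM64. This post records the troubleshooting process.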

2. STAR

2.1 Situation

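A customer reported that a Deployment could not be updated. Initial investigation showed:

  1. The Deployment's update strategy was RollingUpdate, and its Pod template mounted an NFS volume
  2. The old Pod was stuck in Terminating and could not be deleted
  3. The new Pod was stuck in ContainerCreating, and describe pod showed that mounting the NFS volume had timed out
  4. Changing the resources of csi-nfs-node or restarting the csi-nfs-node Pod did not fix the problem (the workaround used in the past also required rebooting the node, which the customer would not accept)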

2.2 Task

  1. First, provide a temporary workaround
  2. Work with colleagues from the storage department to locate the root cause

2.3 Action

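A volume goes through the lifecycle shown below. If unmounting the volume fails, Pod teardown is blocked and the Pod stays in Terminating; if mounting fails, Pod creation is blocked and the Pod stays in ContainerCreating.

[Figure: volume lifecycle]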

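For NFS volumes, mount and umount are executed by the csi-nfs-node Pod running on each node. It performs the mount/unmount inside its container, and the mount directory is bound to the host directory with bidirectional propagation. Once the file system is mounted inside the container, the host can bind that directory to the specified path in the workload container that uses the NFS volume.

I asked the on-site colleague for remote access. While checking the kernel log of the node hosting the abnormal Pod and the historical logs of the csi-nfs-node container, I found many records of mount.nfs being killed by the system, all of them due to OOM kills.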
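The OOM records in a kernel log can be summarized per victim process with a small filter. This is a minimal sketch, assuming the kernel prints its usual "Killed process <pid> (<name>)" line; the function name count_oom_victims is made up for illustration.

```shell
# Count OOM-killed processes by name.
# Reads kernel log text on stdin (for example from `dmesg -T` or a saved file)
# and prints one "<process name> <count>" line per victim.
count_oom_victims() {
    # Keep only the process name found in "Killed process <pid> (<name>)".
    sed -n 's/.*Killed process [0-9][0-9]* (\([^)]*\)).*/\1/p' \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}
```

A possible invocation on the affected node would be: dmesg -T | count_oom_victims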

The NFS storage class used by the customer configured only nolock and vers=3 as mount options. The options actually negotiated were:

rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.255.176.233,mountvers=3,mountport=8752,mountproto=tcp,local_lock=all,addr=10.255.176.233
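When comparing negotiated options across nodes, it helps to pull a single key out of such a string. The helper below is a sketch; mount_opt is a hypothetical name, and it only handles key=value options.

```shell
# Print the value of one key from a comma-separated mount option string.
# $1 = option string, $2 = key. Flag-style options (rw, hard, nolock) print nothing.
mount_opt() {
    printf '%s\n' "$1" \
        | tr ',' '\n' \
        | awk -F= -v key="$2" '$1 == key && NF > 1 { print $2 }'
}

opts='rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.255.176.233,mountvers=3,mountport=8752,mountproto=tcp,local_lock=all,addr=10.255.176.233'
mount_opt "$opts" proto   # transport used for NFS traffic
mount_opt "$opts" rsize   # negotiated read size in bytes
```

On a live node the option string could be taken from /proc/mounts instead of being pasted in.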

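Each NFS volume is mounted at the path of the corresponding Volume in the Pod:

[Image: NFS mounts at the Pod volume paths]

When the same NFS volume is used by N Pods, there are N mounts/unmounts, but the kernel uses a single connection to transfer data. In the example below, dd writes to the NFS storage from two Pods, yet only one TCP connection appears on the node.

[Image: a single TCP connection to the NFS server on the node]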

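Because of network restrictions at the customer site, I could only collect data and continue the analysis back at the company. To reduce the impact of OOM on customer workloads, we first removed the CPU and memory limits from the csi-nfs-node component in the customer cluster so that it could use more memory.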
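One way to drop the limits without editing the manifest by hand is a JSON patch. The sketch below only builds the patch body; the helper name remove_limits_patch, the kube-system namespace, the DaemonSet kind, and the container index are all assumptions that need to be checked against the actual deployment.

```shell
# Build a JSON patch that removes resources.limits from one container
# of a workload's Pod template.
# $1 = index of the container in .spec.template.spec.containers (assumed).
remove_limits_patch() {
    printf '[{"op":"remove","path":"/spec/template/spec/containers/%s/resources/limits"}]\n' "$1"
}

# Possible usage (not executed here; namespace, kind, and index are assumptions):
#   kubectl -n kube-system patch daemonset csi-nfs-node \
#       --type=json -p "$(remove_limits_patch 2)"
```

A JSON patch "remove" fails if the path does not exist, so the command is safe to try on a container that has no limits set: it simply reports an error and changes nothing.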

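Next, the problem had to be reproduced in-house. The customer site used Phytium (飞腾) processors and the Kylin (麒麟) operating system, and the reproduction also needed the file storage provided by the storage team. After some wrangling I finally found a suitable private cloud environment for testing.

I created an nginx Deployment that mounts one NFS volume, set replicas to 6 and the update strategy to Recreate, and each time the Pods were running stably I ran a restart to trigger recreation.

[Image: Pods during the restart test]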

Pod teardown gets stuck in Terminating. Checking the csi-nfs-node Pod confirmed an OOM; the node log shows:

hmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:193152KB inactive_file:0KB active_file:0KB unevictable:0KB
[Fri Aug 18 15:46:16 2023] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Fri Aug 18 15:46:16 2023] [197236]     0 197236       84       41   393216        0          -998 umount
[Fri Aug 18 15:46:16 2023] [197237]     0 197237       84       41   393216        0          -998 umount
[Fri Aug 18 15:46:16 2023] [197239]     0 197239       84       41   327680        0          -998 umount
[Fri Aug 18 15:46:16 2023] [197240]     0 197240     1017      834   327680        0          -998 umount.nfs
[Fri Aug 18 15:46:16 2023] [197242]     0 197242     1017      702   393216        0          -998 umount.nfs
[Fri Aug 18 15:46:16 2023] [197243]     0 197243     1017      702   393216        0          -998 umount.nfs
[Fri Aug 18 15:46:16 2023] [197245]     0 197245     1018      885   393216        0          -998 umount.nfs

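Next I exec'd into the nfs container of csi-nfs-node to check memory usage. The test results show that memory usage of the programs inside the nfs container spiked, so the processes were killed and the mount and umount operations failed along with them. Monitoring with top during the test showed that one umount operation takes about 53MB of memory and one mount operation about 100MB.

[Images: top output showing the memory usage of umount and mount]

After changing the nfs container's limit to 100MiB and mounting two NFS volumes in one Pod, the problem reproduces reliably.

[Image: reproduction with a 100MiB limit]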

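Deleting a Pod requires umount. If the container is killed halfway through umount, the Pod stays in Terminating and blocks the Deployment's rolling update. Both the nfs container log and the node's kubelet log show the volume unmount failing.

[Images: unmount errors in the nfs container log and the kubelet log]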

A quick way to recover is to force-delete the Pod so that the rolling update can continue: kubectl delete pod --force xxx
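When many Pods are stuck, the names can be picked out of the default kubectl get pods table. This is a sketch that assumes the default column layout (NAME, READY, STATUS, ...); terminating_pods is a made-up helper name.

```shell
# Print the names of Pods whose STATUS column is "Terminating".
# Reads the default `kubectl get pods` table on stdin.
terminating_pods() {
    # Skip the header row; column 3 of the default output is STATUS.
    awk 'NR > 1 && $3 == "Terminating" { print $1 }'
}

# Possible usage (not executed here; <ns> is a placeholder):
#   kubectl get pods -n <ns> | terminating_pods
#   kubectl delete pod -n <ns> --force --grace-period=0 <name>
```

Force deletion removes the Pod object from the API server without waiting for kubelet to confirm teardown, so anything still mounted on the node stays there until it is cleaned up separately.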

Creating a Pod requires mount. If the container is killed halfway through mount, the Pod stays in ContainerCreating.

[Image: Pod stuck in ContainerCreating]

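Repeated mount failures appear to trigger a delayed-retry (backoff) mechanism, so a Pod in ContainerCreating cannot recover quickly. Deleting the csi-nfs-node Pod does not help at this point, because the retry is tracked for the workload Pod rather than for the CSI driver. The log contains durationBeforeRetry entries.

[Image: kubelet log with durationBeforeRetry entries]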
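To get a feel for why recovery is slow under a doubling retry delay, the schedule can be printed out. This is only an illustration of exponential backoff in general; the 500 ms starting delay and the cap of about two minutes are assumptions, not values confirmed for this kubelet version, and backoff_schedule is a hypothetical helper.

```shell
# Print n retry delays in milliseconds: start at initial_ms,
# double after every failure, and never exceed cap_ms.
# $1 = initial_ms, $2 = cap_ms, $3 = n
backoff_schedule() {
    delay=$1
    cap=$2
    n=$3
    i=0
    out=""
    while [ "$i" -lt "$n" ]; do
        out="$out${out:+ }$delay"
        delay=$((delay * 2))
        if [ "$delay" -gt "$cap" ]; then
            delay=$cap
        fi
        i=$((i + 1))
    done
    printf '%s\n' "$out"
}

# Assumed values: 500 ms initial delay, 122000 ms cap.
backoff_schedule 500 122000 10
```

Under these assumptions the delay reaches the cap after eight consecutive failures, which would match the observation that a Pod which has failed many times waits minutes before the next attempt, while a freshly created Pod starts again from the shortest delay.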

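A quick way to recover is to delete the Pod so that a new one is created: kubectl delete pod xxx

However, the failed unmount leaves Pod remnants behind. When kubelet cleans up the orphaned Pod, it reports that the volumes under the Pod's volumes directory are still mounted, so the directory cannot be removed.

Judging from the log, the NFS CSI plugin does not seem to handle this failure case correctly: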

Aug 19 23:51:30 prhxhu5im6b4he kubelet[1596349]: E0819 23:51:30.898516 1596349 kubelet_volumes.go:179] orphaned pod "0dc214d7-f193-4ad9-ba9f-3eb8959d595c" found, but failed to rmdir() volume at path /var/lib/kubelet/pods/0dc214d7-f193-4ad9-ba9f-3eb8959d595c/volumes/kubernetes.io~csi/pvc-aee1cadb-8a63-4152-9dc7-f2c840019851: directory not empty : There were a total of 12 errors similar to this. Turn up verbosity to see them.
Aug 19 23:51:31 prhxhu5im6b4he kubelet[1596349]: I0819 23:51:31.107416 1596349 operation_generator.go:657] MountVolume.SetUp succeeded for volume "pvc-af7c2117-da24-4388-aec3-6a9a7cdc765f" (UniqueName: "kubernetes.io/csi/nfs.csi.k8s.io^10.255.176.233/share_72adb8c2b/pvc-af7c2117-da24-4388-aec3-6a9a7cdc765f") pod "nginx-01-68467b8cc5-qsp5k" (UID: "8054e2c8-1a06-4683-8f69-23a22729fd71")
Aug 19 23:51:32 prhxhu5im6b4he kubelet[1596349]: E0819 23:51:32.887278 1596349 kubelet_volumes.go:179] orphaned pod "0dc214d7-f193-4ad9-ba9f-3eb8959d595c" found, but failed to rmdir() volume at path /var/lib/kubelet/pods/0dc214d7-f193-4ad9-ba9f-3eb8959d595c/volumes/kubernetes.io~csi/pvc-aee1cadb-8a63-4152-9dc7-f2c840019851: directory not empty : There were a total of 12 errors similar to this. Turn up verbosity to see them.
Aug 19 23:51:34 prhxhu5im6b4he kubelet[1596349]: E0819 23:51:34.898730 1596349 kubelet_volumes.go:179] orphaned pod "0dc214d7-f193-4ad9-ba9f-3eb8959d595c" found, but failed to rmdir() volume at path /var/lib/kubelet/pods/0dc214d7-f193-4ad9-ba9f-3eb8959d595c/volumes/kubernetes.io~csi/pvc-aee1cadb-8a63-4152-9dc7-f2c840019851: directory not empty : There were a total of 12 errors similar to this. Turn up verbosity to see them.
Aug 19 23:51:36 prhxhu5im6b4he kubelet[1596349]: E0819 23:51:36.899249 1596349 kubelet_volumes.go:179] orphaned pod "0dc214d7-f193-4ad9-ba9f-3eb8959d595c" found, but failed to rmdir() volume at path /var/lib/kubelet/pods/0dc214d7-f193-4ad9-ba9f-3eb8959d595c/volumes/kubernetes.io~csi/pvc-aee1cadb-8a63-4152-9dc7-f2c840019851: directory not empty : There were a total of 12 errors similar to this. Turn up verbosity to see them.

Aug 19 23:51:38 prhxhu5im6b4he kubelet[1596349]: E0819 23:51:38.899951 1596349 kubelet_volumes.go:179] orphaned pod "0dc214d7-f193-4ad9-ba9f-3eb8959d595c" found, but failed to rmdir() volume at path /var/lib/kubelet/pods/0dc214d7-f193-4ad9-ba9f-3eb8959d595c/volumes/kubernetes.io~csi/pvc-aee1cadb-8a63-4152-9dc7-f2c840019851: directory not empty : There were a total of 12 errors similar to this. Turn up verbosity to see them.

To clean up, export the kubelet log, grep the rmdir errors, run umount on each reported path of the form /var/lib/kubelet/pods/xxx/volumes/kubernetes.io~csi/pvc-xxx/mount, and finally remove the directory with rmdir.
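The path extraction can be scripted. The sketch below matches the exact wording of the kubelet error shown above; orphan_volume_paths is a made-up name, and reading the log with journalctl assumes kubelet runs as a systemd unit.

```shell
# Extract the unique volume directories that kubelet failed to rmdir().
# Reads kubelet log text on stdin and prints one path per line.
orphan_volume_paths() {
    grep -F 'failed to rmdir() volume at path' \
        | sed -n 's|.*volume at path \(/var/lib/kubelet/pods/[^:]*\): .*|\1|p' \
        | sort -u
}

# Possible dry run (prints the commands instead of executing them):
#   journalctl -u kubelet | orphan_volume_paths | while read -r p; do
#       echo umount "$p/mount"
#       echo rmdir "$p/mount"
#   done
```

Each log line names only one path and adds "There were a total of 12 errors similar to this", so the remaining paths may only show up at a higher kubelet verbosity or by listing the orphaned Pod directories directly.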

Next, I continued the analysis with time and strace.

Mounting the NFS volume

[root@prj4diysa6nosv ~]#  /usr/bin/time -v mount -t nfs -o vers=3,nolock 10.255.176.233:/share_72adb8c2b /mnt
	Command being timed: "mount -t nfs -o vers=3,nolock 10.255.176.233:/share_72adb8c2b /mnt"
	User time (seconds): 0.04
	System time (seconds): 0.07
	Percent of CPU this job got: 3%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.14
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 204608
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 3358
	Voluntary context switches: 25
	Involuntary context switches: 1
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 65536
	Exit status: 0

Unmounting the NFS volume

[root@prj4diysa6nosv ~]# /usr/bin/time -v umount /mnt
	Command being timed: "umount /mnt"
	User time (seconds): 0.03
	System time (seconds): 0.04
	Percent of CPU this job got: 2%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.19
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 204608
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 3370
	Voluntary context switches: 17
	Involuntary context switches: 1
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 65536
	Exit status: 0

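A single operation really does use about 200MB of memory, and the output shows a page size (PAGESIZE) of 64KB.

I switched to an AMD64 system and tested the mount there: memory usage was only 3984KB, with a PAGESIZE of 4KB.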

➜  ~ /usr/bin/time -v mount -t nfs -o vers=3,nolock,proto=tcp 192.168.123.2:/local-zfs/data /tmp/nfs
	Command being timed: "mount -t nfs -o vers=3,nolock,proto=tcp 192.168.123.2:/local-zfs/data /tmp/nfs"
	User time (seconds): 0.00
	System time (seconds): 0.00
	Percent of CPU this job got: 22%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 3984
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 416
	Voluntary context switches: 26
	Involuntary context switches: 3
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
➜  ~ /usr/bin/time -v umount /tmp/nfs
	Command being timed: "umount /tmp/nfs"
	User time (seconds): 0.00
	System time (seconds): 0.00
	Percent of CPU this job got: 36%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.01
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 3880
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 2
	Minor (reclaiming a frame) page faults: 427
	Voluntary context switches: 18
	Involuntary context switches: 1
	Swaps: 0
	File system inputs: 80
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
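The two measurements are easier to compare when the peak RSS is expressed in pages. The following sketch parses the /usr/bin/time -v output shown above; rss_in_pages is a hypothetical helper name.

```shell
# Read `/usr/bin/time -v` output on stdin and print three numbers:
# peak RSS in KB, page size in bytes, and peak RSS expressed in pages.
rss_in_pages() {
    awk -F': ' '
        /Maximum resident set size/ { rss_kb = $2 }
        /Page size \(bytes\)/       { page = $2 }
        END {
            if (page > 0)
                printf "%d %d %d\n", rss_kb, page, rss_kb * 1024 / page
        }
    '
}

# ARM64 mount from the transcript: 204608 KB at 65536-byte pages.
printf '\tMaximum resident set size (kbytes): 204608\n\tPage size (bytes): 65536\n' | rss_in_pages
# AMD64 mount from the transcript: 3984 KB at 4096-byte pages.
printf '\tMaximum resident set size (kbytes): 3984\n\tPage size (bytes): 4096\n' | rss_in_pages
```

This works out to 3197 resident pages on ARM64 against 996 on AMD64. The page size differs by a factor of 16, but the peak RSS differs by a factor of roughly 51, so the ARM64 run also touches about three times as many pages. The two systems may not run the same userspace, so this comparison is only indicative.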

Then I printed the system calls with strace:

[root@prj4diysa6nosv ~]# strace mount -t nfs -o vers=3,nolock 10.255.176.233:/share_72adb8c2b /mnt
execve("/usr/bin/mount", ["mount", "-t", "nfs", "-o", "vers=3,nolock", "10.255.176.233:/share_72adb8c2b", "/mnt"], 0xffffcc280630 /* 28 vars */) = 0
brk(NULL)                               = 0xaaaaf2d40000
faccessat(AT_FDCWD, "/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=20812, ...}) = 0
mmap(NULL, 20812, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffffb73f0000
close(3)                                = 0
openat(AT_FDCWD, "/lib64/libmount.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\267\0\1\0\0\0\260\317\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=403992, ...}) = 0
mmap(NULL, 464112, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xffffb7330000
mmap(0xffffb7390000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x50000) = 0xffffb7390000
close(3)                                = 0
openat(AT_FDCWD, "/lib64/libblkid.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\267\0\1\0\0\0\0\250\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=339288, ...}) = 0
mmap(NULL, 399088, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xffffb72c0000
mmap(0xffffb7310000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x40000) = 0xffffb7310000
close(3)                                = 0
openat(AT_FDCWD, "/lib64/libuuid.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\267\0\1\0\0\0\260\27\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=70080, ...}) = 0
mmap(NULL, 131096, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xffffb7290000
mmap(0xffffb72a0000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0) = 0xffffb72a0000
mmap(0xffffb72b0000, 24, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xffffb72b0000
close(3)                                = 0
openat(AT_FDCWD, "/lib64/librt.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\267\0\1\0\0\0\220\36\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=136656, ...}) = 0
mmap(NULL, 131872, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xffffb7260000
mmap(0xffffb7270000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0) = 0xffffb7270000
close(3)                                = 0
openat(AT_FDCWD, "/lib64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\267\0\1\0\0\0\20o\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=344872, ...}) = 0
mmap(NULL, 271744, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xffffb7210000
mmap(0xffffb7240000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x20000) = 0xffffb7240000
close(3)                                = 0
openat(AT_FDCWD, "/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0\267\0\1\0\0\0000\r\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=4097512, ...}) = 0
mmap(NULL, 1527680, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xffffb7090000
mmap(0xffffb71f0000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x150000) = 0xffffb71f0000
close(3)                                = 0
openat(AT_FDCWD, "/lib64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0\267\0\1\0\0\0\0d\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=531384, ...}) = 0
mmap(NULL, 214016, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xffffb7050000
mmap(0xffffb7070000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x10000) = 0xffffb7070000
close(3)                                = 0
openat(AT_FDCWD, "/lib64/libpcre2-8.so.0", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\267\0\1\0\0\0 $\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=540032, ...}) = 0
mmap(NULL, 590272, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xffffb6fb0000
mmap(0xffffb7030000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x70000) = 0xffffb7030000
close(3)                                = 0
openat(AT_FDCWD, "/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\267\0\1\0\0\0P\17\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=93528, ...}) = 0
mmap(NULL, 131320, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xffffb6f80000
mmap(0xffffb6f90000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0) = 0xffffb6f90000
close(3)                                = 0
mprotect(0xffffb71f0000, 65536, PROT_READ) = 0
mprotect(0xffffb6f90000, 65536, PROT_READ) = 0
mprotect(0xffffb7070000, 65536, PROT_READ) = 0
mprotect(0xffffb7030000, 65536, PROT_READ) = 0
mprotect(0xffffb7240000, 65536, PROT_READ) = 0
mprotect(0xffffb7270000, 65536, PROT_READ) = 0
mprotect(0xffffb72a0000, 65536, PROT_READ) = 0
mprotect(0xffffb7310000, 65536, PROT_READ) = 0
mprotect(0xffffb7390000, 65536, PROT_READ) = 0
mprotect(0xaaaadce70000, 65536, PROT_READ) = 0
mprotect(0xffffb7400000, 65536, PROT_READ) = 0
munmap(0xffffb73f0000, 20812)           = 0
set_tid_address(0xffffb7415f50)         = 1137508
set_robust_list(0xffffb7415f60, 24)     = 0
rt_sigaction(SIGRTMIN, {sa_handler=0xffffb7055e80, sa_mask=[], sa_flags=SA_SIGINFO}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {sa_handler=0xffffb7055f40, sa_mask=[], sa_flags=SA_RESTART|SA_SIGINFO}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
statfs("/sys/fs/selinux", {f_type=SELINUX_MAGIC, f_bsize=65536, f_blocks=0, f_bfree=0, f_bavail=0, f_files=0, f_ffree=0, f_fsid={val=[0, 0]}, f_namelen=255, f_frsize=65536, f_flags=ST_VALID|ST_RELATIME}) = 0
statfs("/sys/fs/selinux", {f_type=SELINUX_MAGIC, f_bsize=65536, f_blocks=0, f_bfree=0, f_bavail=0, f_files=0, f_ffree=0, f_fsid={val=[0, 0]}, f_namelen=255, f_frsize=65536, f_flags=ST_VALID|ST_RELATIME}) = 0
brk(NULL)                               = 0xaaaaf2d40000
brk(0xaaaaf2d70000)                     = 0xaaaaf2d70000
faccessat(AT_FDCWD, "/etc/selinux/config", F_OK) = 0
openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=2997, ...}) = 0
read(3, "# Locale name alias data base.\n#"..., 8192) = 2997
read(3, "", 8192)                       = 0
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/locale/en_US.UTF-8/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/locale/en_US.utf8/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=368, ...}) = 0
mmap(NULL, 368, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffffb73f0000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib64/gconv/gconv-modules.cache", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=26998, ...}) = 0
mmap(NULL, 26998, PROT_READ, MAP_SHARED, 3, 0) = 0xffffb6f70000
close(3)                                = 0
futex(0xffffb7201768, FUTEX_WAKE_PRIVATE, 2147483647) = 0
openat(AT_FDCWD, "/usr/lib/locale/en_US.UTF-8/LC_MEASUREMENT", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/locale/en_US.utf8/LC_MEASUREMENT", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=23, ...}) = 0
mmap(NULL, 23, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffffb6f60000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/locale/en_US.UTF-8/LC_TELEPHONE", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/locale/en_US.utf8/LC_TELEPHONE", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=59, ...}) = 0
mmap(NULL, 59, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffffb6f50000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/locale/en_US.UTF-8/LC_ADDRESS", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/locale/en_US.utf8/LC_ADDRESS", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=167, ...}) = 0
mmap(NULL, 167, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffffb6f40000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/locale/en_US.UTF-8/LC_NAME", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/locale/en_US.utf8/LC_NAME", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=77, ...}) = 0
mmap(NULL, 77, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffffb6f30000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/locale/en_US.UTF-8/LC_PAPER", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/locale/en_US.utf8/LC_PAPER", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=34, ...}) = 0
mmap(NULL, 34, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffffb6f20000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/locale/en_US.UTF-8/LC_MESSAGES", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/locale/en_US.utf8/LC_MESSAGES", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFDIR|0755, st_size=29, ...}) = 0
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/locale/en_US.utf8/LC_MESSAGES/SYS_LC_MESSAGES", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=57, ...}) = 0
mmap(NULL, 57, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffffb6f10000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/locale/en_US.UTF-8/LC_MONETARY", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/locale/en_US.utf8/LC_MONETARY", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=286, ...}) = 0
mmap(NULL, 286, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffffb6f00000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/locale/en_US.UTF-8/LC_COLLATE", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/locale/en_US.utf8/LC_COLLATE", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=2586930, ...}) = 0
mmap(NULL, 2586930, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffffb6c80000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/locale/en_US.UTF-8/LC_TIME", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/locale/en_US.utf8/LC_TIME", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=3316, ...}) = 0
mmap(NULL, 3316, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffffb6c70000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/locale/en_US.UTF-8/LC_NUMERIC", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/locale/en_US.utf8/LC_NUMERIC", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=54, ...}) = 0
mmap(NULL, 54, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffffb6c60000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/locale/en_US.UTF-8/LC_CTYPE", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/locale/en_US.utf8/LC_CTYPE", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=337024, ...}) = 0
mmap(NULL, 337024, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffffb6c00000
close(3)                                = 0
getuid()                                = 0
geteuid()                               = 0
newfstatat(AT_FDCWD, "/mnt", {st_mode=S_IFDIR|0755, st_size=6, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sbin/mount.nfs", {st_mode=S_IFREG|S_ISUID|0755, st_size=208136, ...}, 0) = 0
getuid()                                = 0
geteuid()                               = 0
getgid()                                = 0
getegid()                               = 0
prctl(PR_GET_DUMPABLE)                  = 1 (SUID_DUMP_USER)
newfstatat(AT_FDCWD, "/run", {st_mode=S_IFDIR|0755, st_size=900, ...}, 0) = 0
newfstatat(AT_FDCWD, "/run/mount/utab", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/run/mount/utab", {st_mode=S_IFREG|0644, st_size=0, ...}, 0) = 0
geteuid()                               = 0
getegid()                               = 0
getuid()                                = 0
getgid()                                = 0
faccessat(AT_FDCWD, "/run/mount/utab", R_OK|W_OK) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xffffb7415f50) = 1137509
wait4(1137509, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 1137509
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1137509, si_uid=0, si_status=0, si_utime=0, si_stime=13} ---
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++

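The output shows two attempts to connect to the NFS server, apparently probing which transport protocol the server supports, and the unmount operation logs something similar. So I tried adding proto=tcp to the mount options, and memory usage dropped by half.

[Image: memory usage with proto=tcp]

It looks as if some buffers allocated during the handshake take up too much space. However, the NFS module is built into the kernel, the in-house ARM64 and AMD64 kernels are the same version, and the container product uses the same kernel parameters on both architectures. The most likely cause is therefore the large PAGESIZE of the ARM64 kernel, which inflates every memory-related operation (the NFS kernel code allocates memory in units of PAGESIZE in many places).

PAGESIZE is fixed when the kernel is compiled, and the kernel is customized by the OS team. After my team lead discussed it with the head of the OS team, the OS team decided to keep the status quo and gave no clear reason...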

2.4 Result

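  1. Located the cause of the volume mount failures on ARM64: the customer's operating system uses a PAGESIZE of 64KB, far larger than the usual 4KB, so memory usage is high
  2. Changed the mount options used when creating NFS storage classes to include proto=tcp, which halves the memory usage of mount.nfs on ARM64
  3. Removed the memory limit of csi-nfs-node so that it can use more memory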
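A storage class carrying the adjusted mount options could look like the manifest generated below. This is a sketch: the provisioner name nfs.csi.k8s.io is taken from the kubelet log above, the server and share parameters follow the upstream csi-driver-nfs convention and are assumed to apply to this deployment, and the function name and the StorageClass name are invented for illustration.

```shell
# Print a StorageClass manifest for an NFS share with proto=tcp pinned.
# $1 = StorageClass name, $2 = NFS server address, $3 = exported share path
print_nfs_storageclass() {
    cat <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: $1
provisioner: nfs.csi.k8s.io
parameters:
  server: $2
  share: $3
mountOptions:
  - vers=3
  - nolock
  - proto=tcp
EOF
}

# Possible usage (not executed here; "nfs-tcp" is a hypothetical name):
#   print_nfs_storageclass nfs-tcp 10.255.176.233 /share_72adb8c2b | kubectl apply -f -
```

Pinning proto=tcp skips the transport probing at mount time, which is the step that appeared to double the memory footprint in the test.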

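3. Additional Notes

3.1 NFS Server Failure

After I had located and dealt with the ARM64 memory problem, NFS volumes still failed to mount. This time the log of the nfs container in csi-nfs-node looked like this:

[Image: nfs container log of csi-nfs-node]

Logging in to the affected ECS node and running dmesg -T showed a large number of messages like kernel: nfs: server xxxxx not responding, still retrying

After a long investigation, we finally confirmed that two of the four storage servers in the customer environment were faulty: their IPs answered ping, but mounting an NFS share from them with the mount command hung. The issue was resolved by restarting the ganesha service on the faulty nodes.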
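With several storage servers behind the service, counting the "not responding" messages per server address narrows down which backends are unhealthy. A minimal sketch, assuming the kernel message format quoted above; unresponsive_nfs_servers is a made-up name.

```shell
# Count "nfs: server <addr> not responding" kernel messages per server.
# Reads kernel log text on stdin and prints "<server> <count>" lines.
unresponsive_nfs_servers() {
    sed -n 's/.*nfs: server \([^ ]*\) not responding.*/\1/p' \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}

# Possible usage on an affected node (not executed here):
#   dmesg -T | unresponsive_nfs_servers
```

Servers that never appear in the output are not necessarily healthy; they may simply have had no client traffic from this node.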

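Before this was confirmed, I wrestled with the storage team's SRE for a full two weeks. They refused technical support, citing one improper-usage reason after another (kernel parameters, mount options, usage patterns), and throughout the whole process no storage developer familiar with the NFS protocol was brought in to help. It was quite an ordeal.

3.2 The PAGESIZE Question

PAGESIZE generally depends on the processor architecture and the operating system. For example, an M1 Pro running macOS 14.5 uses 16KB: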

➜  ~ uname -sr
Darwin 23.5.0
➜  ~ getconf PAGE_SIZE
16384

AMD64 Linux systems generally use 4KB, and an Oracle ARM64 server I have on hand also uses 4KB:

➜  ~ uname -sr
Linux 6.5.0-1021-oracle
➜  ~ getconf PAGE_SIZE
4096
➜  ~ lscpu
Architecture:           aarch64
  CPU op-mode(s):       32-bit, 64-bit
  Byte Order:           Little Endian
CPU(s):                 4
  On-line CPU(s) list:  0-3
Vendor ID:              ARM
  Model name:           Neoverse-N1
......

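Yet after the company's OS kernel was upgraded from 4.18 to 5.15, PAGE_SIZE is still 64KB. I can only guess that this was forced by the need to support domestic processors, or that an early stopgap has simply been carried forward to this day. Either way, the mount.nfs problem looks unlikely to be fixed at its root.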