Hi,
I'd like to share a couple of observations from the scale system, just a few chaotic notes,
mostly for documentation purposes :-)
host: 4 sockets, 8 cores each, HT -> 64 CPUs
samples were taken in more or less stable conditions
1) running 100 VMs; top sample with per-process usage (all threads summed up)
top - 15:00:48 up 5 days, 6:31, 1 user, load average: 2.25, 1.90, 2.05
Tasks: 1989 total, 4 running, 1983 sleeping, 1 stopped, 1 zombie
Cpu(s): 4.7%us, 1.9%sy, 0.0%ni, 93.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 396900992k total, 80656384k used, 316244608k free, 156632k buffers
Swap: 41148408k total, 0k used, 41148408k free, 12225520k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9057 vdsm 0 -20 17.5g 482m 9m S 162.7 0.1 446:46.28 vdsm
8726 root 20 0 979m 17m 7996 R 135.6 0.0 68:56.42 libvirtd
36937 qemu 20 0 1615m 296m 6328 S 59.2 0.1 0:19.18 qemu-kvm
38174 root 20 0 16496 2724 880 R 32.1 0.0 0:00.54 top
2458 qemu 20 0 1735m 533m 6328 S 11.1 0.1 2:38.87 qemu-kvm
10203 qemu 20 0 1736m 511m 6328 S 11.1 0.1 2:32.53 qemu-kvm
27774 qemu 20 0 1730m 523m 6328 S 11.1 0.1 2:22.36 qemu-kvm
25208 qemu 20 0 1733m 514m 6328 S 9.9 0.1 2:22.47 qemu-kvm
51594 qemu 20 0 1733m 650m 6328 S 9.9 0.2 3:53.42 qemu-kvm
… etc
[this one is not from stable conditions, unfortunately, as can be seen from PID 36937, a qemu-kvm with only 0:19 of CPU time]
----------------------------
2) running 185 VMs, all threads summed up. VDSM has 411 threads; load is much higher; also
note the high sys time (21.0%sy, versus 1.9%sy with 100 VMs)
top - 07:10:28 up 5 days, 22:41, 1 user, load average: 19.10, 14.28, 13.17
Tasks: 2318 total, 9 running, 2308 sleeping, 0 stopped, 1 zombie
Cpu(s): 10.8%us, 21.0%sy, 0.0%ni, 67.8%id, 0.1%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 396900992k total, 157267616k used, 239633376k free, 175700k buffers
Swap: 41148408k total, 0k used, 41148408k free, 12669856k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9057 vdsm 0 -20 30.1g 856m 10m S 883.4 0.2 9818:59 vdsm
8726 root 20 0 975m 19m 7996 R 142.4 0.0 1370:15 libvirtd
19542 root 20 0 16700 3108 1020 R 15.1 0.0 0:18.11 top
17614 qemu 20 0 1730m 692m 6328 S 6.5 0.2 49:05.53 qemu-kvm
55545 qemu 20 0 1732m 708m 6328 S 6.3 0.2 48:42.01 qemu-kvm
28542 qemu 20 0 1724m 696m 6328 S 6.2 0.2 44:44.50 qemu-kvm
12482 qemu 20 0 1738m 822m 6328 S 6.0 0.2 51:02.71 qemu-kvm
… etc
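To make the sys time comparison between samples 1) and 2) easier to repeat, here is a minimal sketch that parses the "Cpu(s):" summary lines quoted above. The function name and the conversion to "CPUs' worth" of kernel time are my own illustration; the conversion assumes the summary line is an average over all 64 CPUs.

```python
import re

def parse_cpu_line(line):
    """Parse a top 'Cpu(s):' summary line into {field: percent}."""
    # Each entry looks like '21.0%sy': capture the number and the field name.
    return {name: float(value)
            for value, name in re.findall(r"([\d.]+)%(\w+)", line)}

# Summary lines copied from the 100 VM and 185 VM samples.
cpu_100_vms = parse_cpu_line(
    "Cpu(s): 4.7%us, 1.9%sy, 0.0%ni, 93.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st")
cpu_185_vms = parse_cpu_line(
    "Cpu(s): 10.8%us, 21.0%sy, 0.0%ni, 67.8%id, 0.1%wa, 0.0%hi, 0.2%si, 0.0%st")

for name in ("us", "sy", "id"):
    print(name, cpu_100_vms[name], "->", cpu_185_vms[name])

# Assumption: the percentages are averaged over all 64 logical CPUs, so
# 21.0%sy corresponds to this many CPUs spent in the kernel.
sys_cpus = cpu_185_vms["sy"] / 100 * 64
print("CPUs' worth of sys time at 185 VMs:", round(sys_cpus, 1))
```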
breakdown per thread:
top - 07:05:43 up 5 days, 22:36, 1 user, load average: 12.50, 11.15, 12.00
Tasks: 3357 total, 35 running, 3321 sleeping, 0 stopped, 1 zombie
Cpu(s): 11.3%us, 16.0%sy, 0.0%ni, 72.5%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 396900992k total, 157238240k used, 239662752k free, 175700k buffers
Swap: 41148408k total, 0k used, 41148408k free, 12669856k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8726 root 20 0 979m 19m 7996 R 81.6 0.0 860:14.21 libvirtd
9103 vdsm 0 -20 30.1g 855m 10m R 63.7 0.2 502:03.90 vdsm
9144 vdsm 0 -20 30.1g 855m 10m R 59.5 0.2 276:23.93 vdsm
11150 vdsm 0 -20 30.1g 855m 10m R 59.0 0.2 399:37.86 vdsm
9100 vdsm 0 -20 30.1g 855m 10m R 42.7 0.2 354:41.75 vdsm
18053 root 20 0 17708 3968 1020 R 17.2 0.0 0:17.46 top
11845 vdsm 0 -20 30.1g 855m 10m S 15.4 0.2 114:48.08 vdsm
8755 root 20 0 979m 19m 7996 S 13.0 0.0 81:25.64 libvirtd
8753 root 20 0 979m 19m 7996 S 12.7 0.0 81:21.16 libvirtd
64396 root 20 0 979m 19m 7996 S 12.4 0.0 80:03.68 libvirtd
8754 root 20 0 979m 19m 7996 R 10.0 0.0 81:26.52 libvirtd
8751 root 20 0 979m 19m 7996 S 9.9 0.0 81:18.83 libvirtd
8752 root 20 0 979m 19m 7996 R 9.7 0.0 81:28.07 libvirtd
52567 vdsm 0 -20 30.1g 855m 10m S 4.9 0.2 9:27.75 vdsm
30617 vdsm 0 -20 30.1g 855m 10m S 3.8 0.2 34:40.75 vdsm
40621 vdsm 0 -20 30.1g 855m 10m S 3.8 0.2 34:12.40 vdsm
8952 vdsm 0 -20 30.1g 855m 10m S 3.8 0.2 24:03.79 vdsm
29818 vdsm 0 -20 30.1g 855m 10m S 3.7 0.2 9:21.33 vdsm
31418 vdsm 0 -20 30.1g 855m 10m S 3.7 0.2 35:09.03 vdsm
6858 vdsm 0 -20 30.1g 855m 10m S 3.7 0.2 34:10.79 vdsm
18513 vdsm 0 -20 30.1g 855m 10m S 3.7 0.2 34:44.03 vdsm
46247 vdsm 0 -20 30.1g 855m 10m S 3.7 0.2 34:16.65 vdsm
50759 vdsm 0 -20 30.1g 855m 10m S 3.7 0.2 34:04.86 vdsm
58612 vdsm 0 -20 30.1g 855m 10m S 3.7 0.2 31:58.11 vdsm
25872 vdsm 0 -20 30.1g 855m 10m S 3.7 0.2 31:03.81 vdsm
31599 vdsm 0 -20 30.1g 855m 10m S 3.7 0.2 31:10.85 vdsm
… etc
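The relation between the per-thread view and the "all threads summed up" view can be sketched as follows: sum the %CPU column of the per-thread rows by COMMAND. This is only an illustration with a hypothetical helper name and four rows copied from the listing above; since the listing is truncated, the sums will not match the per-process totals.

```python
from collections import defaultdict

def cpu_by_command(rows):
    """Sum the %CPU column of top rows, grouped by the COMMAND column."""
    totals = defaultdict(float)
    for row in rows:
        fields = row.split()
        # Columns: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
        totals[fields[11]] += float(fields[8])
    return dict(totals)

# A few rows copied from the per-thread sample (the full list is truncated).
sample = [
    "8726 root 20 0 979m 19m 7996 R 81.6 0.0 860:14.21 libvirtd",
    "9103 vdsm 0 -20 30.1g 855m 10m R 63.7 0.2 502:03.90 vdsm",
    "9144 vdsm 0 -20 30.1g 855m 10m R 59.5 0.2 276:23.93 vdsm",
    "8755 root 20 0 979m 19m 7996 S 13.0 0.0 81:25.64 libvirtd",
]

totals = cpu_by_command(sample)
for command, cpu in sorted(totals.items()):
    print(command, round(cpu, 1))
```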
overall network usage:
0.5-3 Mbps, varying, roughly corresponding to a 15s interval
----------------------------
3) special case when vdsm was down, 185 VMs
top - 08:34:11 up 6 days, 4 min, 2 users, load average: 5.96, 5.20, 8.71
Tasks: 2314 total, 7 running, 2306 sleeping, 0 stopped, 1 zombie
Cpu(s): 8.6%us, 4.9%sy, 0.0%ni, 86.1%id, 0.0%wa, 0.0%hi, 0.4%si, 0.0%st
Mem: 396900992k total, 157512324k used, 239388668k free, 180100k buffers
Swap: 41148408k total, 0k used, 41148408k free, 12726620k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
47776 root 20 0 17212 3592 1020 R 16.5 0.0 0:16.32 top
35549 qemu 20 0 1806m 809m 6328 S 6.7 0.2 53:46.20 qemu-kvm
43240 qemu 20 0 1735m 694m 6328 S 6.5 0.2 44:58.63 qemu-kvm
51594 qemu 20 0 1733m 822m 6328 S 6.3 0.2 54:19.51 qemu-kvm
24881 qemu 20 0 1726m 704m 6328 S 6.1 0.2 48:50.63 qemu-kvm
58563 qemu 20 0 1728m 699m 6328 S 6.1 0.2 50:14.10 qemu-kvm
… etc
--------------------------
4) no disk space on /var/log (opened BZ 1115357); libvirtd keeps a deleted ~1.6 GB log file open (lsof):
libvirtd 8726 root 4w REG 253,8 1638998016 34 /var/log/libvirtd.log (deleted)
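The space of a deleted file is only released once the last open descriptor is closed, which is why it does not show up in a directory listing. A rough sketch of how such files could be found without lsof, assuming a Linux /proc layout where the fd symlink target ends in " (deleted)"; the function name and the proc_root parameter are hypothetical, the latter only there to make the sketch testable:

```python
import os

def deleted_open_files(pid, proc_root="/proc"):
    """Return (fd, target, size) for deleted files a process still holds open."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    found = []
    for fd in os.listdir(fd_dir):
        path = os.path.join(fd_dir, fd)
        try:
            target = os.readlink(path)
        except OSError:
            continue  # descriptor was closed in the meantime
        if not target.endswith(" (deleted)"):
            continue
        try:
            # On a real /proc, stat() follows the link to the still-open inode.
            size = os.stat(path).st_size
        except OSError:
            size = None
        found.append((int(fd), target, size))
    return sorted(found)

# Demo on the current process; only runs where /proc is available.
if os.path.isdir("/proc/self/fd"):
    for fd, target, size in deleted_open_files(os.getpid()):
        print(fd, target, size)
```

For the case above this would be called with PID 8726 (libvirtd), with sufficient privileges to read that process's fd directory.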
--------------------------
5) startup of vdsm in the 185-VM environment:
on vdsm service startup, the "vdsm: Running nwfilter" step took ~5 minutes to finish,
then VM recovery took ~20-30 minutes!
Overall, we need to identify the specific threads and simulate specific issues in a
debug-friendly environment to be able to tell a bit more…
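As a possible first step for identifying the threads, here is a minimal sketch that reads per-thread user and system CPU time from /proc/<pid>/task/<tid>/stat and sorts by system time. The helper names are hypothetical, and this only yields thread IDs and kernel-visible names; mapping them to logical vdsm threads is not covered here.

```python
import os

def parse_stat(text):
    """Return (comm, utime_ticks, stime_ticks) from a /proc stat line."""
    # comm is parenthesised and may contain spaces, so split at the last ')'.
    head, _, tail = text.rpartition(")")
    comm = head.split("(", 1)[1]
    rest = tail.split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return comm, int(rest[11]), int(rest[12])

def thread_cpu_times(pid, proc_root="/proc"):
    """Return (tid, comm, user_seconds, sys_seconds), highest sys time first."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    ticks = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    rows = []
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "stat")) as f:
                comm, utime, stime = parse_stat(f.read())
        except (OSError, IndexError, ValueError):
            continue  # thread exited or unexpected format
        rows.append((int(tid), comm, utime / ticks, stime / ticks))
    return sorted(rows, key=lambda r: r[3], reverse=True)

# Demo on the current process; only runs where /proc is available.
if os.path.isdir("/proc/self/task"):
    for row in thread_cpu_times(os.getpid())[:5]:
        print(row)
```

Sampling this twice, a few seconds apart, and diffing the values would show which threads are accumulating sys time at that moment, as opposed to since startup.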
Thanks,
michal