Hello,
in the environment in the subject, I downloaded the CentOS 8 image from the Glance repository
CentOS 8 Generic Cloud Image v20200113.3 for x86_64 (5e35c84)
and imported it as a template.
I created a VM based on it (I got the message: "In order to create a VM from a template with a different chipset, device configuration will be changed. This may affect functionality of the guest software. Are you sure you want to proceed?").
When running "dnf update", the VM went into a paused state during the package update I/O.
VM c8desktop started on Host novirt2.example.net 5/28/20 1:29:04 PM
VM c8desktop has been paused. 5/28/20 1:41:52 PM
VM c8desktop has been paused due to unknown storage error. 5/28/20 1:41:52 PM
VM c8desktop has recovered from paused back to up. 5/28/20 1:43:50 PM
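While it was paused I could also have queried the pause reason directly on the host; for reference, something like this (virsh domstate is standard libvirt, and for a storage error it typically reports "paused (I/O error)"):

# On the nested host: show the domain state together with the reason
virsh domstate c8desktop --reason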
In the messages log of the (nested) host I see:
May 28 13:28:06 novirt2 systemd-machined[1497]: New machine qemu-7-c8desktop.
May 28 13:28:06 novirt2 systemd[1]: Started Virtual Machine qemu-7-c8desktop.
May 28 13:28:06 novirt2 kvm[57798]: 2 guests now active
May 28 13:28:07 novirt2 journal[13368]: Guest agent is not responding: QEMU guest agent is not connected
May 28 13:28:12 novirt2 journal[13368]: Guest agent is not responding: QEMU guest agent is not connected
May 28 13:28:17 novirt2 journal[13368]: Guest agent is not responding: QEMU guest agent is not connected
May 28 13:28:22 novirt2 journal[13368]: Guest agent is not responding: QEMU guest agent is not connected
May 28 13:28:27 novirt2 journal[13368]: Guest agent is not responding: QEMU guest agent is not connected
May 28 13:28:32 novirt2 journal[13368]: Guest agent is not responding: QEMU guest agent is not connected
May 28 13:28:37 novirt2 journal[13368]: Domain id=7 name='c8desktop' uuid=63e27cb5-087d-435e-bf61-3fe25e3319d6 is tainted: custom-ga-command
May 28 13:28:37 novirt2 journal[26984]: Cannot open log file: '/var/log/libvirt/qemu/c8desktop.log': Device or resource busy
May 28 13:28:37 novirt2 journal[13368]: Cannot open log file: '/var/log/libvirt/qemu/c8desktop.log': Device or resource busy
May 28 13:28:37 novirt2 journal[13368]: Unable to open domainlog
May 28 13:30:00 novirt2 systemd[1]: Starting system activity accounting tool...
May 28 13:30:00 novirt2 systemd[1]: Started system activity accounting tool.
May 28 13:37:21 novirt2 python3[62512]: detected unhandled Python exception in '/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py'
May 28 13:37:21 novirt2 abrt-server[62514]: Deleting problem directory Python3-2020-05-28-13:37:21-62512 (dup of Python3-2020-05-28-10:32:57-29697)
May 28 13:37:21 novirt2 dbus-daemon[1502]: [system] Activating service name='org.freedesktop.problems' requested by ':1.3111' (uid=0 pid=62522 comm="/usr/libexec/platform-python /usr/bin/abrt-action-" label="system_u:system_r:abrt_t:s0-s0:c0.c1023") (using servicehelper)
May 28 13:37:21 novirt2 dbus-daemon[1502]: [system] Successfully activated service 'org.freedesktop.problems'
May 28 13:37:21 novirt2 abrt-server[62514]: /bin/sh: reporter-systemd-journal: command not found
May 28 13:37:21 novirt2 python3[62550]: detected unhandled Python exception in '/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py'
May 28 13:37:21 novirt2 abrt-server[62552]: Not saving repeating crash in '/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py'
May 28 13:37:22 novirt2 python3[62578]: detected unhandled Python exception in '/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py'
May 28 13:37:22 novirt2 abrt-server[62584]: Not saving repeating crash in '/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py'
May 28 13:40:00 novirt2 systemd[1]: Starting system activity accounting tool...
May 28 13:40:00 novirt2 systemd[1]: Started system activity accounting tool.
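The "Cannot open log file ... Device or resource busy" entries look like two processes contending for the same domain log; if useful, something along these lines should show who is holding it open (lsof and virtlogd are standard, the path is taken from the log above):

# On the nested host: which process keeps the domain log open?
lsof /var/log/libvirt/qemu/c8desktop.log
systemctl status virtlogd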
In the log of the related Gluster volume hosting the VM disk I have:
[2020-05-28 11:41:33.892074] W [MSGID: 114031] [client-rpc-fops_v2.c:679:client4_0_writev_cbk] 0-vmstore-client-0: remote operation failed [Invalid argument]
[2020-05-28 11:41:33.892140] W [fuse-bridge.c:2925:fuse_writev_cbk] 0-glusterfs-fuse: 348168: WRITE => -1 gfid=35ae86e8-0ccd-48b8-9ef2-6ca9a108ccf9 fd=0x7fd1d800cf38 (Invalid argument)
[2020-05-28 11:41:33.902984] I [MSGID: 133022] [shard.c:3693:shard_delete_shards] 0-vmstore-shard: Deleted shards of gfid=35ae86e8-0ccd-48b8-9ef2-6ca9a108ccf9 from backend
[2020-05-28 11:41:52.434362] E [MSGID: 133010] [shard.c:2339:shard_common_lookup_shards_cbk] 0-vmstore-shard: Lookup on shard 6 failed. Base file gfid = 3e12e7fe-6a77-41b8-932a-d4f50c41ac00 [No such file or directory]
[2020-05-28 11:41:52.434423] W [fuse-bridge.c:2925:fuse_writev_cbk] 0-glusterfs-fuse: 353565: WRITE => -1 gfid=3e12e7fe-6a77-41b8-932a-d4f50c41ac00 fd=0x7fd208093fb8 (No such file or directory)
[2020-05-28 11:46:34.095697] W [MSGID: 114031] [client-rpc-fops_v2.c:679:client4_0_writev_cbk] 0-vmstore-client-0: remote operation failed [Invalid argument]
[2020-05-28 11:46:34.095758] W [fuse-bridge.c:2925:fuse_writev_cbk] 0-glusterfs-fuse: 384006: WRITE => -1 gfid=7b804a1a-1734-4bec-b8f4-9ba33ffefe8b fd=0x7fd1d0005fd8 (Invalid argument)
[2020-05-28 11:46:34.104494] I [MSGID: 133022] [shard.c:3693:shard_delete_shards] 0-vmstore-shard: Deleted shards of gfid=7b804a1a-1734-4bec-b8f4-9ba33ffefe8b from backend
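For completeness, the sharding settings in play can be dumped like this (volume name vmstore taken from the log prefixes above):

# Current sharding options on the volume
gluster volume get vmstore features.shard
gluster volume get vmstore features.shard-block-size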
I saw very similar sharding messages in 4.3.9 on a single host with Gluster, during "heavy"/sudden I/O operations on thin provisioned disks.
In 4.4 I see the Gluster version is glusterfs-7.5-1.el8.x86_64.
Could this be a problem that only appears on a single host? There is indeed no data travelling over the network to sync other nodes, so perhaps the sharding feature is for some reason unable to keep pace when the local disk is very fast.
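If anyone wants to try to reproduce it, in my case the pause is triggered by a sudden burst of sequential writes inside the guest on a thin provisioned disk; a crude sketch (file path and size are arbitrary):

# Inside the guest: burst of direct sequential writes to stress the thin disk
dd if=/dev/zero of=/var/tmp/iotest bs=1M count=4096 oflag=direct
rm -f /var/tmp/iotest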
In 4.3.9 on a single host with gluster 6.8-1, the only way I found to solve it was to disable sharding, which finally gave me stability; see here
I am still waiting for comments from the Gluster devs on the logs I provided at that time.
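For reference, disabling it amounts to a single volume option (volume name vmstore assumed; note that the Gluster docs warn against turning sharding off on a volume that already contains sharded files, so in production the VM disks would have to be moved off the volume first):

# Only safe before sharded files exist, or after evacuating the VM disks
gluster volume set vmstore features.shard off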
As I already wrote, in my opinion the single-host wizard should disable sharding automatically, because in that environment it can make thin provisioned disks unusable.
If nodes are added later, the setup could run a check and tell the user to re-enable sharding.
Just my 0.2 eurocent
Gianluca