Cannot Start VM After Pausing due to Storage I/O Error

Noticed I had a VM that was 'paused' due to a 'Storage I/O error'. I inherited this system from another admin and have no idea where to start figuring this out. We have a 4-node oVirt cluster with a 5th node as the Manager. The VM in question runs on host vm-host-colo-4. Best I can tell, the VMs run on a Gluster replicated volume spanning all 4 nodes, with node 1 acting as the arbiter for the volume. Other VMs are running on host 4, so I am not sure what the issue is with this one VM.

When I look at the status of the Gluster volume for this host, the self-heal info for the bricks is listed as 'N/A' for this host; all the other hosts in the cluster list it as 'OK'. When I cd into the gluster directory on host 4, I don't see the same contents as on the other hosts. I am not sure that is a problem, but it is different. When running various gluster commands, gluster does respond. See below:

[root@vm-host-colo-4 gluster]# gluster volume info all

Volume Name: gl-colo-1
Type: Replicate
Volume ID: 2c545e19-9468-487e-9e9b-cd3202fc24c4
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.20.101.181:/gluster/gl-colo-1/brick1
Brick2: 10.20.101.183:/gluster/gl-colo-1/brick1
Brick3: 10.20.101.185:/gluster/gl-colo-1/brick1 (arbiter)
Options Reconfigured:
network.ping-timeout: 30
cluster.granular-entry-heal: enable
performance.strict-o-direct: on
storage.owner-gid: 36
storage.owner-uid: 36
cluster.choose-local: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: off
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
auth.allow: *
user.cifs: off
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

Volume Name: gl-vm-host-4
Type: Distribute
Volume ID: a2ba6b29-2366-4a7e-bda8-2e0574cf4afa
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: 10.20.101.187:/gluster/gl-vm-host-colo-4
Options Reconfigured:
network.ping-timeout: 30
cluster.granular-entry-heal: enable
network.remote-dio: off
performance.strict-o-direct: on
storage.owner-gid: 36
storage.owner-uid: 36
auth.allow: *
user.cifs: disable
transport.address-family: inet
nfs.disable: on

[root@vm-host-colo-4 gluster]# gluster-eventsapi status
Webhooks: http://mydesktop.altn.int:80/ovirt-engine/services/glusterevents

+-------------------------+-------------+-----------------------+
|           NODE          | NODE STATUS | GLUSTEREVENTSD STATUS |
+-------------------------+-------------+-----------------------+
| vm-host-colo-1.altn.int |      UP     |           OK          |
| vm-idev-colo-1.altn.int |      UP     |           OK          |
| vm-host-colo-2.altn.int |      UP     |           OK          |
|        localhost        |      UP     |           OK          |
+-------------------------+-------------+-----------------------+

[root@vm-host-colo-4 gluster]# gluster volume status gl-vm-host-4
Status of volume: gl-vm-host-4
Gluster process                                  TCP Port  RDMA Port  Online  Pid
----------------------------------------------------------------------------------
Brick 10.20.101.187:/gluster/gl-vm-host-colo-4   49152     0          Y       33221

Task Status of Volume gl-vm-host-4
----------------------------------------------------------------------------------
There are no active volume tasks

I also get a timeout error when doing a 'gluster volume status' (for all volumes) on this node.
So while some aspects of the Gluster volume seem fine, others don't. Should I restart the glusterd daemon, or will that make things worse? I am not sure whether this is a problem with the Gluster volume itself or with the host's ability to access the data for the VM disk, i.e. a true I/O problem. There are two VMs in this state, both running on this host, and I am not sure how to get them running again. Should I force the VM onto a different host by editing it, or try to make it work on the host it is currently on? As mentioned, many other VMs are running on this host, so I am not sure why these two have an issue. Apologies up front: I am a network engineer, not a VM/oVirt expert. This was dropped in my lap due to a layoff, and I could use some help on where to go from here. Thanks in advance for any help.
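A minimal set of read-only checks on host 4, before touching glusterd, would look something like this (the volume name is taken from the output above; 'heal info summary' may not exist on older Gluster releases, plain 'heal info' does):

# peer and brick health for the replicated volume
gluster peer status
gluster volume status gl-colo-1

# pending or failed self-heals on the replicated volume
gluster volume heal gl-colo-1 info
gluster volume heal gl-colo-1 info summary

# is glusterd itself (and the self-heal daemon) running on this host?
systemctl status glusterd
ps -ef | grep glustershd

None of these change anything, so they should be safe to run while deciding how to proceed.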

OK, looking into this further, it seems I was incorrect about the Gluster situation on these hosts. Three of the hosts form a replicated, arbitrated Gluster volume between them. There is a second Gluster volume which is distributed but only seems to exist on the 4th host (the one the failed VMs are running on). I am not sure why it was set up this way. Looking closer, I then realized these VMs' disks are actually located on an NFS share. I can access and write to this NFS share from the host itself, so I am not sure why I cannot get this VM to come up; everything seems to be in place for it to do so. The error I get when trying to start the VM is:

<vm-name> has been paused due to storage I/O problem.

How can I determine what this I/O problem actually is? Is the disk file corrupted somehow? Both VMs that won't start are using this NFS share. Others that are running also use it, so I am not sure what the problem is or where to start looking for an answer. Thanks in advance for your help.
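A rough way to narrow this down on the host would be something like the following (paths in angle brackets are placeholders; on an oVirt host the NFS storage domain is normally mounted under /rhev/data-center/mnt/):

# what does VDSM say around the time the VM was paused?
grep -iE 'pause|I/O|abnormal' /var/log/vdsm/vdsm.log | tail -n 50

# can the storage domain metadata be read with O_DIRECT, roughly the way VDSM's domain monitoring reads it?
dd if=/rhev/data-center/mnt/<nas>:_<export>/<sd-uuid>/dom_md/metadata of=/dev/null bs=4096 count=1 iflag=direct

# with the VM down, check the disk image itself (only meaningful for qcow2 images; raw images do not support checks)
qemu-img check /rhev/data-center/mnt/<nas>:_<export>/<sd-uuid>/images/<disk-uuid>/<volume-uuid>

If the dd read stalls or errors, the problem is between the host and the NAS; if it is clean but qemu-img reports corruption, the image itself is the issue.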

Moin,

Normally you can use virsh to resume a VM that is in the paused state:
https://www.google.com/amp/s/www.cyberciti.biz/faq/linux-list-a-kvm-vm-guest...

You have to use 'saslpasswd2 -a libvirt <username>' to get access on the command line. It is important that you resume the VM on the same host where it was paused.

Good luck...

Br
Marcel
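On an oVirt host that would look roughly like this (the username is only an example; note that oVirt expects to manage VM state itself, so resuming via virsh is more of a last resort):

# create a SASL user so virsh accepts a read-write connection
saslpasswd2 -a libvirt admin

# list the domains on this host, then resume the paused one
virsh -c qemu:///system list --all
virsh -c qemu:///system resume <vm-name>

virsh will prompt for the username and password created with saslpasswd2.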

Can you provide the output from all nodes:

gluster pool list
gluster peer status
gluster volume status

Best Regards,
Strahil Nikolov

Which QEMU version are you using? It might be related to https://bugzilla.redhat.com/show_bug.cgi?id=1994494

Jean-Louis
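To check which version a host is actually running, something like this should do (the package name assumes an EL-based oVirt host; it may be qemu-kvm-ev or similar on older setups):

rpm -q qemu-kvm
/usr/libexec/qemu-kvm --version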

Sorry, after some further investigation, this is not using a Gluster share after all. These particular VMs were set up on an NFS share on a NAS device we have. I don't know why that was done for these particular VMs; maybe it was a temporary arrangement and the idea was to move them into the Gluster setup eventually. At any rate, the paused VMs don't appear to be using Gluster mounts for storage.

Thanks for the reply. I ended up moving the disk for another VM that was also paused and using the same NFS share for storage: I moved the disk to a different share and then moved it back, and to my surprise the VM started up. I then tried to start the VM this thread opened with, and it also started. So this appears to have been a transient network error that made the NFS share the VMs are stored on briefly inaccessible.

I am not sure what sort of checks oVirt does to determine whether the share is accessible, but we have seen this before with our oVirt setup. Rebooting switches for maintenance, for example, would pause VMs, and rebooting the Manager server after the network stabilized seemed to fix everything. I looked through the network logs to see what might have caused this, but didn't see any issue with the devices hosting the NFS share, so I couldn't be sure what happened. It seems oVirt is very intolerant of any network disturbance whatsoever, even ones designed to improve network stability and availability (spanning tree, failover mechanisms, etc.). In the event of a network blip, why wouldn't oVirt try accessing the share again once the network stabilized? Perhaps this is simply down to the design of our particular installation, but it seems really finicky about anything changing on the network. If we were to design this again, I wouldn't rely on network storage to run VMs and would put everything on Gluster mounts local to the hosts. I would appreciate any ideas on why this might have happened.

I'll keep the link handy for running VMs via the command line in the future. Thanks.

Bob
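For reference, the manual reachability checks from the host that would mimic what failed here look something like this (the NAS address and export are placeholders):

# basic reachability and NFS service checks against the NAS
ping -c 3 <nas-ip>
showmount -e <nas-ip>
rpcinfo -p <nas-ip>

# confirm the storage domain is still mounted on the host
mount | grep <nas-ip>

As far as I understand it, VDSM on each host periodically reads the storage domain metadata to monitor it, and QEMU itself pauses a VM the moment a guest read or write fails, which is what produces the 'paused due to storage I/O problem' event; so even a brief NFS outage is enough to pause VMs that happened to have I/O in flight.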

Incidentally, I also found this in the oVirt logs on the host:

MainProcess|jsonrpc/7::DEBUG::2021-09-09 15:36:32,791::commands::219::root::(execCmd) FAILED: <err> = 'mount.nfs: No route to host\n'; <rc> = 32
MainProcess|jsonrpc/7::DEBUG::2021-09-09 15:36:32,792::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 125 seconds (max pending: 0)
MainProcess|jsonrpc/7::ERROR::2021-09-09 15:36:32,793::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in mount

Not sure what it means. 'No route to host' is not accurate, as the host serving the NFS mount was completely accessible from a network perspective. Not sure if this is relevant, but I thought I would include it.
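'No route to host' typically corresponds to EHOSTUNREACH, which can be produced by an ICMP host-unreachable (for example from a firewall reject rule) or a failed ARP resolution during a brief network reconvergence, rather than a literally missing route, so it can appear even when the NAS is pingable moments later. A simple way to reproduce what that failing mount call does, should it happen again (server and export are placeholders; VDSM's exact mount options may differ):

mkdir -p /mnt/nfs-test
mount -t nfs <nas-ip>:/<export> /mnt/nfs-test
ls /mnt/nfs-test
umount /mnt/nfs-test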