
Noticed I had a VM that was 'paused' due to a 'Storage I/O error'. I inherited this system from another admin and have no idea where to start figuring this out. We have a 4-node oVirt cluster plus a 5th node running the Manager. The VM in question is running on host vm-host-colo-4. Best I can tell, the VMs run on a Gluster volume replicated between all 4 nodes, with node 1 acting as an arbiter node for the Gluster volume. Other VMs are running fine on host 4, so I'm not sure what the issue is with this one VM.

When I look at the status of the Gluster volume for this host, the self-heal info for the bricks is listed as 'N/A' for this host; all the other hosts in the cluster list this info as 'OK'. When I cd into the gluster directory on host 4, I don't see the same things as I do on the other hosts. I'm not sure that's an issue, but it's just different. When running various gluster commands, gluster does seem to respond. See below:

[root@vm-host-colo-4 gluster]# gluster volume info all

Volume Name: gl-colo-1
Type: Replicate
Volume ID: 2c545e19-9468-487e-9e9b-cd3202fc24c4
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.20.101.181:/gluster/gl-colo-1/brick1
Brick2: 10.20.101.183:/gluster/gl-colo-1/brick1
Brick3: 10.20.101.185:/gluster/gl-colo-1/brick1 (arbiter)
Options Reconfigured:
network.ping-timeout: 30
cluster.granular-entry-heal: enable
performance.strict-o-direct: on
storage.owner-gid: 36
storage.owner-uid: 36
cluster.choose-local: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: off
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
auth.allow: *
user.cifs: off
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

Volume Name: gl-vm-host-4
Type: Distribute
Volume ID: a2ba6b29-2366-4a7e-bda8-2e0574cf4afa
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: 10.20.101.187:/gluster/gl-vm-host-colo-4
Options Reconfigured:
network.ping-timeout: 30
cluster.granular-entry-heal: enable
network.remote-dio: off
performance.strict-o-direct: on
storage.owner-gid: 36
storage.owner-uid: 36
auth.allow: *
user.cifs: disable
transport.address-family: inet
nfs.disable: on

[root@vm-host-colo-4 gluster]# gluster-eventsapi status
Webhooks:
http://mydesktop.altn.int:80/ovirt-engine/services/glusterevents

+-------------------------+-------------+-----------------------+
|          NODE           | NODE STATUS | GLUSTEREVENTSD STATUS |
+-------------------------+-------------+-----------------------+
| vm-host-colo-1.altn.int |     UP      |          OK           |
| vm-idev-colo-1.altn.int |     UP      |          OK           |
| vm-host-colo-2.altn.int |     UP      |          OK           |
|        localhost        |     UP      |          OK           |
+-------------------------+-------------+-----------------------+

[root@vm-host-colo-4 gluster]# gluster volume status gl-vm-host-4
Status of volume: gl-vm-host-4
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.20.101.187:/gluster/gl-vm-host-col
o-4                                         49152     0          Y       33221

Task Status of Volume gl-vm-host-4
------------------------------------------------------------------------------
There are no active volume tasks

I also get a timeout error when doing a plain 'gluster volume status' (all volumes) on this node.
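
In case it helps anyone answering, this is what I was planning to run next to dig into the 'N/A' self-heal status. I haven't run any of it yet; the command names are just what I found in the Gluster docs, so I'm assuming they apply to the version we're on, and please tell me if any of them are unsafe to run while the VMs are paused:

gluster volume heal gl-colo-1 info          # list entries pending heal on the replicated volume
gluster volume heal gl-colo-1 info summary  # per-brick counts of entries needing heal
gluster volume status gl-colo-1             # check that the bricks and the self-heal daemon show Online = Y on this host
systemctl status glusterd                   # confirm the management daemon is actually healthy on vm-host-colo-4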
So while some aspects of the Gluster volume seem fine, others don't. Should I restart the glusterd daemon, or will that mess things up? (I've put the steps I had sketched out at the bottom of this post, in case that helps.) I'm not sure whether this is a problem with the Gluster volume itself or with the host's ability to access the data for the VM disk, meaning a true I/O problem. There are actually two VMs in this state, both running on this host, and I'm not sure how to proceed to get them running again. Should I force these VMs onto a different host by editing the VM, or should I try to make them work on the host they're on? As mentioned, many other VMs are running on this host, so I'm not sure why these two have an issue.

Apologies up front: I am a network engineer, not a VM/oVirt expert. This was dropped in my lap due to a layoff, and I could use some help on where to go from here. Thanks in advance for any help.
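
P.S. Here is the restart/log-check sequence I mentioned above, for reference. My (possibly wrong) understanding is that restarting glusterd only bounces the management daemon and leaves the brick processes alone, but I'd like confirmation before I touch anything. The log paths below are the Gluster and VDSM defaults as far as I can tell, so correct me if they're different on these hosts:

systemctl restart glusterd                          # management daemon only, on vm-host-colo-4
tail -n 100 /var/log/glusterfs/glusterd.log         # then look for errors from glusterd
tail -n 100 /var/log/glusterfs/glustershd.log       # and from the self-heal daemon, since that's the part showing 'N/A'
grep -i error /var/log/vdsm/vdsm.log | tail -n 50   # plus whatever vdsm logged when the two VMs got paused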