[ovirt-users] New user intro & some questions
George Skorup
george at mwcomm.com
Fri Jan 30 02:13:30 UTC 2015
Hello oVirt Users Community,
I've been working with Red Hat, RHEL and its clones for about 11 years,
though I still consider myself an amateur, mostly because I'm more of a
networking guy. :) One-man IT department, so I get very little time to
tinker.
I'm evaluating oVirt (because the boss said no to VMware) and will
likely begin implementation soon to virtualize our datacenter. So I have
a SuperMicro Twin2 (4 nodes) system and a cheap managed L2+ switch to
use for now. Dual 6-core Xeons and 24GB per node. The two on-board
82574Ls are bonded with 802.3ad, no issues there (so far). I currently have
two 1TB WD RE4 SATA drives configured as RAID1 using the Intel RAID BIOS
in each node. I understand this is software RAID. That's all working
fine and I did this so that if a drive dies then I can still boot the
machine(s). I have a 500MB partition formatted as ext4 for /boot. A 48GB
ext4 for the root. 24GB for swap. And finally the rest (800-something
GB) is LVM and XFS for Gluster.
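For reference, the Gluster brick on each node was carved out roughly
like this (device, VG and mount names here are illustrative, not my
exact ones):

  # PV/VG/LV on the leftover partition of the Intel software RAID set
  pvcreate /dev/md126p4
  vgcreate vg_gluster /dev/md126p4
  lvcreate -l 100%FREE -n lv_brick1 vg_gluster
  # XFS with 512-byte inodes, per the usual Gluster brick recommendation
  mkfs.xfs -i size=512 /dev/vg_gluster/lv_brick1
  mkdir -p /gluster/brick1
  echo '/dev/vg_gluster/lv_brick1 /gluster/brick1 xfs defaults 0 0' >> /etc/fstab
  mount /gluster/brick1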
I've been following Jason Brooks' "Up and Running with oVirt" guides
(which are great, BTW!). I have the cluster up and running with CentOS 7
and oVirt 3.5, hosted-engine on CentOS 6.6 and CTDB to host a virtual IP
for the engine NFS mount. There are a couple test VMs running along with
the engine on various nodes. I found it interesting that I was able to
upload a ripped ISO of Win 2k3 Enterprise (pre-SP2) and boot it
successfully, after which I promptly installed SP2 and the oVirt guest
tools. I do very little with Windows, but there's always that one
remaining customer that needs IIS, and we're not about to buy a new
Windows Server 2012 license just for them.
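As for the CTDB piece mentioned above, the config is nothing fancy;
roughly this (the IPs and lock path are illustrative, not my real ones):

  # /etc/ctdb/nodes -- one internal IP per host, identical on all four nodes
  10.10.10.1
  10.10.10.2
  10.10.10.3
  10.10.10.4

  # /etc/ctdb/public_addresses -- the floating IP the engine NFS mount uses
  10.10.10.100/24 ovirtmgmt

  # /etc/sysconfig/ctdb -- lock file lives on the replicated Gluster volume
  CTDB_RECOVERY_LOCK=/gluster/lock/lockfile
  CTDB_NODES=/etc/ctdb/nodes
  CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses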
So anyway, I'm having a problem with node reboots. They simply will not
shut down and reboot cleanly. Instead, it looks like they hang after all
processes are shut down, or at least attempted to be shut down. Then
after a couple of minutes, the hardware watchdog resets the system. I've
come to the conclusion that sanlock and/or wdmd is causing the hang.
I'm guessing an active but non-responsive NFS mount is the culprit,
possibly the ISO domain NFS mount which is on the engine? I've tried
manually shutting down all oVirt, VDSM, etc. processes, unmounting all
NFS shares, but it seems sanlock still has a hold on something in
/rhev/.. I've Googled a bit and have come across posts about this as
well. Any tips here?
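For what it's worth, this is roughly the sequence I've been trying by
hand before rebooting a node (CentOS 7 service names; paths from memory,
so treat it as a sketch rather than my exact commands):

  # see which lockspaces/resources sanlock still thinks it holds
  sanlock client status
  # stop the oVirt/VDSM side first
  systemctl stop ovirt-ha-agent ovirt-ha-broker vdsmd supervdsmd
  # lazy-unmount the storage domain mounts
  umount -l /rhev/data-center/mnt/*
  # ask sanlock to drop everything and exit, then stop the watchdog daemon
  sanlock client shutdown -f 1
  systemctl stop wdmd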
Then I experienced something else odd yesterday. I did a yum update for
the glibc vulnerability stuff. Gluster was updated as well which really
threw a wrench into things because I wasn't paying attention and quorum
broke, etc. I got that fixed. Rebooted all nodes (which is when I found
the sanlock/watchdog problem). Nodes 2, 3 and 4 came back up, but node1
did not. I logged into the IPKVM console and found that it had no
network configuration. All /etc/sysconfig/network-scripts/ifcfg-* files
were gone. I was able to manually reconfigure the physical interfaces,
set the bonding back up and add the ovirtmgmt bridge. But then the
engine reported the host as non-operational due to '..does not comply
with cluster default networks... ovirtmgmt missing', which I was able to
resolve by reconfiguring the host's network config in the engine GUI,
and all is now well. I'm just curious how/why the ifcfg files were wiped
out? I haven't touched the network config on any hosts since running
hosted-engine --deploy.
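For the record, what I put back by hand on node1 looked roughly like
this (NIC names, IPs and the gateway are illustrative):

  # /etc/sysconfig/network-scripts/ifcfg-eth0 (ifcfg-eth1 the same, apart from DEVICE)
  DEVICE=eth0
  ONBOOT=yes
  SLAVE=yes
  MASTER=bond0
  NM_CONTROLLED=no

  # /etc/sysconfig/network-scripts/ifcfg-bond0
  DEVICE=bond0
  ONBOOT=yes
  BONDING_OPTS="mode=802.3ad miimon=100"
  BRIDGE=ovirtmgmt
  NM_CONTROLLED=no

  # /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt
  DEVICE=ovirtmgmt
  TYPE=Bridge
  ONBOOT=yes
  BOOTPROTO=none
  IPADDR=10.10.10.1
  NETMASK=255.255.255.0
  GATEWAY=10.10.10.254
  NM_CONTROLLED=no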
Please forgive my ignorance and point me to the correct place if these
issues have been discussed and/or resolved already.
And overall I'm very much liking oVirt, especially as a viable and
cost-effective alternative to vSphere.
Thanks,
George