Hi,
So, on Thursday the worst-case scenario occurred. All hosts in the 4-node cluster we've
had these issues with went unresponsive and started looping through various states.
Spread across these hosts are the 45 guests, each with its own storage domain (45 in total). As we have
a responsibility to the end users, we had to make the decision to stop trying to bring
this cluster online and scrap it based on the information you've provided. We've
now split the cluster in half and created two clusters with the guests spread between them
(around 20 on each). I've also taken the step of starting to present a few 2 TB
storage domains and am migrating the guest disks from their individual storage domains
onto grouped shared domains.
This immediately halves the number of storage domains on each cluster and will
reduce it further as we consolidate the storage. We obviously still have the same number
of guest disks, so we will still have a large number of logical volumes; we just reduce the
number of physical volumes presented to each host (and storage domains within oVirt).
We'll just have to see if that improves things.
Thanks for your assistance and focus on the problem, and I'm glad we helped squash at
least one bug. I would have liked to actually get to the bottom of the problem with that
specific cluster, but events took a turn for the worse and forced our hand.
At the moment the clusters are both behaving but it's early days yet. We haven't
changed any of the iSCSI settings on the new clusters but we have kept the modified
monitor.py.
Regards,
Mark