non-operational host issues following 4.2 upgrade

Hi all,

I upgraded my 4 host converged gluster/ovirt lab setup to 4.2 yesterday, and now 3 of my hosts won't connect to my main data domain, so they're non-operational when I try to activate them.

Here's what seems like a relevant passage of vdsm.log: https://paste.fedoraproject.org/paste/JZuxul6-HZjjl8uHzgqL-w

The hosts can mount the gluster storage just fine, I can mount to a test location on the hosts, and I can see that the hosts are mounting the storage in the usual place when they attempt to activate. Permissions look normal, too.

I undeployed the hosted engine from the three problem machines, in case that was causing an issue.

The hosts are running centos 7.

Does any of this ring a bell for anyone?

Thanks, Jason

2017-12-21 4:26 GMT+01:00 Jason Brooks <jbrooks@redhat.com>:
> Hi all, I upgraded my 4 host converged gluster/ovirt lab setup to 4.2 yesterday, and now 3 of my hosts won't connect to my main data domain, so they're non-operational when I try to activate them.
> Here's what seems like a relevant passage of vdsm.log: https://paste.fedoraproject.org/paste/JZuxul6-HZjjl8uHzgqL-w

Adding some relevant developers. Jason, do you mind opening a bug on https://bugzilla.redhat.com/enter_bug.cgi?product=vdsm to track this?

> The hosts can mount the gluster storage just fine, I can mount to a test location on the hosts, and I can see that the hosts are mounting the storage in the usual place when they attempt to activate. Permissions look normal, too.
> I undeployed the hosted engine from the three problem machines, in case that was causing an issue.
> The hosts are running centos 7.
> Does any of this ring a bell for anyone?
> Thanks, Jason
> _______________________________________________
> Users mailing list
> Users@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users
--
Sandro Bonazzola
Associate Manager, Software Engineering, EMEA ENG Virtualization R&D
Red Hat EMEA <https://www.redhat.com/>

On Wed, Dec 20, 2017 at 11:47 PM, Sandro Bonazzola <sbonazzo@redhat.com> wrote:
> 2017-12-21 4:26 GMT+01:00 Jason Brooks <jbrooks@redhat.com>:
>> Hi all, I upgraded my 4 host converged gluster/ovirt lab setup to 4.2 yesterday, and now 3 of my hosts won't connect to my main data domain, so they're non-operational when I try to activate them.
>> Here's what seems like a relevant passage of vdsm.log: https://paste.fedoraproject.org/paste/JZuxul6-HZjjl8uHzgqL-w
> Adding some relevant developers. Jason, do you mind opening a bug on https://bugzilla.redhat.com/enter_bug.cgi?product=vdsm to track this?

I filed an issue here: https://bugzilla.redhat.com/show_bug.cgi?id=1528391

I was able to get my hosts active.

During the upgrade, my master data domain's metadata was corrupted: I had duplicates of some of the dom_md files, and my metadata file was corrupt. Vdsm was looking at that metadata file and throwing up its hands. I added a new data domain, but it couldn't take over as master because my old data domain was messed up. I ended up creating a new metadata file in that domain, and my hosts came up.

It might be nice to have some way of resetting corrupt metadata, or at least of making the error clearer.

I did have a gluster hiccup during the upgrade: the upgrade brought my gluster version from 3.8 to 3.12, and the other peers in the cluster refused connections from my first upgraded host. I upgraded all the others and got them all talking to each other again, but it may have been during that time that my master data domain metadata became corrupted.

I haven't noticed any issues w/ my VMs yet, and all through the migration travail, I was able to keep 5 important VMs running. They kept chugging away, even though their host and surrounding hosts were unhealthy.

Anyway, I'm back ;)

Jason
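
For anyone who lands here with similar symptoms, below is a rough, read-only sketch of the kind of sanity check that could surface the two problems described above (duplicate entries in dom_md and an unreadable metadata file). It is not a vdsm tool. The function name `check_dom_md` is made up for illustration, and it assumes that the metadata file of a file-based domain is plain text made of KEY=VALUE lines; verify that against your own setup before trusting the output. The script only reads; it never writes to the domain.

```python
import os
import sys
from collections import Counter


def check_dom_md(dom_md_dir, metadata_name="metadata"):
    """Inspect a dom_md directory without modifying it.

    Hypothetical helper, not part of vdsm. Assumes the metadata file
    consists of KEY=VALUE lines (an assumption, not confirmed here).
    """
    names = os.listdir(dom_md_dir)
    # A healthy directory listing never repeats a name; repeats hint at
    # the "duplicate files" symptom described in the thread.
    duplicates = sorted(n for n, count in Counter(names).items() if count > 1)

    malformed = []
    meta_path = os.path.join(dom_md_dir, metadata_name)
    with open(meta_path, "rb") as f:
        text = f.read().decode("utf-8", errors="replace")
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        key, sep, _value = line.partition("=")
        if not sep or not key.strip():
            malformed.append((lineno, line))

    return {
        "entries": sorted(names),
        "duplicates": duplicates,
        "malformed": malformed,
    }


if __name__ == "__main__":
    # Usage: python check_dom_md.py /path/to/<domain>/dom_md
    if len(sys.argv) != 2:
        sys.exit("usage: check_dom_md.py DOM_MD_DIR")
    report = check_dom_md(sys.argv[1])
    print("entries:   ", report["entries"])
    print("duplicates:", report["duplicates"])
    for lineno, line in report["malformed"]:
        print("malformed line %d: %r" % (lineno, line))
```

An empty "duplicates" list and no malformed lines only mean the file looks structurally sane under the assumption above; it does not prove that vdsm will accept the metadata.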
Participants (2): Jason Brooks, Sandro Bonazzola