
On 28-04-2015 18:14, Nir Soffer wrote:
The DC's master storage domain is on an (unrecoverable) storage backed by a remote, dead host. The engine automatically sets another storage domain as the "Data (Master)", but seconds later the unrecoverable storage is marked as "Data (Master)" again. There is no way to start the data center.
Both storages are gluster. The old (unrecoverable) one worked fine as a master. This may be related to this bug: https://bugzilla.redhat.com/1183977.
OK. I added a comment to the BZ explaining the issue in more detail.
Are you using the latest engine?
Yes, ovirt-engine-3.6.0-0.0.master.20150427175110.git61dec8c.el7.centos.noarch
Any hint? If one gluster node dies and this brings down your data center, your gluster is probably not set up correctly. With proper replication, everything should keep working after a storage node dies.
Right, in theory vdsm, ovirt-engine and gluster should all be stable enough that the Master Storage Domain is always alive. Besides, oVirt DC admins should know that a Master Storage Domain cannot be removed or firewalled off from the DC without losing the whole DC. From another point of view, oVirt should be rock solid even when the Master Storage Domain goes down: it should not rely on a single SD but choose another available SD as the new master SD, and that is the way it seems to be implemented (though it does not always work).
Expected result: the alive SD becomes the new MSD and the DC is reactivated.
Issue: the engine tries to set the alive SD as the new MSD but fails, without giving a reason.
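To make that concrete, here is a minimal sketch (not the engine's own logic) that asks the REST API which attached storage domain currently carries the master flag. The engine URL, credentials and data center name are placeholders, and the element names assume the 3.6-era v3 API's XML representation:

import requests
import xml.etree.ElementTree as ET

ENGINE = "https://engine.example.com/ovirt-engine/api"  # placeholder engine URL
AUTH = ("admin@internal", "password")                    # placeholder credentials
DC_NAME = "Default"                                      # placeholder data center name

s = requests.Session()
s.auth = AUTH
s.verify = False  # lab setup with a self-signed certificate

# Find the data center by name.
dcs = ET.fromstring(s.get(ENGINE + "/datacenters").content)
dc_id = next(dc.get("id") for dc in dcs.findall("data_center")
             if dc.findtext("name") == DC_NAME)

# The attached-storage-domain view carries the "master" flag and a status.
sds = ET.fromstring(s.get(ENGINE + "/datacenters/%s/storagedomains" % dc_id).content)
for sd in sds.findall("storage_domain"):
    print("%-20s master=%-5s status=%s" % (
        sd.findtext("name"),
        sd.findtext("master"),
        sd.findtext("status/state")))

Running it before and after killing the storage node should show whether the master flag really moved to the surviving SD or flipped back to the dead one.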
Please check this for the recommended configuration: http://www.ovirt.org/Gluster_Storage_Domain_Reference
Thanks. Yes, we are using replica 3 in "production". In our lab, funny things happen all the time with the master nightly builds and the latest gluster builds, but that helps us test and fix issues as they appear and generate extreme test cases that make oVirt more robust.
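For reference, a quick sketch (assuming the gluster CLI is available on the host; the volume names are placeholders) that checks whether the volumes backing the storage domains are actually replica 3, as the reference page recommends:

import subprocess

VOLUMES = ["data", "master"]  # placeholder gluster volume names

for vol in VOLUMES:
    out = subprocess.check_output(["gluster", "volume", "info", vol]).decode()
    # "gluster volume info" prints "Key: Value" lines; keep the ones we need.
    info = dict(line.split(":", 1) for line in out.splitlines() if ":" in line)
    vol_type = info.get("Type", "").strip()
    bricks = info.get("Number of Bricks", "").strip()  # e.g. "1 x 3 = 3"
    ok = "Replicate" in vol_type and "x 3 =" in bricks
    print("%-10s type=%-22s bricks=%-12s %s" % (
        vol, vol_type, bricks, "OK" if ok else "NOT replica 3"))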
Regards, Chris