Hello Tommy,

I had a similar experience: after trying to recover my storage domain, I realized that my VMs were missing. You should verify whether your VM disks are still inside your storage domain. In my case, I had to add a new storage domain as the master domain in order to remove the old VMs from the DB and reattach the old storage domain. I hope this is not your case; if you haven't lost your VMs, it should be possible to recover them.

Good luck,
Juanjo.

On Wed, Apr 24, 2013 at 6:43 AM, Tommy McNeely <tommythekid@gmail.com> wrote:
We had a hard crash (network, then power) on our 2-node oVirt cluster. We have an NFS datastore on CentOS 6 (3.2.0-1.39.el6). We can no longer get the hosts to activate: they are unable to activate the "master" domain. The master storage domain shows "Locked", while the other storage domains show "Unknown" (disks) and "Inactive" (ISO). All the domains are on the same NFS server; we are able to mount it, and the permissions are good. We believe we might be getting bitten by https://bugzilla.redhat.com/show_bug.cgi?id=920694 or http://gerrit.ovirt.org/#/c/13709/, which says to cease working on it:
Michael Kublin Apr 10 Patch Set 5: Do not submit
Liron, please abondon this work. This interacts with host life cycle which will be changed, during a change a following problem will be solved as well.
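The "Cannot find master domain" failure in the logs below means VDSM could not match any attached domain against the expected msdUUID. One thing that can be checked by hand on the NFS export is the domain's own metadata. A minimal sketch, assuming the usual NFS storage-domain layout where `<sd-uuid>/dom_md/metadata` holds KEY=VALUE pairs such as SDUUID, POOL_UUID, and ROLE (verify this layout against your installation; the sample text below just reuses the UUIDs from the logs):

```python
def parse_domain_metadata(text):
    """Parse the KEY=VALUE pairs found in a dom_md/metadata file."""
    meta = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        meta[key] = value
    return meta

def looks_like_master(meta, expected_sd_uuid, expected_pool_uuid):
    """Check the fields a master-domain lookup would need to agree on."""
    return (meta.get("SDUUID") == expected_sd_uuid
            and meta.get("POOL_UUID") == expected_pool_uuid
            and meta.get("ROLE") == "Master")

# Sample metadata text, standing in for <mount>/<sd-uuid>/dom_md/metadata:
sample = """\
CLASS=Data
ROLE=Master
SDUUID=774e3604-f449-4b3e-8c06-7cd16f98720c
POOL_UUID=0f63de0e-7d98-48ce-99ec-add109f83c4f
MASTER_VERSION=73
"""
meta = parse_domain_metadata(sample)
print(looks_like_master(meta,
                        "774e3604-f449-4b3e-8c06-7cd16f98720c",
                        "0f63de0e-7d98-48ce-99ec-add109f83c4f"))
```

If the SDUUID on disk does not match the msdUUID the engine is asking for, or the ROLE is no longer Master, that would explain the lookup failure; this is only a diagnostic sketch, not a fix.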
So, we were wondering what we can do to get our oVirt back online, or rather what the correct way is to solve this. We have a few VMs that are down, and we are looking for ways to recover them as quickly as possible.

Thanks in advance,
Tommy

Here are the ovirt-engine logs:

2013-04-23 21:30:04,041 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-3-thread-49) Command ConnectStoragePoolVDS execution failed. Exception: IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=0f63de0e-7d98-48ce-99ec-add109f83c4f, msdUUID=774e3604-f449-4b3e-8c06-7cd16f98720c'
2013-04-23 21:30:04,043 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (pool-3-thread-49) FINISH, ConnectStoragePoolVDSCommand, log id: 50524b34
2013-04-23 21:30:04,049 WARN [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (pool-3-thread-49) [7c5867d6] CanDoAction of action ReconstructMasterDomain failed. Reasons: VAR__ACTION__RECONSTRUCT_MASTER, VAR__TYPE__STORAGE__DOMAIN, ACTION_TYPE_FAILED_STORAGE_DOMAIN_STATUS_ILLEGAL2, $status Locked

Here are the logs from vdsm:

Thread-29::DEBUG::2013-04-23 21:36:05,906::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 10.101.0.148:/c/vpt1-vmdisks1 /rhev/data-center/mnt/10.101.0.148:_c_vpt1-vmdisks1' (cwd None)
Thread-29::DEBUG::2013-04-23 21:36:06,008::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 10.101.0.148:/c/vpool-iso /rhev/data-center/mnt/10.101.0.148:_c_vpool-iso' (cwd None)
Thread-29::INFO::2013-04-23 21:36:06,065::logUtils::44::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': '7c19bd42-c3dc-41b9-b81b-d9b75214b8dc'}, {'status': 0, 'id': 'eff2ef61-0b12-4429-b087-8742be17ae90'}]}
Thread-29::DEBUG::2013-04-23 21:36:06,071::task::1151::TaskManager.Task::(prepare) Task=`48337e40-2446-4357-b6dc-2c86f4da67e2`::finished: {'statuslist': [{'status': 0, 'id': '7c19bd42-c3dc-41b9-b81b-d9b75214b8dc'}, {'status': 0, 'id': 'eff2ef61-0b12-4429-b087-8742be17ae90'}]}
Thread-29::DEBUG::2013-04-23 21:36:06,071::task::568::TaskManager.Task::(_updateState) Task=`48337e40-2446-4357-b6dc-2c86f4da67e2`::moving from state preparing -> state finished
Thread-29::DEBUG::2013-04-23 21:36:06,071::resourceManager::830::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-29::DEBUG::2013-04-23 21:36:06,072::resourceManager::864::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-29::DEBUG::2013-04-23 21:36:06,072::task::957::TaskManager.Task::(_decref) Task=`48337e40-2446-4357-b6dc-2c86f4da67e2`::ref 0 aborting False
Thread-30::DEBUG::2013-04-23 21:36:06,112::BindingXMLRPC::161::vds::(wrapper) [10.101.0.197]
Thread-30::DEBUG::2013-04-23 21:36:06,112::task::568::TaskManager.Task::(_updateState) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::moving from state init -> state preparing
Thread-30::INFO::2013-04-23 21:36:06,113::logUtils::41::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='0f63de0e-7d98-48ce-99ec-add109f83c4f', hostID=1, scsiKey='0f63de0e-7d98-48ce-99ec-add109f83c4f', msdUUID='774e3604-f449-4b3e-8c06-7cd16f98720c', masterVersion=73, options=None)
Thread-30::DEBUG::2013-04-23 21:36:06,113::resourceManager::190::ResourceManager.Request::(__init__) ResName=`Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f`ReqID=`ee74329a-0a92-465a-be50-b8acc6d7246a`::Request was made in '/usr/share/vdsm/storage/resourceManager.py' line '189' at '__init__'
Thread-30::DEBUG::2013-04-23 21:36:06,114::resourceManager::504::ResourceManager::(registerResource) Trying to register resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f' for lock type 'exclusive'
Thread-30::DEBUG::2013-04-23 21:36:06,114::resourceManager::547::ResourceManager::(registerResource) Resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f' is free. Now locking as 'exclusive' (1 active user)
Thread-30::DEBUG::2013-04-23 21:36:06,114::resourceManager::227::ResourceManager.Request::(grant) ResName=`Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f`ReqID=`ee74329a-0a92-465a-be50-b8acc6d7246a`::Granted request
Thread-30::INFO::2013-04-23 21:36:06,115::sp::625::Storage.StoragePool::(connect) Connect host #1 to the storage pool 0f63de0e-7d98-48ce-99ec-add109f83c4f with master domain: 774e3604-f449-4b3e-8c06-7cd16f98720c (ver = 73)
Thread-30::DEBUG::2013-04-23 21:36:06,116::lvm::477::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:06,116::lvm::479::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:06,117::lvm::488::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' got the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:06,117::lvm::490::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' released the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:06,117::lvm::508::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:06,118::lvm::510::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:06,118::misc::1054::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc.refreshStorage)
Thread-30::DEBUG::2013-04-23 21:36:06,118::misc::1056::SamplingMethod::(__call__) Got in to sampling method
Thread-30::DEBUG::2013-04-23 21:36:06,119::misc::1054::SamplingMethod::(__call__) Trying to enter sampling method (storage.iscsi.rescan)
Thread-30::DEBUG::2013-04-23 21:36:06,119::misc::1056::SamplingMethod::(__call__) Got in to sampling method
Thread-30::DEBUG::2013-04-23 21:36:06,119::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/iscsiadm -m session -R' (cwd None)
Thread-30::DEBUG::2013-04-23 21:36:06,136::misc::84::Storage.Misc.excCmd::(<lambda>) FAILED: <err> = 'iscsiadm: No session found.\n'; <rc> = 21
Thread-30::DEBUG::2013-04-23 21:36:06,136::misc::1064::SamplingMethod::(__call__) Returning last result
MainProcess|Thread-30::DEBUG::2013-04-23 21:36:06,139::misc::84::Storage.Misc.excCmd::(<lambda>) '/bin/dd of=/sys/class/scsi_host/host0/scan' (cwd None)
MainProcess|Thread-30::DEBUG::2013-04-23 21:36:06,142::misc::84::Storage.Misc.excCmd::(<lambda>) '/bin/dd of=/sys/class/scsi_host/host1/scan' (cwd None)
MainProcess|Thread-30::DEBUG::2013-04-23 21:36:06,146::misc::84::Storage.Misc.excCmd::(<lambda>) '/bin/dd of=/sys/class/scsi_host/host2/scan' (cwd None)
MainProcess|Thread-30::DEBUG::2013-04-23 21:36:06,149::iscsi::402::Storage.ISCSI::(forceIScsiScan) Performing SCSI scan, this will take up to 30 seconds
Thread-30::DEBUG::2013-04-23 21:36:08,152::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/multipath' (cwd None)
Thread-30::DEBUG::2013-04-23 21:36:08,254::misc::84::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = ''; <rc> = 0
Thread-30::DEBUG::2013-04-23 21:36:08,256::lvm::477::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:08,256::lvm::479::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:08,257::lvm::488::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' got the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:08,257::lvm::490::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' released the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:08,258::lvm::508::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:08,258::lvm::510::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:08,258::misc::1064::SamplingMethod::(__call__) Returning last result
Thread-30::DEBUG::2013-04-23 21:36:08,259::lvm::368::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:08,261::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \\"r%.*%\\" ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free 774e3604-f449-4b3e-8c06-7cd16f98720c' (cwd None)
Thread-30::DEBUG::2013-04-23 21:36:08,514::misc::84::Storage.Misc.excCmd::(<lambda>) FAILED: <err> = ' Volume group "774e3604-f449-4b3e-8c06-7cd16f98720c" not found\n'; <rc> = 5
Thread-30::WARNING::2013-04-23 21:36:08,516::lvm::373::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] [' Volume group "774e3604-f449-4b3e-8c06-7cd16f98720c" not found']
Thread-30::DEBUG::2013-04-23 21:36:08,518::lvm::397::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:08,524::resourceManager::557::ResourceManager::(releaseResource) Trying to release resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f'
Thread-30::DEBUG::2013-04-23 21:36:08,525::resourceManager::573::ResourceManager::(releaseResource) Released resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f' (0 active users)
Thread-30::DEBUG::2013-04-23 21:36:08,525::resourceManager::578::ResourceManager::(releaseResource) Resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f' is free, finding out if anyone is waiting for it.
Thread-30::DEBUG::2013-04-23 21:36:08,525::resourceManager::585::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f', Clearing records.
Thread-30::ERROR::2013-04-23 21:36:08,526::task::833::TaskManager.Task::(_setError) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 840, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 42, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 926, in connectStoragePool
    masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 973, in _connectStoragePool
    res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 642, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1166, in __rebuild
    self.masterDomain = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1505, in getMasterDomain
    raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
StoragePoolMasterNotFound: Cannot find master domain: 'spUUID=0f63de0e-7d98-48ce-99ec-add109f83c4f, msdUUID=774e3604-f449-4b3e-8c06-7cd16f98720c'
Thread-30::DEBUG::2013-04-23 21:36:08,527::task::852::TaskManager.Task::(_run) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Task._run: f551fa3f-9d8c-4de3-895a-964c821060d4 ('0f63de0e-7d98-48ce-99ec-add109f83c4f', 1, '0f63de0e-7d98-48ce-99ec-add109f83c4f', '774e3604-f449-4b3e-8c06-7cd16f98720c', 73) {} failed - stopping task
Thread-30::DEBUG::2013-04-23 21:36:08,528::task::1177::TaskManager.Task::(stop) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::stopping in state preparing (force False)
Thread-30::DEBUG::2013-04-23 21:36:08,528::task::957::TaskManager.Task::(_decref) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::ref 1 aborting True
Thread-30::INFO::2013-04-23 21:36:08,528::task::1134::TaskManager.Task::(prepare) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::aborting: Task is aborted: 'Cannot find master domain' - code 304
Thread-30::DEBUG::2013-04-23 21:36:08,529::task::1139::TaskManager.Task::(prepare) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Prepare: aborted: Cannot find master domain
Thread-30::DEBUG::2013-04-23 21:36:08,529::task::957::TaskManager.Task::(_decref) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::ref 0 aborting True
Thread-30::DEBUG::2013-04-23 21:36:08,529::task::892::TaskManager.Task::(_doAbort) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Task._doAbort: force False
Thread-30::DEBUG::2013-04-23 21:36:08,530::resourceManager::864::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-30::DEBUG::2013-04-23 21:36:08,530::task::568::TaskManager.Task::(_updateState) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::moving from state preparing -> state aborting
Thread-30::DEBUG::2013-04-23 21:36:08,530::task::523::TaskManager.Task::(__state_aborting) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::_aborting: recover policy none
Thread-30::DEBUG::2013-04-23 21:36:08,531::task::568::TaskManager.Task::(_updateState) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::moving from state aborting -> state failed
Thread-30::DEBUG::2013-04-23 21:36:08,531::resourceManager::830::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-30::DEBUG::2013-04-23 21:36:08,531::resourceManager::864::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-30::ERROR::2013-04-23 21:36:08,532::dispatcher::67::Storage.Dispatcher.Protect::(run) {'status': {'message': "Cannot find master domain: 'spUUID=0f63de0e-7d98-48ce-99ec-add109f83c4f, msdUUID=774e3604-f449-4b3e-8c06-7cd16f98720c'", 'code': 304}}

[root@vmserver3 vdsm]#
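Before any DB surgery of the kind Juanjo describes, it is worth confirming that the VM disk images are still present on the NFS export. A minimal sketch, assuming the usual `<mountpoint>/<sd-uuid>/images/<image-uuid>/` layout for file-based storage domains (the demo tree below is a throwaway stand-in for a real mount, and the image UUID is made up):

```python
import os
import tempfile

def list_disk_images(domain_root):
    """Return the image UUID directories under <domain_root>/images, sorted."""
    images_dir = os.path.join(domain_root, "images")
    if not os.path.isdir(images_dir):
        return []
    return sorted(d for d in os.listdir(images_dir)
                  if os.path.isdir(os.path.join(images_dir, d)))

# Demo against a throwaway tree standing in for the mounted domain:
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "images",
                         "0aa4fb10-1111-2222-3333-444455556666"))
print(list_disk_images(root))
```

If this turns up the expected image directories on the export, the data is likely intact and the problem is confined to the pool/master bookkeeping; an empty result would be the worst-case scenario Juanjo warns about.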
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users