Hi Juan,

That sounds like a possible path forward. Our "master" domain does not have any VMs in it. If no one else responds with an official path to resolution, I will try going into the database and hacking it that way. I suspect it has something to do with the version or the metadata: the metadata file (below) shows MASTER_VERSION=1, while the vdsm logs quoted further down show the engine trying to connect with masterVersion=73.

[root@vmserver3 dom_md]# cat metadata 
CLASS=Data
DESCRIPTION=SFOTestMaster1
IOOPTIMEOUTSEC=10
LEASERETRIES=3
LEASETIMESEC=60
LOCKPOLICY=
LOCKRENEWALINTERVALSEC=5
MASTER_VERSION=1
POOL_DESCRIPTION=SFODC01
POOL_DOMAINS=774e3604-f449-4b3e-8c06-7cd16f98720c:Active,758c0abb-ea9a-43fb-bcd9-435f75cd0baa:Active,baa42b1c-ae2e-4486-88a1-e09e1f7a59cb:Active
POOL_SPM_ID=1
POOL_SPM_LVER=4
POOL_UUID=0f63de0e-7d98-48ce-99ec-add109f83c4f
REMOTE_PATH=10.101.0.148:/c/vpt1-master
ROLE=Master
SDUUID=774e3604-f449-4b3e-8c06-7cd16f98720c
TYPE=NFS
VERSION=0
_SHA_CKSUM=fa8ef0e7cd5e50e107384a146e4bfc838d24ba08
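If I do end up hand-editing this file, I will also need to regenerate the _SHA_CKSUM line. From a quick read of vdsm's persistentDict.py, the checksum looks like a sha1 over the KEY=VALUE lines (excluding the checksum line itself) concatenated with the newlines stripped, but I have not verified that, so treat this as a sketch and check it against the vdsm source on your hosts before trusting it:

```python
import hashlib

def metadata_checksum(lines):
    """Recompute _SHA_CKSUM for a dom_md/metadata file.

    Assumption (unverified): vdsm hashes the KEY=VALUE lines in file
    order, excluding the _SHA_CKSUM line itself, with newlines stripped.
    """
    csum = hashlib.sha1()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("_SHA_CKSUM"):
            continue  # the checksum line is not part of its own hash
        csum.update(line.encode("ascii"))
    return csum.hexdigest()

# Example with a few of the fields from the metadata file above:
sample = [
    "CLASS=Data",
    "MASTER_VERSION=1",
    "ROLE=Master",
    "_SHA_CKSUM=fa8ef0e7cd5e50e107384a146e4bfc838d24ba08",
]
print(metadata_checksum(sample))  # 40-character hex digest
```

An easy sanity check: run it over the real metadata file first. If the digest matches the existing _SHA_CKSUM value, the guess about the algorithm is right, and the same function can regenerate the checksum after bumping MASTER_VERSION.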


On Wed, Apr 24, 2013 at 5:57 AM, Juan Jose <jj197005@gmail.com> wrote:
Hello Tommy,

I had a similar experience, and after trying to recover my storage domain, I realized that my VMs were gone. You should verify whether your VM disks are still inside your storage domain. In my case, I had to add a new storage domain as the master domain in order to remove the old VMs from the DB and reattach the old storage domain. I hope this is not your case; if you haven't lost your VMs, it should be possible to recover them.

Good luck,

Juanjo.


On Wed, Apr 24, 2013 at 6:43 AM, Tommy McNeely <tommythekid@gmail.com> wrote:

We had a hard crash (network, then power) on our 2-node oVirt cluster, with an NFS datastore on CentOS 6 (3.2.0-1.39.el6). We can no longer get the hosts to activate; they are unable to activate the "master" storage domain. The master storage domain shows "Locked", while the other storage domains show "Unknown" (disks) and "Inactive" (ISO). All the domains are on the same NFS server, we are able to mount it, and the permissions are good. We believe we might be getting bitten by https://bugzilla.redhat.com/show_bug.cgi?id=920694 or http://gerrit.ovirt.org/#/c/13709/, which was abandoned with this comment:

Michael Kublin Apr 10

Patch Set 5: Do not submit

Liron, please abondon this work. This interacts with host life cycle which will be changed, during a change a following problem will be solved as well.



So, we were wondering what we can do to get our oVirt cluster back online, or rather what the correct way to solve this is. We have a few VMs down that we would like to recover as quickly as possible.

Thanks in advance,
Tommy

Here are the ovirt-engine logs:

2013-04-23 21:30:04,041 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-3-thread-49) Command ConnectStoragePoolVDS execution failed. Exception: IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=0f63de0e-7d98-48ce-99ec-add109f83c4f, msdUUID=774e3604-f449-4b3e-8c06-7cd16f98720c'
2013-04-23 21:30:04,043 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (pool-3-thread-49) FINISH, ConnectStoragePoolVDSCommand, log id: 50524b34
2013-04-23 21:30:04,049 WARN  [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (pool-3-thread-49) [7c5867d6] CanDoAction of action ReconstructMasterDomain failed. Reasons:VAR__ACTION__RECONSTRUCT_MASTER,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_STORAGE_DOMAIN_STATUS_ILLEGAL2,$status Locked



Here are the logs from vdsm:

Thread-29::DEBUG::2013-04-23 21:36:05,906::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 10.101.0.148:/c/vpt1-vmdisks1 /rhev/data-center/mnt/10.101.0.148:_c_vpt1-vmdisks1' (cwd None)
Thread-29::DEBUG::2013-04-23 21:36:06,008::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 10.101.0.148:/c/vpool-iso /rhev/data-center/mnt/10.101.0.148:_c_vpool-iso' (cwd None)
Thread-29::INFO::2013-04-23 21:36:06,065::logUtils::44::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': '7c19bd42-c3dc-41b9-b81b-d9b75214b8dc'}, {'status': 0, 'id': 'eff2ef61-0b12-4429-b087-8742be17ae90'}]}
Thread-29::DEBUG::2013-04-23 21:36:06,071::task::1151::TaskManager.Task::(prepare) Task=`48337e40-2446-4357-b6dc-2c86f4da67e2`::finished: {'statuslist': [{'status': 0, 'id': '7c19bd42-c3dc-41b9-b81b-d9b75214b8dc'}, {'status': 0, 'id': 'eff2ef61-0b12-4429-b087-8742be17ae90'}]}
Thread-29::DEBUG::2013-04-23 21:36:06,071::task::568::TaskManager.Task::(_updateState) Task=`48337e40-2446-4357-b6dc-2c86f4da67e2`::moving from state preparing -> state finished
Thread-29::DEBUG::2013-04-23 21:36:06,071::resourceManager::830::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-29::DEBUG::2013-04-23 21:36:06,072::resourceManager::864::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-29::DEBUG::2013-04-23 21:36:06,072::task::957::TaskManager.Task::(_decref) Task=`48337e40-2446-4357-b6dc-2c86f4da67e2`::ref 0 aborting False
Thread-30::DEBUG::2013-04-23 21:36:06,112::BindingXMLRPC::161::vds::(wrapper) [10.101.0.197]
Thread-30::DEBUG::2013-04-23 21:36:06,112::task::568::TaskManager.Task::(_updateState) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::moving from state init -> state preparing
Thread-30::INFO::2013-04-23 21:36:06,113::logUtils::41::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='0f63de0e-7d98-48ce-99ec-add109f83c4f', hostID=1, scsiKey='0f63de0e-7d98-48ce-99ec-add109f83c4f', msdUUID='774e3604-f449-4b3e-8c06-7cd16f98720c', masterVersion=73, options=None)
Thread-30::DEBUG::2013-04-23 21:36:06,113::resourceManager::190::ResourceManager.Request::(__init__) ResName=`Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f`ReqID=`ee74329a-0a92-465a-be50-b8acc6d7246a`::Request was made in '/usr/share/vdsm/storage/resourceManager.py' line '189' at '__init__'
Thread-30::DEBUG::2013-04-23 21:36:06,114::resourceManager::504::ResourceManager::(registerResource) Trying to register resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f' for lock type 'exclusive'
Thread-30::DEBUG::2013-04-23 21:36:06,114::resourceManager::547::ResourceManager::(registerResource) Resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f' is free. Now locking as 'exclusive' (1 active user)
Thread-30::DEBUG::2013-04-23 21:36:06,114::resourceManager::227::ResourceManager.Request::(grant) ResName=`Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f`ReqID=`ee74329a-0a92-465a-be50-b8acc6d7246a`::Granted request
Thread-30::INFO::2013-04-23 21:36:06,115::sp::625::Storage.StoragePool::(connect) Connect host #1 to the storage pool 0f63de0e-7d98-48ce-99ec-add109f83c4f with master domain: 774e3604-f449-4b3e-8c06-7cd16f98720c (ver = 73)
Thread-30::DEBUG::2013-04-23 21:36:06,116::lvm::477::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:06,116::lvm::479::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:06,117::lvm::488::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' got the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:06,117::lvm::490::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' released the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:06,117::lvm::508::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:06,118::lvm::510::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:06,118::misc::1054::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc.refreshStorage)
Thread-30::DEBUG::2013-04-23 21:36:06,118::misc::1056::SamplingMethod::(__call__) Got in to sampling method
Thread-30::DEBUG::2013-04-23 21:36:06,119::misc::1054::SamplingMethod::(__call__) Trying to enter sampling method (storage.iscsi.rescan)
Thread-30::DEBUG::2013-04-23 21:36:06,119::misc::1056::SamplingMethod::(__call__) Got in to sampling method
Thread-30::DEBUG::2013-04-23 21:36:06,119::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/iscsiadm -m session -R' (cwd None)
Thread-30::DEBUG::2013-04-23 21:36:06,136::misc::84::Storage.Misc.excCmd::(<lambda>) FAILED: <err> = 'iscsiadm: No session found.\n'; <rc> = 21
Thread-30::DEBUG::2013-04-23 21:36:06,136::misc::1064::SamplingMethod::(__call__) Returning last result
MainProcess|Thread-30::DEBUG::2013-04-23 21:36:06,139::misc::84::Storage.Misc.excCmd::(<lambda>) '/bin/dd of=/sys/class/scsi_host/host0/scan' (cwd None)
MainProcess|Thread-30::DEBUG::2013-04-23 21:36:06,142::misc::84::Storage.Misc.excCmd::(<lambda>) '/bin/dd of=/sys/class/scsi_host/host1/scan' (cwd None)
MainProcess|Thread-30::DEBUG::2013-04-23 21:36:06,146::misc::84::Storage.Misc.excCmd::(<lambda>) '/bin/dd of=/sys/class/scsi_host/host2/scan' (cwd None)
MainProcess|Thread-30::DEBUG::2013-04-23 21:36:06,149::iscsi::402::Storage.ISCSI::(forceIScsiScan) Performing SCSI scan, this will take up to 30 seconds
Thread-30::DEBUG::2013-04-23 21:36:08,152::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/multipath' (cwd None)
Thread-30::DEBUG::2013-04-23 21:36:08,254::misc::84::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = ''; <rc> = 0
Thread-30::DEBUG::2013-04-23 21:36:08,256::lvm::477::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:08,256::lvm::479::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:08,257::lvm::488::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' got the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:08,257::lvm::490::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' released the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:08,258::lvm::508::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:08,258::lvm::510::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:08,258::misc::1064::SamplingMethod::(__call__) Returning last result
Thread-30::DEBUG::2013-04-23 21:36:08,259::lvm::368::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:08,261::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \\"r%.*%\\" ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free 774e3604-f449-4b3e-8c06-7cd16f98720c' (cwd None)
Thread-30::DEBUG::2013-04-23 21:36:08,514::misc::84::Storage.Misc.excCmd::(<lambda>) FAILED: <err> = '  Volume group "774e3604-f449-4b3e-8c06-7cd16f98720c" not found\n'; <rc> = 5
Thread-30::WARNING::2013-04-23 21:36:08,516::lvm::373::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] ['  Volume group "774e3604-f449-4b3e-8c06-7cd16f98720c" not found']
Thread-30::DEBUG::2013-04-23 21:36:08,518::lvm::397::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:08,524::resourceManager::557::ResourceManager::(releaseResource) Trying to release resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f'
Thread-30::DEBUG::2013-04-23 21:36:08,525::resourceManager::573::ResourceManager::(releaseResource) Released resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f' (0 active users)
Thread-30::DEBUG::2013-04-23 21:36:08,525::resourceManager::578::ResourceManager::(releaseResource) Resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f' is free, finding out if anyone is waiting for it.
Thread-30::DEBUG::2013-04-23 21:36:08,525::resourceManager::585::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f', Clearing records.
Thread-30::ERROR::2013-04-23 21:36:08,526::task::833::TaskManager.Task::(_setError) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 840, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 42, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 926, in connectStoragePool
    masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 973, in _connectStoragePool
    res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 642, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1166, in __rebuild
    self.masterDomain = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1505, in getMasterDomain
    raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
StoragePoolMasterNotFound: Cannot find master domain: 'spUUID=0f63de0e-7d98-48ce-99ec-add109f83c4f, msdUUID=774e3604-f449-4b3e-8c06-7cd16f98720c'
Thread-30::DEBUG::2013-04-23 21:36:08,527::task::852::TaskManager.Task::(_run) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Task._run: f551fa3f-9d8c-4de3-895a-964c821060d4 ('0f63de0e-7d98-48ce-99ec-add109f83c4f', 1, '0f63de0e-7d98-48ce-99ec-add109f83c4f', '774e3604-f449-4b3e-8c06-7cd16f98720c', 73) {} failed - stopping task
Thread-30::DEBUG::2013-04-23 21:36:08,528::task::1177::TaskManager.Task::(stop) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::stopping in state preparing (force False)
Thread-30::DEBUG::2013-04-23 21:36:08,528::task::957::TaskManager.Task::(_decref) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::ref 1 aborting True
Thread-30::INFO::2013-04-23 21:36:08,528::task::1134::TaskManager.Task::(prepare) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::aborting: Task is aborted: 'Cannot find master domain' - code 304
Thread-30::DEBUG::2013-04-23 21:36:08,529::task::1139::TaskManager.Task::(prepare) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Prepare: aborted: Cannot find master domain
Thread-30::DEBUG::2013-04-23 21:36:08,529::task::957::TaskManager.Task::(_decref) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::ref 0 aborting True
Thread-30::DEBUG::2013-04-23 21:36:08,529::task::892::TaskManager.Task::(_doAbort) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Task._doAbort: force False
Thread-30::DEBUG::2013-04-23 21:36:08,530::resourceManager::864::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-30::DEBUG::2013-04-23 21:36:08,530::task::568::TaskManager.Task::(_updateState) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::moving from state preparing -> state aborting
Thread-30::DEBUG::2013-04-23 21:36:08,530::task::523::TaskManager.Task::(__state_aborting) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::_aborting: recover policy none
Thread-30::DEBUG::2013-04-23 21:36:08,531::task::568::TaskManager.Task::(_updateState) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::moving from state aborting -> state failed
Thread-30::DEBUG::2013-04-23 21:36:08,531::resourceManager::830::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-30::DEBUG::2013-04-23 21:36:08,531::resourceManager::864::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-30::ERROR::2013-04-23 21:36:08,532::dispatcher::67::Storage.Dispatcher.Protect::(run) {'status': {'message': "Cannot find master domain: 'spUUID=0f63de0e-7d98-48ce-99ec-add109f83c4f, msdUUID=774e3604-f449-4b3e-8c06-7cd16f98720c'", 'code': 304}}
[root@vmserver3 vdsm]# 


_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users