Hi Juan,
That sounds like a possible path to follow. Our "master" domain does not
have any VMs in it. If no one else responds with an official path to
resolution, then I will try going into the database and hacking it like
that. I think it has something to do with the version or the metadata:
[root@vmserver3 dom_md]# cat metadata
CLASS=Data
DESCRIPTION=SFOTestMaster1
IOOPTIMEOUTSEC=10
LEASERETRIES=3
LEASETIMESEC=60
LOCKPOLICY=
LOCKRENEWALINTERVALSEC=5
MASTER_VERSION=1
POOL_DESCRIPTION=SFODC01
POOL_DOMAINS=774e3604-f449-4b3e-8c06-7cd16f98720c:Active,758c0abb-ea9a-43fb-bcd9-435f75cd0baa:Active,baa42b1c-ae2e-4486-88a1-e09e1f7a59cb:Active
POOL_SPM_ID=1
POOL_SPM_LVER=4
POOL_UUID=0f63de0e-7d98-48ce-99ec-add109f83c4f
REMOTE_PATH=10.101.0.148:/c/vpt1-master
ROLE=Master
SDUUID=774e3604-f449-4b3e-8c06-7cd16f98720c
TYPE=NFS
VERSION=0
_SHA_CKSUM=fa8ef0e7cd5e50e107384a146e4bfc838d24ba08
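
One thing that stands out: the metadata above says MASTER_VERSION=1, while the
connectStoragePool call in the vdsm log further down passes masterVersion=73.
Here is a minimal Python sketch for comparing the two. The function names are
my own (not part of vdsm), and whether this mismatch is the actual cause of
the failure is only a guess.

```python
def parse_domain_metadata(text):
    """Parse KEY=VALUE lines (as in dom_md/metadata) into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        # Split on the first '=' only; values such as REMOTE_PATH contain ':'
        key, _, value = line.partition("=")
        fields[key] = value
    return fields


def master_version_mismatch(fields, engine_master_version):
    """Return (on_disk, expected) if the versions differ, else None."""
    on_disk = int(fields["MASTER_VERSION"])
    if on_disk != engine_master_version:
        return on_disk, engine_master_version
    return None


# A few of the fields from the metadata file shown above.
sample = """\
CLASS=Data
MASTER_VERSION=1
ROLE=Master
SDUUID=774e3604-f449-4b3e-8c06-7cd16f98720c
"""

fields = parse_domain_metadata(sample)
# 73 is the masterVersion the engine passed in the vdsm log.
print(master_version_mismatch(fields, 73))
```

On a host you would read the real file instead of the sample string, for
example with open("metadata").read() from inside the dom_md directory.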
On Wed, Apr 24, 2013 at 5:57 AM, Juan Jose <jj197005(a)gmail.com> wrote:
Hello Tommy,
I had a similar experience and, after trying to recover my storage domain, I
realized that my VMs were missing. You have to verify that your VM disks are
inside your storage domain. In my case, I had to add a new storage
domain as the master domain to be able to remove the old VMs from the DB and
reattach the old storage domain. I hope this is not your case. If you
haven't lost your VMs, it's possible that you can recover them.
Good luck,
Juanjo.
On Wed, Apr 24, 2013 at 6:43 AM, Tommy McNeely <tommythekid(a)gmail.com> wrote:
>
> We had a hard crash (network, then power) on our 2 node oVirt cluster. We
> have an NFS datastore on CentOS 6 (3.2.0-1.39.el6). We can no longer get the
> hosts to activate. They are unable to activate the "master" domain. The
> master storage domain shows "locked" while the other storage domains show
> Unknown (disks) and inactive (ISO). All the domains are on the same NFS
> server; we are able to mount it and the permissions are good. We believe we
> might be getting bitten by
> https://bugzilla.redhat.com/show_bug.cgi?id=920694 or
> http://gerrit.ovirt.org/#/c/13709/ which says to cease working on it:
>
> Michael Kublin Apr 10
>
> Patch Set 5: Do not submit
>
> Liron, please abondon this work. This interacts with host life cycle
> which will be changed, during a change a following problem will be solved
> as well.
>
>
> So, we were wondering what we can do to get our oVirt back online, or
> rather what the correct way is to solve this. We have a few VMs that are
> down which we are looking for ways to recover as quickly as possible.
>
> Thanks in advance,
> Tommy
>
> Here are the ovirt-engine logs:
>
> 2013-04-23 21:30:04,041 ERROR
> [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-3-thread-49) Command
> ConnectStoragePoolVDS execution failed. Exception:
> IRSNoMasterDomainException: IRSGenericException: IRSErrorException:
> IRSNoMasterDomainException: Cannot find master domain:
> 'spUUID=0f63de0e-7d98-48ce-99ec-add109f83c4f,
> msdUUID=774e3604-f449-4b3e-8c06-7cd16f98720c'
> 2013-04-23 21:30:04,043 INFO
> [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand]
> (pool-3-thread-49) FINISH, ConnectStoragePoolVDSCommand, log id: 50524b34
> 2013-04-23 21:30:04,049 WARN
> [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand]
> (pool-3-thread-49) [7c5867d6] CanDoAction of action ReconstructMasterDomain
> failed.
> Reasons:VAR__ACTION__RECONSTRUCT_MASTER,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_STORAGE_DOMAIN_STATUS_ILLEGAL2,$status
> Locked
>
>
>
> Here are the logs from vdsm:
>
> Thread-29::DEBUG::2013-04-23
> 21:36:05,906::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n
> /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3
> 10.101.0.148:/c/vpt1-vmdisks1
> /rhev/data-center/mnt/10.101.0.148:_c_vpt1-vmdisks1'
> (cwd None)
> Thread-29::DEBUG::2013-04-23
> 21:36:06,008::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n
> /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3
> 10.101.0.148:/c/vpool-iso /rhev/data-center/mnt/10.101.0.148:_c_vpool-iso'
> (cwd None)
> Thread-29::INFO::2013-04-23
> 21:36:06,065::logUtils::44::dispatcher::(wrapper) Run and protect:
> connectStorageServer, Return response: {'statuslist': [{'status': 0,
> 'id': '7c19bd42-c3dc-41b9-b81b-d9b75214b8dc'}, {'status': 0, 'id':
> 'eff2ef61-0b12-4429-b087-8742be17ae90'}]}
> Thread-29::DEBUG::2013-04-23
> 21:36:06,071::task::1151::TaskManager.Task::(prepare)
> Task=`48337e40-2446-4357-b6dc-2c86f4da67e2`::finished: {'statuslist':
> [{'status': 0, 'id': '7c19bd42-c3dc-41b9-b81b-d9b75214b8dc'},
> {'status': 0, 'id': 'eff2ef61-0b12-4429-b087-8742be17ae90'}]}
> Thread-29::DEBUG::2013-04-23
> 21:36:06,071::task::568::TaskManager.Task::(_updateState)
> Task=`48337e40-2446-4357-b6dc-2c86f4da67e2`::moving from state preparing ->
> state finished
> Thread-29::DEBUG::2013-04-23
> 21:36:06,071::resourceManager::830::ResourceManager.Owner::(releaseAll)
> Owner.releaseAll requests {} resources {}
> Thread-29::DEBUG::2013-04-23
> 21:36:06,072::resourceManager::864::ResourceManager.Owner::(cancelAll)
> Owner.cancelAll requests {}
> Thread-29::DEBUG::2013-04-23
> 21:36:06,072::task::957::TaskManager.Task::(_decref)
> Task=`48337e40-2446-4357-b6dc-2c86f4da67e2`::ref 0 aborting False
> Thread-30::DEBUG::2013-04-23
> 21:36:06,112::BindingXMLRPC::161::vds::(wrapper) [10.101.0.197]
> Thread-30::DEBUG::2013-04-23
> 21:36:06,112::task::568::TaskManager.Task::(_updateState)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::moving from state init ->
> state preparing
> Thread-30::INFO::2013-04-23
> 21:36:06,113::logUtils::41::dispatcher::(wrapper) Run and protect:
> connectStoragePool(spUUID='0f63de0e-7d98-48ce-99ec-add109f83c4f', hostID=1,
> scsiKey='0f63de0e-7d98-48ce-99ec-add109f83c4f',
> msdUUID='774e3604-f449-4b3e-8c06-7cd16f98720c', masterVersion=73,
> options=None)
> Thread-30::DEBUG::2013-04-23
> 21:36:06,113::resourceManager::190::ResourceManager.Request::(__init__)
> ResName=`Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f`ReqID=`ee74329a-0a92-465a-be50-b8acc6d7246a`::Request
> was made in '/usr/share/vdsm/storage/resourceManager.py' line '189' at
> '__init__'
> Thread-30::DEBUG::2013-04-23
> 21:36:06,114::resourceManager::504::ResourceManager::(registerResource)
> Trying to register resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f'
> for lock type 'exclusive'
> Thread-30::DEBUG::2013-04-23
> 21:36:06,114::resourceManager::547::ResourceManager::(registerResource)
> Resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f' is free. Now
> locking as 'exclusive' (1 active user)
> Thread-30::DEBUG::2013-04-23
> 21:36:06,114::resourceManager::227::ResourceManager.Request::(grant)
> ResName=`Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f`ReqID=`ee74329a-0a92-465a-be50-b8acc6d7246a`::Granted
> request
> Thread-30::INFO::2013-04-23
> 21:36:06,115::sp::625::Storage.StoragePool::(connect) Connect host #1 to
> the storage pool 0f63de0e-7d98-48ce-99ec-add109f83c4f with master domain:
> 774e3604-f449-4b3e-8c06-7cd16f98720c (ver = 73)
> Thread-30::DEBUG::2013-04-23
> 21:36:06,116::lvm::477::OperationMutex::(_invalidateAllPvs) Operation 'lvm
> invalidate operation' got the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:06,116::lvm::479::OperationMutex::(_invalidateAllPvs) Operation 'lvm
> invalidate operation' released the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:06,117::lvm::488::OperationMutex::(_invalidateAllVgs) Operation 'lvm
> invalidate operation' got the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:06,117::lvm::490::OperationMutex::(_invalidateAllVgs) Operation 'lvm
> invalidate operation' released the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:06,117::lvm::508::OperationMutex::(_invalidateAllLvs) Operation 'lvm
> invalidate operation' got the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:06,118::lvm::510::OperationMutex::(_invalidateAllLvs) Operation 'lvm
> invalidate operation' released the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:06,118::misc::1054::SamplingMethod::(__call__) Trying to enter
> sampling method (storage.sdc.refreshStorage)
> Thread-30::DEBUG::2013-04-23
> 21:36:06,118::misc::1056::SamplingMethod::(__call__) Got in to sampling
> method
> Thread-30::DEBUG::2013-04-23
> 21:36:06,119::misc::1054::SamplingMethod::(__call__) Trying to enter
> sampling method (storage.iscsi.rescan)
> Thread-30::DEBUG::2013-04-23
> 21:36:06,119::misc::1056::SamplingMethod::(__call__) Got in to sampling
> method
> Thread-30::DEBUG::2013-04-23
> 21:36:06,119::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n
> /sbin/iscsiadm -m session -R' (cwd None)
> Thread-30::DEBUG::2013-04-23
> 21:36:06,136::misc::84::Storage.Misc.excCmd::(<lambda>) FAILED: <err> =
> 'iscsiadm: No session found.\n'; <rc> = 21
> Thread-30::DEBUG::2013-04-23
> 21:36:06,136::misc::1064::SamplingMethod::(__call__) Returning last result
> MainProcess|Thread-30::DEBUG::2013-04-23
> 21:36:06,139::misc::84::Storage.Misc.excCmd::(<lambda>) '/bin/dd
> of=/sys/class/scsi_host/host0/scan' (cwd None)
> MainProcess|Thread-30::DEBUG::2013-04-23
> 21:36:06,142::misc::84::Storage.Misc.excCmd::(<lambda>) '/bin/dd
> of=/sys/class/scsi_host/host1/scan' (cwd None)
> MainProcess|Thread-30::DEBUG::2013-04-23
> 21:36:06,146::misc::84::Storage.Misc.excCmd::(<lambda>) '/bin/dd
> of=/sys/class/scsi_host/host2/scan' (cwd None)
> MainProcess|Thread-30::DEBUG::2013-04-23
> 21:36:06,149::iscsi::402::Storage.ISCSI::(forceIScsiScan) Performing SCSI
> scan, this will take up to 30 seconds
> Thread-30::DEBUG::2013-04-23
> 21:36:08,152::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n
> /sbin/multipath' (cwd None)
> Thread-30::DEBUG::2013-04-23
> 21:36:08,254::misc::84::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> =
> ''; <rc> = 0
> Thread-30::DEBUG::2013-04-23
> 21:36:08,256::lvm::477::OperationMutex::(_invalidateAllPvs) Operation 'lvm
> invalidate operation' got the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:08,256::lvm::479::OperationMutex::(_invalidateAllPvs) Operation 'lvm
> invalidate operation' released the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:08,257::lvm::488::OperationMutex::(_invalidateAllVgs) Operation 'lvm
> invalidate operation' got the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:08,257::lvm::490::OperationMutex::(_invalidateAllVgs) Operation 'lvm
> invalidate operation' released the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:08,258::lvm::508::OperationMutex::(_invalidateAllLvs) Operation 'lvm
> invalidate operation' got the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:08,258::lvm::510::OperationMutex::(_invalidateAllLvs) Operation 'lvm
> invalidate operation' released the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:08,258::misc::1064::SamplingMethod::(__call__) Returning last result
> Thread-30::DEBUG::2013-04-23
> 21:36:08,259::lvm::368::OperationMutex::(_reloadvgs) Operation 'lvm reload
> operation' got the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:08,261::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n
> /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"]
> ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3
> filter = [ \\"r%.*%\\" ] } global { locking_type=1
> prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50
> retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o
> uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free
> 774e3604-f449-4b3e-8c06-7cd16f98720c' (cwd None)
> Thread-30::DEBUG::2013-04-23
> 21:36:08,514::misc::84::Storage.Misc.excCmd::(<lambda>) FAILED: <err> = '
> Volume group "774e3604-f449-4b3e-8c06-7cd16f98720c" not found\n'; <rc> = 5
> Thread-30::WARNING::2013-04-23
> 21:36:08,516::lvm::373::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] ['
> Volume group "774e3604-f449-4b3e-8c06-7cd16f98720c" not found']
> Thread-30::DEBUG::2013-04-23
> 21:36:08,518::lvm::397::OperationMutex::(_reloadvgs) Operation 'lvm reload
> operation' released the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:08,524::resourceManager::557::ResourceManager::(releaseResource)
> Trying to release resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f'
> Thread-30::DEBUG::2013-04-23
> 21:36:08,525::resourceManager::573::ResourceManager::(releaseResource)
> Released resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f' (0 active
> users)
> Thread-30::DEBUG::2013-04-23
> 21:36:08,525::resourceManager::578::ResourceManager::(releaseResource)
> Resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f' is free, finding
> out if anyone is waiting for it.
> Thread-30::DEBUG::2013-04-23
> 21:36:08,525::resourceManager::585::ResourceManager::(releaseResource) No
> one is waiting for resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f',
> Clearing records.
> Thread-30::ERROR::2013-04-23
> 21:36:08,526::task::833::TaskManager.Task::(_setError)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Unexpected error
> Traceback (most recent call last):
> File "/usr/share/vdsm/storage/task.py", line 840, in _run
> return fn(*args, **kargs)
> File "/usr/share/vdsm/logUtils.py", line 42, in wrapper
> res = f(*args, **kwargs)
> File "/usr/share/vdsm/storage/hsm.py", line 926, in connectStoragePool
> masterVersion, options)
> File "/usr/share/vdsm/storage/hsm.py", line 973, in _connectStoragePool
> res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
> File "/usr/share/vdsm/storage/sp.py", line 642, in connect
> self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
> File "/usr/share/vdsm/storage/sp.py", line 1166, in __rebuild
> self.masterDomain = self.getMasterDomain(msdUUID=msdUUID,
> masterVersion=masterVersion)
> File "/usr/share/vdsm/storage/sp.py", line 1505, in getMasterDomain
> raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
> StoragePoolMasterNotFound: Cannot find master domain:
> 'spUUID=0f63de0e-7d98-48ce-99ec-add109f83c4f,
> msdUUID=774e3604-f449-4b3e-8c06-7cd16f98720c'
> Thread-30::DEBUG::2013-04-23
> 21:36:08,527::task::852::TaskManager.Task::(_run)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Task._run:
> f551fa3f-9d8c-4de3-895a-964c821060d4
> ('0f63de0e-7d98-48ce-99ec-add109f83c4f', 1,
> '0f63de0e-7d98-48ce-99ec-add109f83c4f',
> '774e3604-f449-4b3e-8c06-7cd16f98720c', 73) {} failed - stopping task
> Thread-30::DEBUG::2013-04-23
> 21:36:08,528::task::1177::TaskManager.Task::(stop)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::stopping in state preparing
> (force False)
> Thread-30::DEBUG::2013-04-23
> 21:36:08,528::task::957::TaskManager.Task::(_decref)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::ref 1 aborting True
> Thread-30::INFO::2013-04-23
> 21:36:08,528::task::1134::TaskManager.Task::(prepare)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::aborting: Task is aborted:
> 'Cannot find master domain' - code 304
> Thread-30::DEBUG::2013-04-23
> 21:36:08,529::task::1139::TaskManager.Task::(prepare)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Prepare: aborted: Cannot find
> master domain
> Thread-30::DEBUG::2013-04-23
> 21:36:08,529::task::957::TaskManager.Task::(_decref)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::ref 0 aborting True
> Thread-30::DEBUG::2013-04-23
> 21:36:08,529::task::892::TaskManager.Task::(_doAbort)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Task._doAbort: force False
> Thread-30::DEBUG::2013-04-23
> 21:36:08,530::resourceManager::864::ResourceManager.Owner::(cancelAll)
> Owner.cancelAll requests {}
> Thread-30::DEBUG::2013-04-23
> 21:36:08,530::task::568::TaskManager.Task::(_updateState)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::moving from state preparing ->
> state aborting
> Thread-30::DEBUG::2013-04-23
> 21:36:08,530::task::523::TaskManager.Task::(__state_aborting)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::_aborting: recover policy none
> Thread-30::DEBUG::2013-04-23
> 21:36:08,531::task::568::TaskManager.Task::(_updateState)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::moving from state aborting ->
> state failed
> Thread-30::DEBUG::2013-04-23
> 21:36:08,531::resourceManager::830::ResourceManager.Owner::(releaseAll)
> Owner.releaseAll requests {} resources {}
> Thread-30::DEBUG::2013-04-23
> 21:36:08,531::resourceManager::864::ResourceManager.Owner::(cancelAll)
> Owner.cancelAll requests {}
> Thread-30::ERROR::2013-04-23
> 21:36:08,532::dispatcher::67::Storage.Dispatcher.Protect::(run) {'status':
> {'message': "Cannot find master domain:
> 'spUUID=0f63de0e-7d98-48ce-99ec-add109f83c4f,
> msdUUID=774e3604-f449-4b3e-8c06-7cd16f98720c'", 'code': 304}}
> [root@vmserver3 vdsm]#
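
Since the vdsm log is mostly DEBUG noise, a small filter helps to pull out
only the WARNING and ERROR entries. This is a rough Python sketch, assuming
each entry in the real log file sits on one line with "::"-separated fields
(the wrapping above comes from the mail client). The function names are
hypothetical and not part of vdsm.

```python
def parse_vdsm_line(line):
    """Split one vdsm log entry into its fields.

    Returns None for lines that are not entry headers, such as
    traceback continuation lines.
    """
    parts = line.rstrip("\n").split("::", 6)
    if len(parts) < 7:
        return None
    thread, level, timestamp, module, lineno, logger, message = parts
    if not lineno.isdigit():
        return None
    return {
        "thread": thread,
        "level": level,
        "timestamp": timestamp,
        "module": module,
        "lineno": int(lineno),
        "logger": logger,
        "message": message,
    }


def find_problems(lines, levels=("WARNING", "ERROR")):
    """Return the parsed entries whose level is in `levels`."""
    entries = (parse_vdsm_line(line) for line in lines)
    return [e for e in entries if e is not None and e["level"] in levels]


# Entries taken from the log above, rejoined onto single lines.
sample_lines = [
    "Thread-30::DEBUG::2013-04-23 21:36:08,518::lvm::397::OperationMutex::"
    "(_reloadvgs) Operation 'lvm reload operation' released the operation mutex",
    "Thread-30::WARNING::2013-04-23 21:36:08,516::lvm::373::Storage.LVM::"
    "(_reloadvgs) lvm vgs failed: 5 [] ['Volume group not found']",
    "Thread-30::ERROR::2013-04-23 21:36:08,526::task::833::TaskManager.Task::"
    "(_setError) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Unexpected error",
    "Traceback (most recent call last):",
]

found = find_problems(sample_lines)
for entry in found:
    print(entry["timestamp"], entry["level"], entry["message"])
```

To run it against a real log, pass an open file object (it iterates line by
line) instead of sample_lines.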
>
>