[Users] Master domain locked, error code 304

Yeela Kaplan ykaplan at redhat.com
Thu Apr 25 07:08:56 UTC 2013


Hi,
Your problem is that the master domain is locked, so the engine does not send connectStorageServer to the vdsm host,
and therefore the host does not see the master domain.
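You can actually see this in your vdsm log: connectStorageServer only mounted
vpt1-vmdisks1 and vpool-iso, not the master export (vpt1-master). A quick check
on the host should confirm it (paths taken from your logs and metadata):

    # The master export 10.101.0.148:/c/vpt1-master should be missing here:
    mount | grep '10.101.0.148:/c'
    ls /rhev/data-center/mnt/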
One option is to change the status of the master domain in the db from Locked while the host is in maintenance.
This can be tricky and is not really recommended, because if you get it wrong you might corrupt the db.
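Something like this is what I have in mind (a rough sketch only -- the table
name and the status codes are from memory, so verify them against your engine
version's schema and the StorageDomainStatus enum, and take a db backup first):

    # On the engine machine, with the hosts in maintenance and ovirt-engine
    # stopped. Back up the engine db before touching anything:
    pg_dump -U postgres -f /root/engine-db-backup.sql engine

    # Check the current per-pool status of the master domain
    # (storage_pool_iso_map is where 3.x keeps it, as far as I remember):
    psql -U postgres engine -c "
        SELECT storage_id, storage_pool_id, status
        FROM storage_pool_iso_map
        WHERE storage_id = '774e3604-f449-4b3e-8c06-7cd16f98720c';"

    # Only after you have confirmed the numeric code for the target state
    # (e.g. Unknown/Inactive) in your engine build, move it out of Locked:
    # psql -U postgres engine -c "UPDATE storage_pool_iso_map
    #     SET status = <verified value>
    #     WHERE storage_id = '774e3604-f449-4b3e-8c06-7cd16f98720c';"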
Another, safer, way that I recommend is to run connectStorageServer for the master SD from vdsClient on the vdsm host and see what happens; it might solve your problem.
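For the vdsClient route, roughly like this on the vdsm host (the storage type
code, the connection id and the conList format are assumptions on my part --
run vdsClient with no arguments to see the exact syntax your version expects):

    # On the vdsm host, as root. Type 1 should be NFS; the id is just a
    # placeholder (I reused the domain UUID), and the spUUID and remote path
    # are the ones from your metadata and engine logs:
    vdsClient -s 0 connectStorageServer 1 0f63de0e-7d98-48ce-99ec-add109f83c4f \
        'id=774e3604-f449-4b3e-8c06-7cd16f98720c,connection=10.101.0.148:/c/vpt1-master,portal=,port=,iqn=,user=,password='

    # If that returns status 0, check whether vdsm can now see the master domain:
    vdsClient -s 0 getStorageDomainInfo 774e3604-f449-4b3e-8c06-7cd16f98720c
    vdsClient -s 0 getStorageDomainsList

If the master domain shows up there, try activating the host from the engine again.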

--
Yeela

----- Original Message -----
> From: "Tommy McNeely" <tommythekid at gmail.com>
> To: "Juan Jose" <jj197005 at gmail.com>
> Cc: users at ovirt.org
> Sent: Wednesday, April 24, 2013 7:30:20 PM
> Subject: Re: [Users] Master domain locked, error code 304
> 
> Hi Juan,
> 
> That sounds like a possible path to follow. Our "master" domain does not have
> any VMs in it. If no one else responds with an official path to resolution,
> then I will try going into the database and hacking it like that. I think it
> has something to do with the version or the metadata??
> 
> [root at vmserver3 dom_md]# cat metadata
> CLASS=Data
> DESCRIPTION=SFOTestMaster1
> IOOPTIMEOUTSEC=10
> LEASERETRIES=3
> LEASETIMESEC=60
> LOCKPOLICY=
> LOCKRENEWALINTERVALSEC=5
> MASTER_VERSION=1
> POOL_DESCRIPTION=SFODC01
> POOL_DOMAINS=774e3604-f449-4b3e-8c06-7cd16f98720c:Active,758c0abb-ea9a-43fb-bcd9-435f75cd0baa:Active,baa42b1c-ae2e-4486-88a1-e09e1f7a59cb:Active
> POOL_SPM_ID=1
> POOL_SPM_LVER=4
> POOL_UUID=0f63de0e-7d98-48ce-99ec-add109f83c4f
> REMOTE_PATH=10.101.0.148:/c/vpt1-master
> ROLE=Master
> SDUUID=774e3604-f449-4b3e-8c06-7cd16f98720c
> TYPE=NFS
> VERSION=0
> _SHA_CKSUM=fa8ef0e7cd5e50e107384a146e4bfc838d24ba08
> 
> 
> On Wed, Apr 24, 2013 at 5:57 AM, Juan Jose < jj197005 at gmail.com > wrote:
> 
> 
> 
> Hello Tommy,
> 
> I had a similar experience, and after trying to recover my storage domain I
> realized that my VMs were gone. You have to verify whether your VM disks are
> still inside your storage domain. In my case, I had to add a new Storage
> domain as Master domain to be able to remove the old VMs from the DB and
> reattach the old storage domain. I hope this is not your case. If you
> haven't lost your VMs, it should be possible to recover them.
> 
> Good luck,
> 
> Juanjo.
> 
> 
> On Wed, Apr 24, 2013 at 6:43 AM, Tommy McNeely < tommythekid at gmail.com >
> wrote:
> 
> 
> 
> 
> We had a hard crash (network, then power) on our 2-node oVirt cluster. We
> have an NFS datastore on CentOS 6 (3.2.0-1.39.el6). We can no longer get the
> hosts to activate; they are unable to activate the "master" domain. The
> master storage domain shows "Locked" while the other storage domains show
> Unknown (disks) and Inactive (ISO). All the domains are on the same NFS
> server, we are able to mount it, and the permissions are good. We believe we
> might be getting bitten by https://bugzilla.redhat.com/show_bug.cgi?id=920694
> or http://gerrit.ovirt.org/#/c/13709/, which says to cease working on it:
> 
> Michael Kublin 		Apr 10
> 
> 
> Patch Set 5: Do not submit
> 
> Liron, please abandon this work. This interacts with host life cycle which
> will be changed, during a change a following problem will be solved as well.
> 
> 
> So, we were wondering what we can do to get our oVirt back online, or rather
> what the correct way to solve this is. We have a few VMs that are down, and
> we are looking for ways to recover them as quickly as possible.
> 
> Thanks in advance,
> Tommy
> 
> Here are the ovirt-engine logs:
> 
> 2013-04-23 21:30:04,041 ERROR
> [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-3-thread-49) Command
> ConnectStoragePoolVDS execution failed. Exception:
> IRSNoMasterDomainException: IRSGenericException: IRSErrorException:
> IRSNoMasterDomainException: Cannot find master domain:
> 'spUUID=0f63de0e-7d98-48ce-99ec-add109f83c4f,
> msdUUID=774e3604-f449-4b3e-8c06-7cd16f98720c'
> 2013-04-23 21:30:04,043 INFO
> [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand]
> (pool-3-thread-49) FINISH, ConnectStoragePoolVDSCommand, log id: 50524b34
> 2013-04-23 21:30:04,049 WARN
> [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand]
> (pool-3-thread-49) [7c5867d6] CanDoAction of action ReconstructMasterDomain
> failed.
> Reasons:VAR__ACTION__RECONSTRUCT_MASTER,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_STORAGE_DOMAIN_STATUS_ILLEGAL2,$status
> Locked
> 
> 
> 
> Here are the logs from vdsm:
> 
> Thread-29::DEBUG::2013-04-23
> 21:36:05,906::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n
> /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3
> 10.101.0.148:/c/vpt1-vmdisks1
> /rhev/data-center/mnt/10.101.0.148:_c_vpt1-vmdisks1' (cwd None)
> Thread-29::DEBUG::2013-04-23
> 21:36:06,008::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n
> /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3
> 10.101.0.148:/c/vpool-iso /rhev/data-center/mnt/10.101.0.148:_c_vpool-iso'
> (cwd None)
> Thread-29::INFO::2013-04-23 21:36:06,065::logUtils::44::dispatcher::(wrapper)
> Run and protect: connectStorageServer, Return response: {'statuslist':
> [{'status': 0, 'id': '7c19bd42-c3dc-41b9-b81b-d9b75214b8dc'}, {'status': 0,
> 'id': 'eff2ef61-0b12-4429-b087-8742be17ae90'}]}
> Thread-29::DEBUG::2013-04-23
> 21:36:06,071::task::1151::TaskManager.Task::(prepare)
> Task=`48337e40-2446-4357-b6dc-2c86f4da67e2`::finished: {'statuslist':
> [{'status': 0, 'id': '7c19bd42-c3dc-41b9-b81b-d9b75214b8dc'}, {'status': 0,
> 'id': 'eff2ef61-0b12-4429-b087-8742be17ae90'}]}
> Thread-29::DEBUG::2013-04-23
> 21:36:06,071::task::568::TaskManager.Task::(_updateState)
> Task=`48337e40-2446-4357-b6dc-2c86f4da67e2`::moving from state preparing ->
> state finished
> Thread-29::DEBUG::2013-04-23
> 21:36:06,071::resourceManager::830::ResourceManager.Owner::(releaseAll)
> Owner.releaseAll requests {} resources {}
> Thread-29::DEBUG::2013-04-23
> 21:36:06,072::resourceManager::864::ResourceManager.Owner::(cancelAll)
> Owner.cancelAll requests {}
> Thread-29::DEBUG::2013-04-23
> 21:36:06,072::task::957::TaskManager.Task::(_decref)
> Task=`48337e40-2446-4357-b6dc-2c86f4da67e2`::ref 0 aborting False
> Thread-30::DEBUG::2013-04-23 21:36:06,112::BindingXMLRPC::161::vds::(wrapper)
> [10.101.0.197]
> Thread-30::DEBUG::2013-04-23
> 21:36:06,112::task::568::TaskManager.Task::(_updateState)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::moving from state init -> state
> preparing
> Thread-30::INFO::2013-04-23 21:36:06,113::logUtils::41::dispatcher::(wrapper)
> Run and protect:
> connectStoragePool(spUUID='0f63de0e-7d98-48ce-99ec-add109f83c4f', hostID=1,
> scsiKey='0f63de0e-7d98-48ce-99ec-add109f83c4f',
> msdUUID='774e3604-f449-4b3e-8c06-7cd16f98720c', masterVersion=73,
> options=None)
> Thread-30::DEBUG::2013-04-23
> 21:36:06,113::resourceManager::190::ResourceManager.Request::(__init__)
> ResName=`Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f`ReqID=`ee74329a-0a92-465a-be50-b8acc6d7246a`::Request
> was made in '/usr/share/vdsm/storage/resourceManager.py' line '189' at
> '__init__'
> Thread-30::DEBUG::2013-04-23
> 21:36:06,114::resourceManager::504::ResourceManager::(registerResource)
> Trying to register resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f'
> for lock type 'exclusive'
> Thread-30::DEBUG::2013-04-23
> 21:36:06,114::resourceManager::547::ResourceManager::(registerResource)
> Resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f' is free. Now locking
> as 'exclusive' (1 active user)
> Thread-30::DEBUG::2013-04-23
> 21:36:06,114::resourceManager::227::ResourceManager.Request::(grant)
> ResName=`Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f`ReqID=`ee74329a-0a92-465a-be50-b8acc6d7246a`::Granted
> request
> Thread-30::INFO::2013-04-23
> 21:36:06,115::sp::625::Storage.StoragePool::(connect) Connect host #1 to the
> storage pool 0f63de0e-7d98-48ce-99ec-add109f83c4f with master domain:
> 774e3604-f449-4b3e-8c06-7cd16f98720c (ver = 73)
> Thread-30::DEBUG::2013-04-23
> 21:36:06,116::lvm::477::OperationMutex::(_invalidateAllPvs) Operation 'lvm
> invalidate operation' got the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:06,116::lvm::479::OperationMutex::(_invalidateAllPvs) Operation 'lvm
> invalidate operation' released the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:06,117::lvm::488::OperationMutex::(_invalidateAllVgs) Operation 'lvm
> invalidate operation' got the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:06,117::lvm::490::OperationMutex::(_invalidateAllVgs) Operation 'lvm
> invalidate operation' released the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:06,117::lvm::508::OperationMutex::(_invalidateAllLvs) Operation 'lvm
> invalidate operation' got the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:06,118::lvm::510::OperationMutex::(_invalidateAllLvs) Operation 'lvm
> invalidate operation' released the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:06,118::misc::1054::SamplingMethod::(__call__) Trying to enter
> sampling method (storage.sdc.refreshStorage)
> Thread-30::DEBUG::2013-04-23
> 21:36:06,118::misc::1056::SamplingMethod::(__call__) Got in to sampling
> method
> Thread-30::DEBUG::2013-04-23
> 21:36:06,119::misc::1054::SamplingMethod::(__call__) Trying to enter
> sampling method (storage.iscsi.rescan)
> Thread-30::DEBUG::2013-04-23
> 21:36:06,119::misc::1056::SamplingMethod::(__call__) Got in to sampling
> method
> Thread-30::DEBUG::2013-04-23
> 21:36:06,119::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n
> /sbin/iscsiadm -m session -R' (cwd None)
> Thread-30::DEBUG::2013-04-23
> 21:36:06,136::misc::84::Storage.Misc.excCmd::(<lambda>) FAILED: <err> =
> 'iscsiadm: No session found.\n'; <rc> = 21
> Thread-30::DEBUG::2013-04-23
> 21:36:06,136::misc::1064::SamplingMethod::(__call__) Returning last result
> MainProcess|Thread-30::DEBUG::2013-04-23
> 21:36:06,139::misc::84::Storage.Misc.excCmd::(<lambda>) '/bin/dd
> of=/sys/class/scsi_host/host0/scan' (cwd None)
> MainProcess|Thread-30::DEBUG::2013-04-23
> 21:36:06,142::misc::84::Storage.Misc.excCmd::(<lambda>) '/bin/dd
> of=/sys/class/scsi_host/host1/scan' (cwd None)
> MainProcess|Thread-30::DEBUG::2013-04-23
> 21:36:06,146::misc::84::Storage.Misc.excCmd::(<lambda>) '/bin/dd
> of=/sys/class/scsi_host/host2/scan' (cwd None)
> MainProcess|Thread-30::DEBUG::2013-04-23
> 21:36:06,149::iscsi::402::Storage.ISCSI::(forceIScsiScan) Performing SCSI
> scan, this will take up to 30 seconds
> Thread-30::DEBUG::2013-04-23
> 21:36:08,152::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n
> /sbin/multipath' (cwd None)
> Thread-30::DEBUG::2013-04-23
> 21:36:08,254::misc::84::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = '';
> <rc> = 0
> Thread-30::DEBUG::2013-04-23
> 21:36:08,256::lvm::477::OperationMutex::(_invalidateAllPvs) Operation 'lvm
> invalidate operation' got the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:08,256::lvm::479::OperationMutex::(_invalidateAllPvs) Operation 'lvm
> invalidate operation' released the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:08,257::lvm::488::OperationMutex::(_invalidateAllVgs) Operation 'lvm
> invalidate operation' got the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:08,257::lvm::490::OperationMutex::(_invalidateAllVgs) Operation 'lvm
> invalidate operation' released the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:08,258::lvm::508::OperationMutex::(_invalidateAllLvs) Operation 'lvm
> invalidate operation' got the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:08,258::lvm::510::OperationMutex::(_invalidateAllLvs) Operation 'lvm
> invalidate operation' released the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:08,258::misc::1064::SamplingMethod::(__call__) Returning last result
> Thread-30::DEBUG::2013-04-23
> 21:36:08,259::lvm::368::OperationMutex::(_reloadvgs) Operation 'lvm reload
> operation' got the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:08,261::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n
> /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"]
> ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3
> filter = [ \\"r%.*%\\" ] } global { locking_type=1 prioritise_write_locks=1
> wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " --noheadings
> --units b --nosuffix --separator | -o
> uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free
> 774e3604-f449-4b3e-8c06-7cd16f98720c' (cwd None)
> Thread-30::DEBUG::2013-04-23
> 21:36:08,514::misc::84::Storage.Misc.excCmd::(<lambda>) FAILED: <err> = '
> Volume group "774e3604-f449-4b3e-8c06-7cd16f98720c" not found\n'; <rc> = 5
> Thread-30::WARNING::2013-04-23
> 21:36:08,516::lvm::373::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] ['
> Volume group "774e3604-f449-4b3e-8c06-7cd16f98720c" not found']
> Thread-30::DEBUG::2013-04-23
> 21:36:08,518::lvm::397::OperationMutex::(_reloadvgs) Operation 'lvm reload
> operation' released the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:08,524::resourceManager::557::ResourceManager::(releaseResource)
> Trying to release resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f'
> Thread-30::DEBUG::2013-04-23
> 21:36:08,525::resourceManager::573::ResourceManager::(releaseResource)
> Released resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f' (0 active
> users)
> Thread-30::DEBUG::2013-04-23
> 21:36:08,525::resourceManager::578::ResourceManager::(releaseResource)
> Resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f' is free, finding out
> if anyone is waiting for it.
> Thread-30::DEBUG::2013-04-23
> 21:36:08,525::resourceManager::585::ResourceManager::(releaseResource) No
> one is waiting for resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f',
> Clearing records.
> Thread-30::ERROR::2013-04-23
> 21:36:08,526::task::833::TaskManager.Task::(_setError)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Unexpected error
> Traceback (most recent call last):
> File "/usr/share/vdsm/storage/task.py", line 840, in _run
> return fn(*args, **kargs)
> File "/usr/share/vdsm/logUtils.py", line 42, in wrapper
> res = f(*args, **kwargs)
> File "/usr/share/vdsm/storage/hsm.py", line 926, in connectStoragePool
> masterVersion, options)
> File "/usr/share/vdsm/storage/hsm.py", line 973, in _connectStoragePool
> res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
> File "/usr/share/vdsm/storage/sp.py", line 642, in connect
> self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
> File "/usr/share/vdsm/storage/sp.py", line 1166, in __rebuild
> self.masterDomain = self.getMasterDomain(msdUUID=msdUUID,
> masterVersion=masterVersion)
> File "/usr/share/vdsm/storage/sp.py", line 1505, in getMasterDomain
> raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
> StoragePoolMasterNotFound: Cannot find master domain:
> 'spUUID=0f63de0e-7d98-48ce-99ec-add109f83c4f,
> msdUUID=774e3604-f449-4b3e-8c06-7cd16f98720c'
> Thread-30::DEBUG::2013-04-23
> 21:36:08,527::task::852::TaskManager.Task::(_run)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Task._run:
> f551fa3f-9d8c-4de3-895a-964c821060d4
> ('0f63de0e-7d98-48ce-99ec-add109f83c4f', 1,
> '0f63de0e-7d98-48ce-99ec-add109f83c4f',
> '774e3604-f449-4b3e-8c06-7cd16f98720c', 73) {} failed - stopping task
> Thread-30::DEBUG::2013-04-23
> 21:36:08,528::task::1177::TaskManager.Task::(stop)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::stopping in state preparing
> (force False)
> Thread-30::DEBUG::2013-04-23
> 21:36:08,528::task::957::TaskManager.Task::(_decref)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::ref 1 aborting True
> Thread-30::INFO::2013-04-23
> 21:36:08,528::task::1134::TaskManager.Task::(prepare)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::aborting: Task is aborted:
> 'Cannot find master domain' - code 304
> Thread-30::DEBUG::2013-04-23
> 21:36:08,529::task::1139::TaskManager.Task::(prepare)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Prepare: aborted: Cannot find
> master domain
> Thread-30::DEBUG::2013-04-23
> 21:36:08,529::task::957::TaskManager.Task::(_decref)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::ref 0 aborting True
> Thread-30::DEBUG::2013-04-23
> 21:36:08,529::task::892::TaskManager.Task::(_doAbort)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Task._doAbort: force False
> Thread-30::DEBUG::2013-04-23
> 21:36:08,530::resourceManager::864::ResourceManager.Owner::(cancelAll)
> Owner.cancelAll requests {}
> Thread-30::DEBUG::2013-04-23
> 21:36:08,530::task::568::TaskManager.Task::(_updateState)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::moving from state preparing ->
> state aborting
> Thread-30::DEBUG::2013-04-23
> 21:36:08,530::task::523::TaskManager.Task::(__state_aborting)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::_aborting: recover policy none
> Thread-30::DEBUG::2013-04-23
> 21:36:08,531::task::568::TaskManager.Task::(_updateState)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::moving from state aborting ->
> state failed
> Thread-30::DEBUG::2013-04-23
> 21:36:08,531::resourceManager::830::ResourceManager.Owner::(releaseAll)
> Owner.releaseAll requests {} resources {}
> Thread-30::DEBUG::2013-04-23
> 21:36:08,531::resourceManager::864::ResourceManager.Owner::(cancelAll)
> Owner.cancelAll requests {}
> Thread-30::ERROR::2013-04-23
> 21:36:08,532::dispatcher::67::Storage.Dispatcher.Protect::(run) {'status':
> {'message': "Cannot find master domain:
> 'spUUID=0f63de0e-7d98-48ce-99ec-add109f83c4f,
> msdUUID=774e3604-f449-4b3e-8c06-7cd16f98720c'", 'code': 304}}
> [root at vmserver3 vdsm]#
> 
> 
> _______________________________________________
> Users mailing list
> Users at ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users
> 


