From: "Yeela Kaplan" <ykaplan(a)redhat.com>
To: "Tommy McNeely" <tommythekid(a)gmail.com>
Cc: users(a)ovirt.org
Sent: Thursday, April 25, 2013 10:08:56 AM
Subject: Re: [Users] Master domain locked, error code 304
Hi,
Your problem is that the master domain is locked, so the engine does not send
connectStorageServer to the vdsm host,
and therefore the host does not see the master domain.
You need to change the status of the master domain in the db from locked
while the host is in maintenance.
This can be tricky and is not recommended, because if you do it wrong you
might corrupt the db.
Another, safer way that I recommend is to try running connectStorageServer for the
master SD from vdsClient on the vdsm host and see what happens; it might
solve your problem.
--
Yeela
----- Original Message -----
> From: "Tommy McNeely" <tommythekid(a)gmail.com>
> To: "Juan Jose" <jj197005(a)gmail.com>
> Cc: users(a)ovirt.org
> Sent: Wednesday, April 24, 2013 7:30:20 PM
> Subject: Re: [Users] Master domain locked, error code 304
>
> Hi Juan,
>
> That sounds like a possible path to follow. Our "master" domain does not have
> any VMs in it. If no one else responds with an official path to resolution,
> then I will try going into the database and hacking it like that. I think it
> has something to do with the version or the metadata??
>
> [root@vmserver3 dom_md]# cat metadata
> CLASS=Data
> DESCRIPTION=SFOTestMaster1
> IOOPTIMEOUTSEC=10
> LEASERETRIES=3
> LEASETIMESEC=60
> LOCKPOLICY=
> LOCKRENEWALINTERVALSEC=5
> MASTER_VERSION=1
> POOL_DESCRIPTION=SFODC01
> POOL_DOMAINS=774e3604-f449-4b3e-8c06-7cd16f98720c:Active,758c0abb-ea9a-43fb-bcd9-435f75cd0baa:Active,baa42b1c-ae2e-4486-88a1-e09e1f7a59cb:Active
> POOL_SPM_ID=1
> POOL_SPM_LVER=4
> POOL_UUID=0f63de0e-7d98-48ce-99ec-add109f83c4f
> REMOTE_PATH=10.101.0.148:/c/vpt1-master
> ROLE=Master
> SDUUID=774e3604-f449-4b3e-8c06-7cd16f98720c
> TYPE=NFS
> VERSION=0
> _SHA_CKSUM=fa8ef0e7cd5e50e107384a146e4bfc838d24ba08
>
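A note on the version question above: the metadata reports MASTER_VERSION=1,
while the connectStoragePool call in the vdsm log further down is made with
masterVersion=73. The thread does not establish whether that difference
matters here, so the following is only a minimal Python sketch for putting
the two numbers side by side. The function name parse_domain_metadata and the
trimmed sample are illustrative assumptions, not part of vdsm.

```python
# Hypothetical helper, not part of vdsm: parse the KEY=VALUE lines of a
# storage domain's dom_md/metadata file, tolerating mail-style "> " quoting.
def parse_domain_metadata(text):
    fields = {}
    for raw in text.splitlines():
        line = raw.lstrip("> ").strip()
        if "=" not in line:
            continue  # skip blank lines and anything that is not KEY=VALUE
        key, _, value = line.partition("=")
        fields[key] = value
    return fields


# Trimmed copy of the metadata posted above.
SAMPLE = """\
> CLASS=Data
> LOCKPOLICY=
> MASTER_VERSION=1
> ROLE=Master
> SDUUID=774e3604-f449-4b3e-8c06-7cd16f98720c
> VERSION=0
"""

meta = parse_domain_metadata(SAMPLE)
on_disk_version = int(meta["MASTER_VERSION"])
requested_version = 73  # masterVersion passed to connectStoragePool in the vdsm log
versions_differ = on_disk_version != requested_version

print("role:", meta["ROLE"])
print("master version on disk:", on_disk_version, "requested:", requested_version)
print("versions differ:", versions_differ)
```

On a host, the same function could be fed the real contents of dom_md/metadata
instead of the pasted sample.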
>
> On Wed, Apr 24, 2013 at 5:57 AM, Juan Jose < jj197005(a)gmail.com > wrote:
>
>
>
> Hello Tommy,
>
> I had a similar experience, and after trying to recover my storage domain I
> realized that my VMs were missing. You have to verify whether your VM disks are
> inside your storage domain. In my case, I had to add a new storage
> domain as the master domain to be able to remove the old VMs from the DB and
> reattach the old storage domain. I hope this is not your case. If you
> haven't lost your VMs, it's possible that you can recover them.
>
> Good luck,
>
> Juanjo.
>
>
> On Wed, Apr 24, 2013 at 6:43 AM, Tommy McNeely < tommythekid(a)gmail.com >
> wrote:
>
>
>
>
> We had a hard crash (network, then power) on our 2-node oVirt cluster. We
> have an NFS datastore on CentOS 6 (3.2.0-1.39.el6). We can no longer get the
> hosts to activate. They are unable to activate the "master" domain. The
> master storage domain shows "locked" while the other storage domains show
> Unknown (disks) and inactive (ISO). All the domains are on the same NFS
> server; we are able to mount it, and the permissions are good. We believe we
> might be getting bitten by
> https://bugzilla.redhat.com/show_bug.cgi?id=920694
> or
> http://gerrit.ovirt.org/#/c/13709/ which says to cease working on it:
>
> Michael Kublin Apr 10
>
>
> Patch Set 5: Do not submit
>
> Liron, please abandon this work. This interacts with host life cycle which
> will be changed, during a change a following problem will be solved as
> well.
>
>
> So, we were wondering what we can do to get our oVirt back online, or rather
> what the correct way is to solve this. We have a few VMs that are down, which
> we are looking for ways to recover as quickly as possible.
>
> Thanks in advance,
> Tommy
>
> Here are the ovirt-engine logs:
>
> 2013-04-23 21:30:04,041 ERROR
> [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-3-thread-49) Command
> ConnectStoragePoolVDS execution failed. Exception:
> IRSNoMasterDomainException: IRSGenericException: IRSErrorException:
> IRSNoMasterDomainException: Cannot find master domain:
> 'spUUID=0f63de0e-7d98-48ce-99ec-add109f83c4f,
> msdUUID=774e3604-f449-4b3e-8c06-7cd16f98720c'
> 2013-04-23 21:30:04,043 INFO
> [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand]
> (pool-3-thread-49) FINISH, ConnectStoragePoolVDSCommand, log id: 50524b34
> 2013-04-23 21:30:04,049 WARN
> [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand]
> (pool-3-thread-49) [7c5867d6] CanDoAction of action ReconstructMasterDomain
> failed.
> Reasons:VAR__ACTION__RECONSTRUCT_MASTER,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_STORAGE_DOMAIN_STATUS_ILLEGAL2,$status
> Locked
>
>
Hi, a domain stuck in status Locked is a bug, and it is not directly related
to the discussed patch.
No action in vdsm can help in this situation; please do the following:
If domains are marked as Locked in the GUI, they should be unlocked in the DB.
My advice is to put the host in maintenance; after that,
please run the following query: update storage_pool_iso_map set status = 0 where
storage_id=...
(Info about domains is located in the storage_domain_static
table.)
Then activate the host. The host should try to connect to all storage and to the pool again,
and reconstruct will run and, I hope,
succeed.
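To make the effect of that query concrete, here is a minimal Python sketch
that runs the same UPDATE against a throwaway in-memory SQLite table. This is
not the engine database: the real one is PostgreSQL, the mock table only has
the two columns named in the query, the storage IDs are hypothetical
placeholders, and the numeric codes other than 0 are assumptions that should
be checked against your own engine version before touching real data.

```python
import sqlite3

# Assumed status codes for illustration only; the thread only states that
# the query sets status = 0.
STATUS_AFTER_UPDATE = 0
ASSUMED_LOCKED = 5   # assumption, not confirmed in the thread
ASSUMED_OTHER = 3    # arbitrary placeholder for an unaffected domain

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE storage_pool_iso_map (storage_id TEXT PRIMARY KEY, status INTEGER)"
)
conn.executemany(
    "INSERT INTO storage_pool_iso_map VALUES (?, ?)",
    [
        ("locked-domain-id", ASSUMED_LOCKED),   # hypothetical ID
        ("other-domain-id", ASSUMED_OTHER),     # hypothetical ID
    ],
)


def unlock_domain(connection, storage_id):
    """Apply the UPDATE from the message above to one domain; return rows changed."""
    with connection:  # commits on success, rolls back on error
        cur = connection.execute(
            "UPDATE storage_pool_iso_map SET status = ? WHERE storage_id = ?",
            (STATUS_AFTER_UPDATE, storage_id),
        )
        return cur.rowcount


changed = unlock_domain(conn, "locked-domain-id")
statuses = dict(conn.execute("SELECT storage_id, status FROM storage_pool_iso_map"))
print("rows changed:", changed)
print(statuses)
```

The WHERE clause is what keeps the change limited to one domain; checking the
returned row count before moving on is a cheap guard against a mistyped ID.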
>
> Here are the logs from vdsm:
>
> Thread-29::DEBUG::2013-04-23
> 21:36:05,906::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n
> /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3
> 10.101.0.148:/c/vpt1-vmdisks1
> /rhev/data-center/mnt/10.101.0.148:_c_vpt1-vmdisks1' (cwd None)
> Thread-29::DEBUG::2013-04-23
> 21:36:06,008::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n
> /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3
> 10.101.0.148:/c/vpool-iso /rhev/data-center/mnt/10.101.0.148:_c_vpool-iso'
> (cwd None)
> Thread-29::INFO::2013-04-23
> 21:36:06,065::logUtils::44::dispatcher::(wrapper)
> Run and protect: connectStorageServer, Return response: {'statuslist':
> [{'status': 0, 'id': '7c19bd42-c3dc-41b9-b81b-d9b75214b8dc'}, {'status': 0,
> 'id': 'eff2ef61-0b12-4429-b087-8742be17ae90'}]}
> Thread-29::DEBUG::2013-04-23
> 21:36:06,071::task::1151::TaskManager.Task::(prepare)
> Task=`48337e40-2446-4357-b6dc-2c86f4da67e2`::finished: {'statuslist':
> [{'status': 0, 'id': '7c19bd42-c3dc-41b9-b81b-d9b75214b8dc'}, {'status': 0,
> 'id': 'eff2ef61-0b12-4429-b087-8742be17ae90'}]}
> Thread-29::DEBUG::2013-04-23
> 21:36:06,071::task::568::TaskManager.Task::(_updateState)
> Task=`48337e40-2446-4357-b6dc-2c86f4da67e2`::moving from state preparing ->
> state finished
> Thread-29::DEBUG::2013-04-23
> 21:36:06,071::resourceManager::830::ResourceManager.Owner::(releaseAll)
> Owner.releaseAll requests {} resources {}
> Thread-29::DEBUG::2013-04-23
> 21:36:06,072::resourceManager::864::ResourceManager.Owner::(cancelAll)
> Owner.cancelAll requests {}
> Thread-29::DEBUG::2013-04-23
> 21:36:06,072::task::957::TaskManager.Task::(_decref)
> Task=`48337e40-2446-4357-b6dc-2c86f4da67e2`::ref 0 aborting False
> Thread-30::DEBUG::2013-04-23
> 21:36:06,112::BindingXMLRPC::161::vds::(wrapper)
> [10.101.0.197]
> Thread-30::DEBUG::2013-04-23
> 21:36:06,112::task::568::TaskManager.Task::(_updateState)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::moving from state init ->
> state preparing
> Thread-30::INFO::2013-04-23
> 21:36:06,113::logUtils::41::dispatcher::(wrapper)
> Run and protect:
> connectStoragePool(spUUID='0f63de0e-7d98-48ce-99ec-add109f83c4f', hostID=1,
> scsiKey='0f63de0e-7d98-48ce-99ec-add109f83c4f',
> msdUUID='774e3604-f449-4b3e-8c06-7cd16f98720c', masterVersion=73,
> options=None)
> Thread-30::DEBUG::2013-04-23
> 21:36:06,113::resourceManager::190::ResourceManager.Request::(__init__)
> ResName=`Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f`ReqID=`ee74329a-0a92-465a-be50-b8acc6d7246a`::Request
> was made in '/usr/share/vdsm/storage/resourceManager.py' line '189' at
> '__init__'
> Thread-30::DEBUG::2013-04-23
> 21:36:06,114::resourceManager::504::ResourceManager::(registerResource)
> Trying to register resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f'
> for lock type 'exclusive'
> Thread-30::DEBUG::2013-04-23
> 21:36:06,114::resourceManager::547::ResourceManager::(registerResource)
> Resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f' is free. Now
> locking as 'exclusive' (1 active user)
> Thread-30::DEBUG::2013-04-23
> 21:36:06,114::resourceManager::227::ResourceManager.Request::(grant)
> ResName=`Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f`ReqID=`ee74329a-0a92-465a-be50-b8acc6d7246a`::Granted
> request
> Thread-30::INFO::2013-04-23
> 21:36:06,115::sp::625::Storage.StoragePool::(connect) Connect host #1 to
> the storage pool 0f63de0e-7d98-48ce-99ec-add109f83c4f with master domain:
> 774e3604-f449-4b3e-8c06-7cd16f98720c (ver = 73)
> Thread-30::DEBUG::2013-04-23
> 21:36:06,116::lvm::477::OperationMutex::(_invalidateAllPvs) Operation 'lvm
> invalidate operation' got the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:06,116::lvm::479::OperationMutex::(_invalidateAllPvs) Operation 'lvm
> invalidate operation' released the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:06,117::lvm::488::OperationMutex::(_invalidateAllVgs) Operation 'lvm
> invalidate operation' got the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:06,117::lvm::490::OperationMutex::(_invalidateAllVgs) Operation 'lvm
> invalidate operation' released the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:06,117::lvm::508::OperationMutex::(_invalidateAllLvs) Operation 'lvm
> invalidate operation' got the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:06,118::lvm::510::OperationMutex::(_invalidateAllLvs) Operation 'lvm
> invalidate operation' released the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:06,118::misc::1054::SamplingMethod::(__call__) Trying to enter
> sampling method (storage.sdc.refreshStorage)
> Thread-30::DEBUG::2013-04-23
> 21:36:06,118::misc::1056::SamplingMethod::(__call__) Got in to sampling
> method
> Thread-30::DEBUG::2013-04-23
> 21:36:06,119::misc::1054::SamplingMethod::(__call__) Trying to enter
> sampling method (storage.iscsi.rescan)
> Thread-30::DEBUG::2013-04-23
> 21:36:06,119::misc::1056::SamplingMethod::(__call__) Got in to sampling
> method
> Thread-30::DEBUG::2013-04-23
> 21:36:06,119::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n
> /sbin/iscsiadm -m session -R' (cwd None)
> Thread-30::DEBUG::2013-04-23
> 21:36:06,136::misc::84::Storage.Misc.excCmd::(<lambda>) FAILED: <err> =
> 'iscsiadm: No session found.\n'; <rc> = 21
> Thread-30::DEBUG::2013-04-23
> 21:36:06,136::misc::1064::SamplingMethod::(__call__) Returning last result
> MainProcess|Thread-30::DEBUG::2013-04-23
> 21:36:06,139::misc::84::Storage.Misc.excCmd::(<lambda>) '/bin/dd
> of=/sys/class/scsi_host/host0/scan' (cwd None)
> MainProcess|Thread-30::DEBUG::2013-04-23
> 21:36:06,142::misc::84::Storage.Misc.excCmd::(<lambda>) '/bin/dd
> of=/sys/class/scsi_host/host1/scan' (cwd None)
> MainProcess|Thread-30::DEBUG::2013-04-23
> 21:36:06,146::misc::84::Storage.Misc.excCmd::(<lambda>) '/bin/dd
> of=/sys/class/scsi_host/host2/scan' (cwd None)
> MainProcess|Thread-30::DEBUG::2013-04-23
> 21:36:06,149::iscsi::402::Storage.ISCSI::(forceIScsiScan) Performing SCSI
> scan, this will take up to 30 seconds
> Thread-30::DEBUG::2013-04-23
> 21:36:08,152::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n
> /sbin/multipath' (cwd None)
> Thread-30::DEBUG::2013-04-23
> 21:36:08,254::misc::84::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err>
> = ''; <rc> = 0
> Thread-30::DEBUG::2013-04-23
> 21:36:08,256::lvm::477::OperationMutex::(_invalidateAllPvs) Operation 'lvm
> invalidate operation' got the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:08,256::lvm::479::OperationMutex::(_invalidateAllPvs) Operation 'lvm
> invalidate operation' released the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:08,257::lvm::488::OperationMutex::(_invalidateAllVgs) Operation 'lvm
> invalidate operation' got the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:08,257::lvm::490::OperationMutex::(_invalidateAllVgs) Operation 'lvm
> invalidate operation' released the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:08,258::lvm::508::OperationMutex::(_invalidateAllLvs) Operation 'lvm
> invalidate operation' got the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:08,258::lvm::510::OperationMutex::(_invalidateAllLvs) Operation 'lvm
> invalidate operation' released the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:08,258::misc::1064::SamplingMethod::(__call__) Returning last result
> Thread-30::DEBUG::2013-04-23
> 21:36:08,259::lvm::368::OperationMutex::(_reloadvgs) Operation 'lvm reload
> operation' got the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:08,261::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n
> /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"]
> ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3
> filter = [ \\"r%.*%\\" ] } global { locking_type=1 prioritise_write_locks=1
> wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " --noheadings
> --units b --nosuffix --separator | -o
> uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free
> 774e3604-f449-4b3e-8c06-7cd16f98720c' (cwd None)
> Thread-30::DEBUG::2013-04-23
> 21:36:08,514::misc::84::Storage.Misc.excCmd::(<lambda>) FAILED: <err> =
> ' Volume group "774e3604-f449-4b3e-8c06-7cd16f98720c" not found\n'; <rc> = 5
> Thread-30::WARNING::2013-04-23
> 21:36:08,516::lvm::373::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] ['
> Volume group "774e3604-f449-4b3e-8c06-7cd16f98720c" not found']
> Thread-30::DEBUG::2013-04-23
> 21:36:08,518::lvm::397::OperationMutex::(_reloadvgs) Operation 'lvm reload
> operation' released the operation mutex
> Thread-30::DEBUG::2013-04-23
> 21:36:08,524::resourceManager::557::ResourceManager::(releaseResource)
> Trying to release resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f'
> Thread-30::DEBUG::2013-04-23
> 21:36:08,525::resourceManager::573::ResourceManager::(releaseResource)
> Released resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f' (0 active
> users)
> Thread-30::DEBUG::2013-04-23
> 21:36:08,525::resourceManager::578::ResourceManager::(releaseResource)
> Resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f' is free, finding
> out if anyone is waiting for it.
> Thread-30::DEBUG::2013-04-23
> 21:36:08,525::resourceManager::585::ResourceManager::(releaseResource) No
> one is waiting for resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f',
> Clearing records.
> Thread-30::ERROR::2013-04-23
> 21:36:08,526::task::833::TaskManager.Task::(_setError)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Unexpected error
> Traceback (most recent call last):
> File "/usr/share/vdsm/storage/task.py", line 840, in _run
> return fn(*args, **kargs)
> File "/usr/share/vdsm/logUtils.py", line 42, in wrapper
> res = f(*args, **kwargs)
> File "/usr/share/vdsm/storage/hsm.py", line 926, in connectStoragePool
> masterVersion, options)
> File "/usr/share/vdsm/storage/hsm.py", line 973, in _connectStoragePool
> res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
> File "/usr/share/vdsm/storage/sp.py", line 642, in connect
> self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
> File "/usr/share/vdsm/storage/sp.py", line 1166, in __rebuild
> self.masterDomain = self.getMasterDomain(msdUUID=msdUUID,
> masterVersion=masterVersion)
> File "/usr/share/vdsm/storage/sp.py", line 1505, in getMasterDomain
> raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
> StoragePoolMasterNotFound: Cannot find master domain:
> 'spUUID=0f63de0e-7d98-48ce-99ec-add109f83c4f,
> msdUUID=774e3604-f449-4b3e-8c06-7cd16f98720c'
> Thread-30::DEBUG::2013-04-23
> 21:36:08,527::task::852::TaskManager.Task::(_run)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Task._run:
> f551fa3f-9d8c-4de3-895a-964c821060d4
> ('0f63de0e-7d98-48ce-99ec-add109f83c4f', 1,
> '0f63de0e-7d98-48ce-99ec-add109f83c4f',
> '774e3604-f449-4b3e-8c06-7cd16f98720c', 73) {} failed - stopping task
> Thread-30::DEBUG::2013-04-23
> 21:36:08,528::task::1177::TaskManager.Task::(stop)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::stopping in state preparing
> (force False)
> Thread-30::DEBUG::2013-04-23
> 21:36:08,528::task::957::TaskManager.Task::(_decref)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::ref 1 aborting True
> Thread-30::INFO::2013-04-23
> 21:36:08,528::task::1134::TaskManager.Task::(prepare)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::aborting: Task is aborted:
> 'Cannot find master domain' - code 304
> Thread-30::DEBUG::2013-04-23
> 21:36:08,529::task::1139::TaskManager.Task::(prepare)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Prepare: aborted: Cannot find
> master domain
> Thread-30::DEBUG::2013-04-23
> 21:36:08,529::task::957::TaskManager.Task::(_decref)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::ref 0 aborting True
> Thread-30::DEBUG::2013-04-23
> 21:36:08,529::task::892::TaskManager.Task::(_doAbort)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Task._doAbort: force False
> Thread-30::DEBUG::2013-04-23
> 21:36:08,530::resourceManager::864::ResourceManager.Owner::(cancelAll)
> Owner.cancelAll requests {}
> Thread-30::DEBUG::2013-04-23
> 21:36:08,530::task::568::TaskManager.Task::(_updateState)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::moving from state preparing ->
> state aborting
> Thread-30::DEBUG::2013-04-23
> 21:36:08,530::task::523::TaskManager.Task::(__state_aborting)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::_aborting: recover policy none
> Thread-30::DEBUG::2013-04-23
> 21:36:08,531::task::568::TaskManager.Task::(_updateState)
> Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::moving from state aborting ->
> state failed
> Thread-30::DEBUG::2013-04-23
> 21:36:08,531::resourceManager::830::ResourceManager.Owner::(releaseAll)
> Owner.releaseAll requests {} resources {}
> Thread-30::DEBUG::2013-04-23
> 21:36:08,531::resourceManager::864::ResourceManager.Owner::(cancelAll)
> Owner.cancelAll requests {}
> Thread-30::ERROR::2013-04-23
> 21:36:08,532::dispatcher::67::Storage.Dispatcher.Protect::(run) {'status':
> {'message': "Cannot find master domain:
> 'spUUID=0f63de0e-7d98-48ce-99ec-add109f83c4f,
> msdUUID=774e3604-f449-4b3e-8c06-7cd16f98720c'", 'code': 304}}
> [root@vmserver3 vdsm]#
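One thing that can be read off the log excerpt above: it shows NFS mounts for
vpt1-vmdisks1 and vpool-iso, but none for the master domain's REMOTE_PATH
(10.101.0.148:/c/vpt1-master) from the metadata. That fits the explanation at
the top of this thread that the engine is not sending connectStorageServer for
the master domain, although the excerpt is only the tail of the log and an
earlier mount cannot be ruled out. A minimal Python sketch of that check
follows; the function name mounted_exports and the regular expression are
illustrative assumptions, not vdsm code.

```python
import re

# Matches the NFS mount commands as they appear in the excerpt and captures
# the export (server:/path) that follows the -o option string.
MOUNT_RE = re.compile(r"/bin/mount\s+-t\s+nfs\s+-o\s+\S+\s+(\S+)\s+\S+")


def mounted_exports(log_text):
    """Return the NFS exports mounted in a (possibly mail-wrapped) vdsm log."""
    # Undo mail quoting and line wrapping so each command is one logical line.
    flat = " ".join(line.lstrip("> ").strip() for line in log_text.splitlines())
    return MOUNT_RE.findall(flat)


# The two mount commands copied from the excerpt above.
LOG_EXCERPT = """\
> Thread-29::DEBUG::2013-04-23
> 21:36:05,906::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n
> /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3
> 10.101.0.148:/c/vpt1-vmdisks1
> /rhev/data-center/mnt/10.101.0.148:_c_vpt1-vmdisks1' (cwd None)
> Thread-29::DEBUG::2013-04-23
> 21:36:06,008::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n
> /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3
> 10.101.0.148:/c/vpool-iso /rhev/data-center/mnt/10.101.0.148:_c_vpool-iso'
> (cwd None)
"""

MASTER_REMOTE_PATH = "10.101.0.148:/c/vpt1-master"  # REMOTE_PATH from the metadata

exports = mounted_exports(LOG_EXCERPT)
master_missing = MASTER_REMOTE_PATH not in exports

print("mounted exports:", exports)
print("master export missing from excerpt:", master_missing)
```

Running the same function over the full vdsm.log rather than the pasted tail
would show whether the master export was ever mounted in that session.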
>
>
> _______________________________________________
> Users mailing list
> Users(a)ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users