[ovirt-users] ovirt 3.6, we had the ovirt manager go down in a bad way and all VMs for one node marked Unknown and Not Responding while up
Christopher Cox
ccox at endlessnow.com
Mon Feb 5 20:53:36 UTC 2018
Answering my own post... a restart of vdsmd on the affected blade has
fixed everything. Thanks to everyone who helped.
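
For anyone who lands on this thread with the same symptoms: the recovery
really was just the service restart on the affected hypervisor; the running
guests are managed by libvirt/qemu and stayed up the whole time. A minimal
sketch of the step, assuming a systemd-based EL7 host (the script and its
retry loop are my own illustration, not something oVirt ships):

    #!/usr/bin/env python
    # Illustrative helper, run directly on the affected host (not via the engine).
    # Assumes systemd; "vdsmd" is the VDSM service name.
    import subprocess
    import time

    def restart_vdsmd(retries=30, delay=2):
        # Restarting vdsmd does not touch running guests; they stay up under
        # libvirt/qemu while the management agent comes back.
        subprocess.check_call(["systemctl", "restart", "vdsmd"])
        for _ in range(retries):
            if subprocess.call(["systemctl", "is-active", "--quiet", "vdsmd"]) == 0:
                return True
            time.sleep(delay)
        return False

    if __name__ == "__main__":
        if restart_vdsmd():
            print("vdsmd is active; the engine should re-poll the host and clear")
            print("the Unknown / Not Responding flags within a monitoring cycle")
        else:
            raise SystemExit("vdsmd did not come back; check 'journalctl -u vdsmd'")

In our case the engine picked the VMs back up on its own once vdsmd was
answering again; no database edits were needed.
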
On 02/05/2018 10:02 AM, Christopher Cox wrote:
> Forgive the top post. I guess what I need to know now is whether there
> is a recovery path that doesn't lead to total loss of the VMs that are
> currently in the "Unknown" "Not responding" state.
>
> We are planning a total oVirt shutdown. I would just like to know if
> we've effectively lost those VMs or not. Again, the VMs are currently
> "up". And we use a file backup process, so in theory they can be
> restored, just somewhat painfully, from scratch.
>
> But does somebody know: if we shut down all the bad VMs and the blade,
> is there some way oVirt can know the VMs are "ok" to start up? Will
> changing their state directly to "down" in the db stick if the blade is
> down? That is, will we get to a known state where the VMs can actually
> be started and brought back into a known state?
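>
> (For context, the "state" I'm talking about lives in the engine's Postgres
> database. Here's a read-only sketch of how one could watch it, assuming the
> stock "engine" database with its vm_static/vm_dynamic tables, where the
> status column uses the VMStatus values 0 = Down, 7 = Unknown and
> 8 = NotResponding; per the advice in the quoted exchange below, this is
> only for looking, not for writing:
>
>     # illustrative read-only query, run on the engine host; the db
>     # credentials live in /etc/ovirt-engine/engine.conf.d/10-setup-database.conf
>     import psycopg2
>
>     conn = psycopg2.connect(dbname="engine", user="engine",
>                             host="localhost", password="...")
>     cur = conn.cursor()
>     cur.execute("""
>         SELECT s.vm_name, d.status
>           FROM vm_dynamic d
>           JOIN vm_static s ON s.vm_guid = d.vm_guid
>          WHERE d.status IN (7, 8)   -- Unknown / NotResponding
>     """)
>     for name, status in cur.fetchall():
>         print(name, status)
>     conn.close()
>
> That's the view that keeps snapping back to 8 no matter what we write to it.)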
>
> Right now, we're feeling there's a good chance we will not be able to
> recover these VMs, even though they are "up" right now. I really need
> some way to force oVirt into a consistent state, even if it means we
> take the whole thing down.
>
> Possible?
>
>
> On 01/25/2018 06:57 PM, Christopher Cox wrote:
>>
>>
>> On 01/25/2018 04:57 PM, Douglas Landgraf wrote:
>>> On Thu, Jan 25, 2018 at 5:12 PM, Christopher Cox
>>> <ccox at endlessnow.com> wrote:
>>>> On 01/25/2018 02:25 PM, Douglas Landgraf wrote:
>>>>>
>>>>> On Wed, Jan 24, 2018 at 10:18 AM, Christopher Cox
>>>>> <ccox at endlessnow.com>
>>>>> wrote:
>>>>>>
>>>>>> Would restarting vdsm on the node in question help fix this?
>>>>>> Again, all the VMs are up on the node. Prior attempts to fix this
>>>>>> problem have left the node in a state where I can't issue the "has
>>>>>> been rebooted" command to it; it's confused.
>>>>>>
>>>>>> So... node is up. All VMs are up. Can't issue "has been rebooted" to
>>>>>> the node; all VMs show Unknown and not responding, but they are up.
>>>>>>
>>>>>> Changing the status in the ovirt db to 0 works for a second and then
>>>>>> it goes immediately back to 8 (which is why I'm wondering if I should
>>>>>> restart vdsm on the node).
>>>>>
>>>>>
>>>>> It's not recommended to change db manually.
>>>>>
>>>>>>
>>>>>> Oddly enough, we're running all of this in production. So, watching
>>>>>> it all go down isn't the best option for us.
>>>>>>
>>>>>> Any advice is welcome.
>>>>>
>>>>>
>>>>>
>>>>> We would need to see the node/engine logs. Have you found any error
>>>>> in the vdsm.log (from the nodes) or engine.log? Could you please
>>>>> share the error?
>>>>
>>>>
>>>>
>>>> In short, the "error" is that our ovirt manager lost network (our
>>>> problem) and crashed hard (hardware issue on the server). On bring-up,
>>>> we had some network changes (the ones that caused the lost-network
>>>> problem), so our LACP bond was down for a bit while we were trying to
>>>> bring it up (noting the ovirt manager is up while we're reestablishing
>>>> the network on the switch side).
>>>>
>>>> In other words, that's the "error", so to speak, that got us to where
>>>> we are.
>>>>
>>>> Full DEBUG is enabled on the logs... The error messages seem obvious to
>>>> me... It starts like this (noting the ISO DOMAIN was coming off an NFS
>>>> mount off the ovirt management server... yes... we know... we do have
>>>> plans to move that).
>>>>
>>>> So on the hypervisor node itself, from the vdsm.log (vdsm.log.33.xz):
>>>>
>>>> (hopefully no surprise here)
>>>>
>>>> Thread-2426633::WARNING::2018-01-23
>>>> 13:50:56,672::fileSD::749::Storage.scanDomains::(collectMetaFiles)
>>>> Could not
>>>> collect metadata file for domain path
>>>> /rhev/data-center/mnt/d0lppc129.skopos.me:_var_lib_exports_iso-20160408002844
>>>>
>>>> Traceback (most recent call last):
>>>> File "/usr/share/vdsm/storage/fileSD.py", line 735, in
>>>> collectMetaFiles
>>>> sd.DOMAIN_META_DATA))
>>>> File "/usr/share/vdsm/storage/outOfProcess.py", line 121, in glob
>>>> return self._iop.glob(pattern)
>>>> File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py",
>>>> line 536,
>>>> in glob
>>>> return self._sendCommand("glob", {"pattern": pattern},
>>>> self.timeout)
>>>> File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py",
>>>> line 421,
>>>> in _sendCommand
>>>> raise Timeout(os.strerror(errno.ETIMEDOUT))
>>>> Timeout: Connection timed out
>>>> Thread-27::ERROR::2018-01-23
>>>> 13:50:56,672::sdc::145::Storage.StorageDomainCache::(_findDomain)
>>>> domain
>>>> e5ecae2f-5a06-4743-9a43-e74d83992c35 not found
>>>> Traceback (most recent call last):
>>>> File "/usr/share/vdsm/storage/sdc.py", line 143, in _findDomain
>>>> dom = findMethod(sdUUID)
>>>> File "/usr/share/vdsm/storage/nfsSD.py", line 122, in findDomain
>>>> return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
>>>> File "/usr/share/vdsm/storage/nfsSD.py", line 112, in findDomainPath
>>>> raise se.StorageDomainDoesNotExist(sdUUID)
>>>> StorageDomainDoesNotExist: Storage domain does not exist:
>>>> (u'e5ecae2f-5a06-4743-9a43-e74d83992c35',)
>>>> Thread-27::ERROR::2018-01-23
>>>> 13:50:56,673::monitor::276::Storage.Monitor::(_monitorDomain) Error
>>>> monitoring domain e5ecae2f-5a06-4743-9a43-e74d83992c35
>>>> Traceback (most recent call last):
>>>> File "/usr/share/vdsm/storage/monitor.py", line 272, in
>>>> _monitorDomain
>>>> self._performDomainSelftest()
>>>> File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 769, in
>>>> wrapper
>>>> value = meth(self, *a, **kw)
>>>> File "/usr/share/vdsm/storage/monitor.py", line 339, in
>>>> _performDomainSelftest
>>>> self.domain.selftest()
>>>> File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
>>>> return getattr(self.getRealDomain(), attrName)
>>>> File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
>>>> return self._cache._realProduce(self._sdUUID)
>>>> File "/usr/share/vdsm/storage/sdc.py", line 124, in _realProduce
>>>> domain = self._findDomain(sdUUID)
>>>> File "/usr/share/vdsm/storage/sdc.py", line 143, in _findDomain
>>>> dom = findMethod(sdUUID)
>>>> File "/usr/share/vdsm/storage/nfsSD.py", line 122, in findDomain
>>>> return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
>>>> File "/usr/share/vdsm/storage/nfsSD.py", line 112, in findDomainPath
>>>> raise se.StorageDomainDoesNotExist(sdUUID)
>>>> StorageDomainDoesNotExist: Storage domain does not exist:
>>>> (u'e5ecae2f-5a06-4743-9a43-e74d83992c35',)
>>>>
>>>>
>>>> Again, all the hypervisor nodes will complain about the NFS area for
>>>> the ISO DOMAIN now being gone. Remember, the ovirt manager node held
>>>> this, and its network went out and the node crashed (note: the ovirt
>>>> node (the actual server box) shouldn't crash due to the network
>>>> outage, but it did).
>>>
>>>
>>> I have added VDSM people to this thread to review it. I am assuming
>>> the network changes (during the crash) still leave the storage domain
>>> available to the nodes.
>>
>> Ideally, nothing was lost node-wise (neither LAN nor iSCSI); just the
>> ovirt manager lost its network connection. So, as I mentioned, the only
>> thing storage-wise that was lost was the ISO DOMAIN, which was NFS'd off
>> the ovirt manager.
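>>
>> (If it helps anyone reading along: the mount path in the vdsm.log
>> traceback above decodes to the engine host and export used for that ISO
>> DOMAIN, so a quick, illustrative way to confirm the export is visible
>> again from a node could be something like the following; the export path
>> is my guess from that mount path, and showmount comes from nfs-utils:
>>
>>     # illustrative reachability check, run from a hypervisor node
>>     import subprocess
>>
>>     ENGINE_HOST = "d0lppc129.skopos.me"             # from the vdsm.log mount path
>>     EXPORT = "/var/lib/exports/iso-20160408002844"  # underscores -> slashes, a guess
>>
>>     exports = subprocess.check_output(["showmount", "-e", ENGINE_HOST])
>>     print(exports)
>>     if EXPORT.encode() not in exports:
>>         raise SystemExit("ISO domain export not advertised; engine-side NFS still down")
>>
>> Once that answers again, the nodes should stop logging the
>> collectMetaFiles timeouts for that domain.)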
>>
>>>
>>>>
>>>> So here is the engine collapse as it lost network connectivity
>>>> (before the
>>>> server actually crashed hard).
>>>>
>>>> 2018-01-23 13:45:33,666 ERROR
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-87) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VDSM d0lppn067 command failed:
>>>> Heartbeat
>>>> exeeded
>>>> 2018-01-23 13:45:33,666 ERROR
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-10) [21574461] Correlation ID: null,
>>>> Call
>>>> Stack: null, Custom Event ID: -1, Message: VDSM d0lppn072 command
>>>> failed:
>>>> Heartbeat exeeded
>>>> 2018-01-23 13:45:33,666 ERROR
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-37) [4e8ec41d] Correlation ID: null,
>>>> Call
>>>> Stack: null, Custom Event ID: -1, Message: VDSM d0lppn066 command
>>>> failed:
>>>> Heartbeat exeeded
>>>> 2018-01-23 13:45:33,667 ERROR
>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand]
>>>> (DefaultQuartzScheduler_Worker-87) [] Command
>>>> 'GetStatsVDSCommand(HostName =
>>>> d0lppn067, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true',
>>>> hostId='f99c68c8-b0e8-437b-8cd9-ebaddaaede96',
>>>> vds='Host[d0lppn067,f99c68c8-b0e8-437b-8cd9-ebaddaaede96]'})' execution
>>>> failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>>> 2018-01-23 13:45:33,667 ERROR
>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand]
>>>> (DefaultQuartzScheduler_Worker-10) [21574461] Command
>>>> 'GetStatsVDSCommand(HostName = d0lppn072,
>>>> VdsIdAndVdsVDSCommandParametersBase:{runAsync='true',
>>>> hostId='fdc00296-973d-4268-bd79-6dac535974e0',
>>>> vds='Host[d0lppn072,fdc00296-973d-4268-bd79-6dac535974e0]'})' execution
>>>> failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>>> 2018-01-23 13:45:33,667 ERROR
>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand]
>>>> (DefaultQuartzScheduler_Worker-37) [4e8ec41d] Command
>>>> 'GetStatsVDSCommand(HostName = d0lppn066,
>>>> VdsIdAndVdsVDSCommandParametersBase:{runAsync='true',
>>>> hostId='14abf559-4b62-4ebd-a345-77fa9e1fa3ae',
>>>> vds='Host[d0lppn066,14abf559-4b62-4ebd-a345-77fa9e1fa3ae]'})' execution
>>>> failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>>> 2018-01-23 13:45:33,669 ERROR
>>>> [org.ovirt.engine.core.vdsbroker.HostMonitoring]
>>>> (DefaultQuartzScheduler_Worker-87) [] Failed getting vds stats,
>>>> vds='d0lppn067'(f99c68c8-b0e8-437b-8cd9-ebaddaaede96):
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
>>>> VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>>> 2018-01-23 13:45:33,669 ERROR
>>>> [org.ovirt.engine.core.vdsbroker.HostMonitoring]
>>>> (DefaultQuartzScheduler_Worker-10) [21574461] Failed getting vds
>>>> stats,
>>>> vds='d0lppn072'(fdc00296-973d-4268-bd79-6dac535974e0):
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
>>>> VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>>> 2018-01-23 13:45:33,669 ERROR
>>>> [org.ovirt.engine.core.vdsbroker.HostMonitoring]
>>>> (DefaultQuartzScheduler_Worker-37) [4e8ec41d] Failed getting vds
>>>> stats,
>>>> vds='d0lppn066'(14abf559-4b62-4ebd-a345-77fa9e1fa3ae):
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
>>>> VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>>> 2018-01-23 13:45:33,671 ERROR
>>>> [org.ovirt.engine.core.vdsbroker.HostMonitoring]
>>>> (DefaultQuartzScheduler_Worker-10) [21574461] Failure to refresh Vds
>>>> runtime
>>>> info: VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>>> 2018-01-23 13:45:33,671 ERROR
>>>> [org.ovirt.engine.core.vdsbroker.HostMonitoring]
>>>> (DefaultQuartzScheduler_Worker-37) [4e8ec41d] Failure to refresh Vds
>>>> runtime
>>>> info: VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>>> 2018-01-23 13:45:33,671 ERROR
>>>> [org.ovirt.engine.core.vdsbroker.HostMonitoring]
>>>> (DefaultQuartzScheduler_Worker-87) [] Failure to refresh Vds runtime
>>>> info:
>>>> VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>>> 2018-01-23 13:45:33,671 ERROR
>>>> [org.ovirt.engine.core.vdsbroker.HostMonitoring]
>>>> (DefaultQuartzScheduler_Worker-37) [4e8ec41d] Exception:
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
>>>> VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188)
>>>>
>>>> [vdsbroker.jar:]
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:21)
>>>>
>>>> [vdsbroker.jar:]
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110)
>>>>
>>>> [vdsbroker.jar:]
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65)
>>>>
>>>> [vdsbroker.jar:]
>>>> at
>>>> org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33)
>>>>
>>>> [dal.jar:]
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467)
>>>>
>>>> [vdsbroker.jar:]
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsStats(HostMonitoring.java:472)
>>>>
>>>> [vdsbroker.jar:]
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:114)
>>>>
>>>> [vdsbroker.jar:]
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:84)
>>>>
>>>> [vdsbroker.jar:]
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:227)
>>>> [vdsbroker.jar:]
>>>> at sun.reflect.GeneratedMethodAccessor75.invoke(Unknown
>>>> Source)
>>>> [:1.8.0_102]
>>>> at
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>
>>>> [rt.jar:1.8.0_102]
>>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>>> [rt.jar:1.8.0_102]
>>>> at
>>>> org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81)
>>>>
>>>> [scheduler.jar:]
>>>> at
>>>> org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52)
>>>>
>>>> [scheduler.jar:]
>>>> at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
>>>> [quartz.jar:]
>>>> at
>>>> org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
>>>>
>>>> [quartz.jar:]
>>>>
>>>> 2018-01-23 13:45:33,671 ERROR
>>>> [org.ovirt.engine.core.vdsbroker.HostMonitoring]
>>>> (DefaultQuartzScheduler_Worker-10) [21574461] Exception:
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
>>>> VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188)
>>>>
>>>> [vdsbroker.jar:]
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:21)
>>>>
>>>> [vdsbroker.jar:]
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110)
>>>>
>>>> [vdsbroker.jar:]
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65)
>>>>
>>>> [vdsbroker.jar:]
>>>> at
>>>> org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33)
>>>>
>>>> [dal.jar:]
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467)
>>>>
>>>> [vdsbroker.jar:]
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsStats(HostMonitoring.java:472)
>>>>
>>>> [vdsbroker.jar:]
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:114)
>>>>
>>>> [vdsbroker.jar:]
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:84)
>>>>
>>>> [vdsbroker.jar:]
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:227)
>>>> [vdsbroker.jar:]
>>>> at sun.reflect.GeneratedMethodAccessor75.invoke(Unknown
>>>> Source)
>>>> [:1.8.0_102]
>>>> at
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>
>>>> [rt.jar:1.8.0_102]
>>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>>> [rt.jar:1.8.0_102]
>>>> at
>>>> org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81)
>>>>
>>>> [scheduler.jar:]
>>>> at
>>>> org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52)
>>>>
>>>> [scheduler.jar:]
>>>> at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
>>>> [quartz.jar:]
>>>> at
>>>> org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
>>>>
>>>> [quartz.jar:]
>>>>
>>>> 2018-01-23 13:45:33,671 ERROR
>>>> [org.ovirt.engine.core.vdsbroker.HostMonitoring]
>>>> (DefaultQuartzScheduler_Worker-87) [] Exception:
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
>>>> VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188)
>>>>
>>>> [vdsbroker.jar:]
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:21)
>>>>
>>>> [vdsbroker.jar:]
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110)
>>>>
>>>> [vdsbroker.jar:]
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65)
>>>>
>>>> [vdsbroker.jar:]
>>>> at
>>>> org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33)
>>>>
>>>> [dal.jar:]
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467)
>>>>
>>>> [vdsbroker.jar:]
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsStats(HostMonitoring.java:472)
>>>>
>>>> [vdsbroker.jar:]
>>>>
>>>>
>>>>
>>>>
>>>> Here are the engine logs showing the problem with node d0lppn065; the
>>>> VMs first go to "Unknown", then to "Unknown" plus "not responding":
>>>>
>>>> 2018-01-23 14:48:00,712 ERROR
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (org.ovirt.thread.pool-8-thread-28) [] Correlation ID: null, Call
>>>> Stack:
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
>>>> org.ovirt.vdsm.jsonrpc.client.ClientConnection
>>>> Exception: Connection failed
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.createNetworkException(VdsBrokerCommand.java:157)
>>>>
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:120)
>>>>
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65)
>>>>
>>>> at
>>>> org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33)
>>>>
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467)
>>>>
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.VmsStatisticsFetcher.fetch(VmsStatisticsFetcher.java:27)
>>>>
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.PollVmStatsRefresher.poll(PollVmStatsRefresher.java:35)
>>>>
>>>> at sun.reflect.GeneratedMethodAccessor80.invoke(Unknown
>>>> Source)
>>>> at
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>
>>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>>> at
>>>> org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81)
>>>>
>>>> at
>>>> org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52)
>>>>
>>>> at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
>>>> at
>>>> org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
>>>>
>>>> Caused by: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException:
>>>> Connection failed
>>>> at
>>>> org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient.connect(ReactorClient.java:155)
>>>>
>>>> at
>>>> org.ovirt.vdsm.jsonrpc.client.JsonRpcClient.getClient(JsonRpcClient.java:134)
>>>>
>>>> at
>>>> org.ovirt.vdsm.jsonrpc.client.JsonRpcClient.call(JsonRpcClient.java:81)
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.jsonrpc.FutureMap.<init>(FutureMap.java:70)
>>>>
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.jsonrpc.JsonRpcVdsServer.getAllVmStats(JsonRpcVdsServer.java:331)
>>>>
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand.executeVdsBrokerCommand(GetAllVmStatsVDSCommand.java:20)
>>>>
>>>> at
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110)
>>>>
>>>> ... 12 more
>>>> , Custom Event ID: -1, Message: Host d0lppn065 is non responsive.
>>>> 2018-01-23 14:48:00,713 INFO
>>>> [org.ovirt.engine.core.bll.VdsEventListener]
>>>> (org.ovirt.thread.pool-8-thread-1) [] ResourceManager::vdsNotResponding
>>>> entered for Host '2797cae7-6886-4898-a5e4-23361ce03a90', '10.32.0.65'
>>>> 2018-01-23 14:48:00,713 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (org.ovirt.thread.pool-8-thread-36) [] Correlation ID: null, Call
>>>> Stack:
>>>> null, Custom Event ID: -1, Message: VM vtop3 was set to the Unknown
>>>> status.
>>>>
>>>> ...etc... (sorry about the wraps below)
>>>>
>>>> 2018-01-23 14:59:07,817 INFO
>>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>>> '30f7af86-c2b9-41c3-b2c5-49f5bbdd0e27'(d0lpvd070) moved from 'Up' -->
>>>> 'NotResponding'
>>>> 2018-01-23 14:59:07,819 INFO
>>>> [org.ovirt.engine.core.vdsbroker.VmsStatisticsFetcher]
>>>> (DefaultQuartzScheduler_Worker-74) [] Fetched 15 VMs from VDS
>>>> '8cb119c5-b7f0-48a3-970a-205d96b2e940'
>>>> 2018-01-23 14:59:07,936 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM d0lpvd070 is not responding.
>>>> 2018-01-23 14:59:07,939 INFO
>>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>>> 'ebc5bb82-b985-451b-8313-827b5f40eaf3'(d0lpvd039) moved from 'Up' -->
>>>> 'NotResponding'
>>>> 2018-01-23 14:59:08,032 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM d0lpvd039 is not responding.
>>>> 2018-01-23 14:59:08,038 INFO
>>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>>> '494c4f9e-1616-476a-8f66-a26a96b76e56'(vtop3) moved from 'Up' -->
>>>> 'NotResponding'
>>>> 2018-01-23 14:59:08,134 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM vtop3 is not responding.
>>>> 2018-01-23 14:59:08,136 INFO
>>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>>> 'eaeaf73c-d9e2-426e-a2f2-7fcf085137b0'(d0lpvw059) moved from 'Up' -->
>>>> 'NotResponding'
>>>> 2018-01-23 14:59:08,237 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM d0lpvw059 is not responding.
>>>> 2018-01-23 14:59:08,239 INFO
>>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>>> '8308a547-37a1-4163-8170-f89b6dc85ba8'(d0lpvm058) moved from 'Up' -->
>>>> 'NotResponding'
>>>> 2018-01-23 14:59:08,326 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM d0lpvm058 is not responding.
>>>> 2018-01-23 14:59:08,328 INFO
>>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>>> '3d544926-3326-44e1-8b2a-ec632f51112a'(d0lqva056) moved from 'Up' -->
>>>> 'NotResponding'
>>>> 2018-01-23 14:59:08,400 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM d0lqva056 is not responding.
>>>> 2018-01-23 14:59:08,402 INFO
>>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>>> '989e5a17-789d-4eba-8a5e-f74846128842'(d0lpva078) moved from 'Up' -->
>>>> 'NotResponding'
>>>> 2018-01-23 14:59:08,472 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM d0lpva078 is not responding.
>>>> 2018-01-23 14:59:08,474 INFO
>>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>>> '050a71c1-9e65-43c6-bdb2-18eba571e2eb'(d0lpvw077) moved from 'Up' -->
>>>> 'NotResponding'
>>>> 2018-01-23 14:59:08,545 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM d0lpvw077 is not responding.
>>>> 2018-01-23 14:59:08,547 INFO
>>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>>> 'c3b497fd-6181-4dd1-9acf-8e32f981f769'(d0lpva079) moved from 'Up' -->
>>>> 'NotResponding'
>>>> 2018-01-23 14:59:08,621 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM d0lpva079 is not responding.
>>>> 2018-01-23 14:59:08,623 INFO
>>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>>> '7cd22b39-feb1-4c6e-8643-ac8fb0578842'(d0lqva034) moved from 'Up' -->
>>>> 'NotResponding'
>>>> 2018-01-23 14:59:08,690 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM d0lqva034 is not responding.
>>>> 2018-01-23 14:59:08,692 INFO
>>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>>> '2ab9b1d8-d1e8-4071-a47c-294e586d2fb6'(d0lpvd038) moved from 'Up' -->
>>>> 'NotResponding'
>>>> 2018-01-23 14:59:08,763 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM d0lpvd038 is not responding.
>>>> 2018-01-23 14:59:08,768 INFO
>>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>>> 'ecb4e795-9eeb-4cdc-a356-c1b9b32af5aa'(d0lqva031) moved from 'Up' -->
>>>> 'NotResponding'
>>>> 2018-01-23 14:59:08,836 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM d0lqva031 is not responding.
>>>> 2018-01-23 14:59:08,838 INFO
>>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>>> '1a361727-1607-43d9-bd22-34d45b386d3e'(d0lqva033) moved from 'Up' -->
>>>> 'NotResponding'
>>>> 2018-01-23 14:59:08,911 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM d0lqva033 is not responding.
>>>> 2018-01-23 14:59:08,913 INFO
>>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>>> '0cd65f90-719e-429e-a845-f425612d7b14'(vtop4) moved from 'Up' -->
>>>> 'NotResponding'
>>>> 2018-01-23 14:59:08,984 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM vtop4 is not responding.
>>>>
>>>>>
>>>>> Probably it's time to think about upgrading your environment from 3.6.
>>>>
>>>>
>>>> I know. But from a production standpoint, mid-2016 wasn't that long
>>>> ago. And 4 was just coming out of beta at the time.
>>>>
>>>> We were upgrading from 3.4 to 3.6. And it took a long time (again,
>>>> because it's all "live"). Trust me, the move to 4.0 was discussed; it
>>>> was just a timing thing.
>>>>
>>>> With that said, I do "hear you"... and certainly it's being discussed.
>>>> We just don't see a "good" migration path... we see a slow path (moving
>>>> nodes out, upgrading, etc.), knowing that, as with all things, nobody
>>>> can guarantee "success", which would be a very bad thing. So going from
>>>> a working 3.6 to a totally (potentially) broken 4.2 isn't going to
>>>> impress anyone here, you know? If all goes according to our best
>>>> guesses, then great, but when things go bad, and the chance is not
>>>> insignificant, well... I'm just not quite prepared with my résumé, if
>>>> you know what I mean.
>>>>
>>>> Don't get me wrong, our move from 3.4 to 3.6 had some similar risks,
>>>> but we also migrated to a whole new infrastructure, a luxury we will
>>>> not have this time. And somehow 3.4 to 3.6 doesn't sound as risky as
>>>> 3.6 to 4.2.
>>>
>>> I see your concern. However, keeping your system updated with recent
>>> software is something I would recommend. You could set up a parallel
>>> 4.2 env and move the VMs slowly from 3.6.
>>
>> Understood. But would people want software that changes so quickly?
>> This isn't like moving from RH 7.2 to 7.3 in a matter of months; it's
>> more like moving from major release to major release in a matter of
>> months, and doing it again potentially in a matter of months. Granted,
>> we're running oVirt and not RHV, so maybe we should be on the
>> Fedora-style upgrade plan. It's just not conducive to an enterprise
>> environment (oVirt people, stop laughing).
>>
>>>
>>>>
>>>> Is there a path from oVirt to RHEV? Every bit of help we get helps us
>>>> in making that decision as well, which I think would be a very good
>>>> thing for both of us. (I inherited all this oVirt, and I was the "guy"
>>>> doing the 3.4 to 3.6 move with the all-new infrastructure.)
>>>
>>> Yes, you can import your setup to RHEV.
>>
>> Good to know. Because of the fragility (support-wise... I mean, our
>> oVirt has been rock solid, apart from rare glitches like this), we may
>> follow this path.