[ovirt-users] ovirt 3.6, we had the ovirt manager go down in a bad way and all VMs for one node marked Unknown and Not Responding while up

Christopher Cox ccox at endlessnow.com
Mon Feb 5 20:53:36 UTC 2018


Answering my own post... a restart of vdsmd on the affected blade has 
fixed everything.  Thanks to everyone who helped.
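
For the archives, the "fix" amounted to something like the sketch below
(a minimal sketch, assuming EL7 nodes where vdsm runs as the "vdsmd"
systemd unit; treat it as illustrative, not authoritative).  Restarting
vdsmd does not touch the running qemu guests, which is why the VMs
themselves were never interrupted:

#!/usr/bin/env python
# Rough sketch of the recovery step; run on the affected node, not the
# engine.  Assumes an EL7 host where vdsm is the "vdsmd" systemd unit.
import subprocess
import sys
import time

def run(cmd):
    print("+ " + " ".join(cmd))
    return subprocess.call(cmd)

if run(["systemctl", "restart", "vdsmd"]) != 0:
    sys.exit("vdsmd restart failed")

time.sleep(10)  # give vdsm a moment to come back and reconnect to the engine

if run(["systemctl", "is-active", "vdsmd"]) != 0:
    sys.exit("vdsmd did not come back up")

print("vdsmd restarted; watch the engine for the Unknown/Not responding flags to clear")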


On 02/05/2018 10:02 AM, Christopher Cox wrote:
> Forgive the top post.  I guess what I need to know now is whether there 
> is a recovery path that doesn't lead to total loss of the VMs that are 
> currently in the "Unknown"/"Not responding" state.
> 
> We are planning a total oVirt shutdown.  I'd just like to know whether 
> we've effectively lost those VMs or not.  Again, the VMs are currently 
> "up".  And we use a file backup process, so in theory they can be 
> restored from scratch, just somewhat painfully.
> 
> But if somebody knows: if we shut down all the bad VMs and the blade, is 
> there some way oVirt can know the VMs are "ok" to start up?  Will 
> changing their state directly to "down" in the db stick if the blade is 
> down?  That is, will we get to a point where the VMs can actually be 
> started and brought back into a known state?
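>
> (For reference, the direct update I tried is quoted further down -- the 
> engine just snaps the status back to 8 while the host is in this state. 
> A minimal sketch of what I mean follows; it assumes VM state lives in 
> vm_dynamic.status with 0 = Down and 8 = NotResponding, and the db name 
> and credentials are placeholders.  I know this isn't supported.)
>
> # Illustrative only -- editing the engine db by hand is not supported,
> # and the engine's monitoring rewrites the row anyway.
> import psycopg2
>
> conn = psycopg2.connect(dbname="engine", user="engine",
>                         host="localhost", password="CHANGEME")
> cur = conn.cursor()
> cur.execute("SELECT vm_guid, status FROM vm_dynamic WHERE status = 8")
> for vm_guid, status in cur.fetchall():
>     print("%s status=%s" % (vm_guid, status))
>
> # The flip that "works for a second" before it goes back to 8:
> # cur.execute("UPDATE vm_dynamic SET status = 0 WHERE vm_guid = %s",
> #             (vm_guid,))
> # conn.commit()
> conn.close()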
> 
> Right now we feel there's a good chance we will not be able to recover 
> these VMs, even though they are currently "up".  I really need some way 
> to force oVirt into a consistent state, even if it means we take the 
> whole thing down.
> 
> Possible?
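>
> (If it helps: a read-only way for us to at least see what state the 
> engine thinks every VM is in, without touching the db.  A rough sketch 
> against the REST API; the URL, credentials and disabled cert check are 
> placeholders for our setup, and the 3.6-era (v3) API serves XML at 
> /ovirt-engine/api.)
>
> import requests
> import xml.etree.ElementTree as ET
>
> ENGINE = "https://d0lppc129.skopos.me/ovirt-engine/api"  # placeholder
> AUTH = ("admin@internal", "CHANGEME")                    # placeholder
>
> resp = requests.get(ENGINE + "/vms", auth=AUTH, verify=False)
> resp.raise_for_status()
>
> # Each <vm> carries its current state under <status><state>
> for vm in ET.fromstring(resp.content).findall("vm"):
>     print("%-30s %s" % (vm.findtext("name"), vm.findtext("status/state")))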
> 
> 
> On 01/25/2018 06:57 PM, Christopher Cox wrote:
>>
>>
>> On 01/25/2018 04:57 PM, Douglas Landgraf wrote:
>>> On Thu, Jan 25, 2018 at 5:12 PM, Christopher Cox 
>>> <ccox at endlessnow.com> wrote:
>>>> On 01/25/2018 02:25 PM, Douglas Landgraf wrote:
>>>>>
>>>>> On Wed, Jan 24, 2018 at 10:18 AM, Christopher Cox 
>>>>> <ccox at endlessnow.com>
>>>>> wrote:
>>>>>>
>>>>>> Would restarting vdsm on the node in question help fix this?
>>>>>> Again, all the VMs are up on the node.  Prior attempts to fix this
>>>>>> problem have left the node in a state where I can't issue the "has
>>>>>> been rebooted" command to it; it's confused.
>>>>>>
>>>>>> So... node is up.  All VMs are up.  Can't issue "has been 
>>>>>> rebooted" to
>>>>>> the
>>>>>> node, all VMs show Unknown and not responding but they are up.
>>>>>>
>>>>>> Changing the status in the ovirt db to 0 works for a second and
>>>>>> then it goes immediately back to 8 (which is why I'm wondering if I
>>>>>> should restart vdsm on the node).
>>>>>
>>>>>
>>>>> It's not recommended to change the db manually.
>>>>>
>>>>>>
>>>>>> Oddly enough, we're running all of this in production.  So, 
>>>>>> watching it
>>>>>> all
>>>>>> go down isn't the best option for us.
>>>>>>
>>>>>> Any advice is welcome.
>>>>>
>>>>>
>>>>>
>>>>> We would need to see the node/engine logs; have you found any error
>>>>> in the vdsm.log (from the nodes) or engine.log?  Could you please
>>>>> share the error?
>>>>
>>>>
>>>>
>>>> In short, the "error" is that our ovirt manager lost network (our
>>>> problem) and crashed hard (hardware issue on the server).  On bring-up
>>>> we had some network changes (which caused the lost-network problem), so
>>>> our LACP bond was down for a bit while we were trying to bring it up
>>>> (noting the ovirt manager is up while we're reestablishing the network
>>>> on the switch side).
>>>>
>>>> In other words, that's the "error", so to speak, that got us to where
>>>> we are.
>>>>
>>>> Full DEBUG is enabled on the logs... The error messages seem obvious
>>>> to me.  It starts like this (noting the ISO DOMAIN was coming off an
>>>> NFS mount off the ovirt management server... yes... we know... we do
>>>> have plans to move that).
>>>>
>>>> So on the hypervisor node itself, from the vdsm.log (vdsm.log.33.xz):
>>>>
>>>> (hopefully no surprise here)
>>>>
>>>> Thread-2426633::WARNING::2018-01-23
>>>> 13:50:56,672::fileSD::749::Storage.scanDomains::(collectMetaFiles) 
>>>> Could not
>>>> collect metadata file for domain path
>>>> /rhev/data-center/mnt/d0lppc129.skopos.me:_var_lib_exports_iso-20160408002844 
>>>>
>>>> Traceback (most recent call last):
>>>>    File "/usr/share/vdsm/storage/fileSD.py", line 735, in 
>>>> collectMetaFiles
>>>>      sd.DOMAIN_META_DATA))
>>>>    File "/usr/share/vdsm/storage/outOfProcess.py", line 121, in glob
>>>>      return self._iop.glob(pattern)
>>>>    File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", 
>>>> line 536,
>>>> in glob
>>>>      return self._sendCommand("glob", {"pattern": pattern}, 
>>>> self.timeout)
>>>>    File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", 
>>>> line 421,
>>>> in _sendCommand
>>>>      raise Timeout(os.strerror(errno.ETIMEDOUT))
>>>> Timeout: Connection timed out
>>>> Thread-27::ERROR::2018-01-23
>>>> 13:50:56,672::sdc::145::Storage.StorageDomainCache::(_findDomain) 
>>>> domain
>>>> e5ecae2f-5a06-4743-9a43-e74d83992c35 not found
>>>> Traceback (most recent call last):
>>>>    File "/usr/share/vdsm/storage/sdc.py", line 143, in _findDomain
>>>>      dom = findMethod(sdUUID)
>>>>    File "/usr/share/vdsm/storage/nfsSD.py", line 122, in findDomain
>>>>      return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
>>>>    File "/usr/share/vdsm/storage/nfsSD.py", line 112, in findDomainPath
>>>>      raise se.StorageDomainDoesNotExist(sdUUID)
>>>> StorageDomainDoesNotExist: Storage domain does not exist:
>>>> (u'e5ecae2f-5a06-4743-9a43-e74d83992c35',)
>>>> Thread-27::ERROR::2018-01-23
>>>> 13:50:56,673::monitor::276::Storage.Monitor::(_monitorDomain) Error
>>>> monitoring domain e5ecae2f-5a06-4743-9a43-e74d83992c35
>>>> Traceback (most recent call last):
>>>>    File "/usr/share/vdsm/storage/monitor.py", line 272, in 
>>>> _monitorDomain
>>>>      self._performDomainSelftest()
>>>>    File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 769, in
>>>> wrapper
>>>>      value = meth(self, *a, **kw)
>>>>    File "/usr/share/vdsm/storage/monitor.py", line 339, in
>>>> _performDomainSelftest
>>>>      self.domain.selftest()
>>>>    File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
>>>>      return getattr(self.getRealDomain(), attrName)
>>>>    File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
>>>>      return self._cache._realProduce(self._sdUUID)
>>>>    File "/usr/share/vdsm/storage/sdc.py", line 124, in _realProduce
>>>>      domain = self._findDomain(sdUUID)
>>>>    File "/usr/share/vdsm/storage/sdc.py", line 143, in _findDomain
>>>>      dom = findMethod(sdUUID)
>>>>    File "/usr/share/vdsm/storage/nfsSD.py", line 122, in findDomain
>>>>      return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
>>>>    File "/usr/share/vdsm/storage/nfsSD.py", line 112, in findDomainPath
>>>>      raise se.StorageDomainDoesNotExist(sdUUID)
>>>> StorageDomainDoesNotExist: Storage domain does not exist:
>>>> (u'e5ecae2f-5a06-4743-9a43-e74d83992c35',)
>>>>
>>>>
>>>> Again, all the hypervisor nodes will complain that the NFS area for
>>>> the ISO DOMAIN is now gone.  Remember, the ovirt manager node held
>>>> this, and its network went out and the node crashed (note: the ovirt
>>>> manager node (the actual server box) shouldn't have crashed due to the
>>>> network outage, but it did).
>>>
>>>
>>> I have added VDSM people to this thread to review it. I am assuming
>>> the network changes (during the crash) still leave the storage domain
>>> available to the nodes.
>>
>> Ideally, nothing was lost node-wise (neither LAN nor iSCSI); just the 
>> ovirt manager lost its network connection.  So storage-wise, as I 
>> mentioned, the only thing lost was the ISO DOMAIN, which was NFS'd off 
>> the ovirt manager.
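>>
>> (For what it's worth, below is the sort of quick check we run from a 
>> node to see whether that ISO domain export is reachable, without 
>> letting a dead NFS mount hang the check itself.  A sketch only; the 
>> host and mount path are the ones from the vdsm log quoted above.)
>>
>> import subprocess
>> import time
>>
>> EXPORT_HOST = "d0lppc129.skopos.me"
>> MOUNT = ("/rhev/data-center/mnt/"
>>          "d0lppc129.skopos.me:_var_lib_exports_iso-20160408002844")
>>
>> # 1) Is the server exporting anything at all? (needs nfs-utils)
>> subprocess.call(["showmount", "-e", EXPORT_HOST])
>>
>> # 2) Stat the mount from a child process, so if the mount is dead only
>> #    the child gets stuck and this check can still move on.
>> child = subprocess.Popen(["stat", "-f", MOUNT])
>> for _ in range(10):
>>     if child.poll() is not None:
>>         break
>>     time.sleep(1)
>>
>> if child.returncode == 0:
>>     print("ISO domain mount looks responsive")
>> else:
>>     print("ISO domain mount did not answer within 10s (or stat failed)")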
>>
>>>
>>>>
>>>> So here is the engine collapse as it lost network connectivity 
>>>> (before the
>>>> server actually crashed hard).
>>>>
>>>> 2018-01-23 13:45:33,666 ERROR
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-87) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VDSM d0lppn067 command failed: 
>>>> Heartbeat
>>>> exeeded
>>>> 2018-01-23 13:45:33,666 ERROR
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-10) [21574461] Correlation ID: null, 
>>>> Call
>>>> Stack: null, Custom Event ID: -1, Message: VDSM d0lppn072 command 
>>>> failed:
>>>> Heartbeat exeeded
>>>> 2018-01-23 13:45:33,666 ERROR
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-37) [4e8ec41d] Correlation ID: null, 
>>>> Call
>>>> Stack: null, Custom Event ID: -1, Message: VDSM d0lppn066 command 
>>>> failed:
>>>> Heartbeat exeeded
>>>> 2018-01-23 13:45:33,667 ERROR
>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand]
>>>> (DefaultQuartzScheduler_Worker-87) [] Command 
>>>> 'GetStatsVDSCommand(HostName =
>>>> d0lppn067, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true',
>>>> hostId='f99c68c8-b0e8-437b-8cd9-ebaddaaede96',
>>>> vds='Host[d0lppn067,f99c68c8-b0e8-437b-8cd9-ebaddaaede96]'})' execution
>>>> failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>>> 2018-01-23 13:45:33,667 ERROR
>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand]
>>>> (DefaultQuartzScheduler_Worker-10) [21574461] Command
>>>> 'GetStatsVDSCommand(HostName = d0lppn072,
>>>> VdsIdAndVdsVDSCommandParametersBase:{runAsync='true',
>>>> hostId='fdc00296-973d-4268-bd79-6dac535974e0',
>>>> vds='Host[d0lppn072,fdc00296-973d-4268-bd79-6dac535974e0]'})' execution
>>>> failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>>> 2018-01-23 13:45:33,667 ERROR
>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand]
>>>> (DefaultQuartzScheduler_Worker-37) [4e8ec41d] Command
>>>> 'GetStatsVDSCommand(HostName = d0lppn066,
>>>> VdsIdAndVdsVDSCommandParametersBase:{runAsync='true',
>>>> hostId='14abf559-4b62-4ebd-a345-77fa9e1fa3ae',
>>>> vds='Host[d0lppn066,14abf559-4b62-4ebd-a345-77fa9e1fa3ae]'})' execution
>>>> failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>>> 2018-01-23 13:45:33,669 ERROR
>>>> [org.ovirt.engine.core.vdsbroker.HostMonitoring]
>>>> (DefaultQuartzScheduler_Worker-87) []  Failed getting vds stats,
>>>> vds='d0lppn067'(f99c68c8-b0e8-437b-8cd9-ebaddaaede96):
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
>>>> VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>>> 2018-01-23 13:45:33,669 ERROR
>>>> [org.ovirt.engine.core.vdsbroker.HostMonitoring]
>>>> (DefaultQuartzScheduler_Worker-10) [21574461]  Failed getting vds 
>>>> stats,
>>>> vds='d0lppn072'(fdc00296-973d-4268-bd79-6dac535974e0):
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
>>>> VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>>> 2018-01-23 13:45:33,669 ERROR
>>>> [org.ovirt.engine.core.vdsbroker.HostMonitoring]
>>>> (DefaultQuartzScheduler_Worker-37) [4e8ec41d]  Failed getting vds 
>>>> stats,
>>>> vds='d0lppn066'(14abf559-4b62-4ebd-a345-77fa9e1fa3ae):
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
>>>> VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>>> 2018-01-23 13:45:33,671 ERROR
>>>> [org.ovirt.engine.core.vdsbroker.HostMonitoring]
>>>> (DefaultQuartzScheduler_Worker-10) [21574461] Failure to refresh Vds 
>>>> runtime
>>>> info: VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>>> 2018-01-23 13:45:33,671 ERROR
>>>> [org.ovirt.engine.core.vdsbroker.HostMonitoring]
>>>> (DefaultQuartzScheduler_Worker-37) [4e8ec41d] Failure to refresh Vds 
>>>> runtime
>>>> info: VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>>> 2018-01-23 13:45:33,671 ERROR
>>>> [org.ovirt.engine.core.vdsbroker.HostMonitoring]
>>>> (DefaultQuartzScheduler_Worker-87) [] Failure to refresh Vds runtime 
>>>> info:
>>>> VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>>> 2018-01-23 13:45:33,671 ERROR
>>>> [org.ovirt.engine.core.vdsbroker.HostMonitoring]
>>>> (DefaultQuartzScheduler_Worker-37) [4e8ec41d] Exception:
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
>>>> VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) 
>>>>
>>>> [vdsbroker.jar:]
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:21) 
>>>>
>>>> [vdsbroker.jar:]
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110) 
>>>>
>>>> [vdsbroker.jar:]
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65) 
>>>>
>>>> [vdsbroker.jar:]
>>>>          at
>>>> org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) 
>>>>
>>>> [dal.jar:]
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) 
>>>>
>>>> [vdsbroker.jar:]
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsStats(HostMonitoring.java:472) 
>>>>
>>>> [vdsbroker.jar:]
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:114) 
>>>>
>>>> [vdsbroker.jar:]
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:84) 
>>>>
>>>> [vdsbroker.jar:]
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:227)
>>>> [vdsbroker.jar:]
>>>>          at sun.reflect.GeneratedMethodAccessor75.invoke(Unknown 
>>>> Source)
>>>> [:1.8.0_102]
>>>>          at
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
>>>>
>>>> [rt.jar:1.8.0_102]
>>>>          at java.lang.reflect.Method.invoke(Method.java:498)
>>>> [rt.jar:1.8.0_102]
>>>>          at
>>>> org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) 
>>>>
>>>> [scheduler.jar:]
>>>>          at
>>>> org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) 
>>>>
>>>> [scheduler.jar:]
>>>>          at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
>>>> [quartz.jar:]
>>>>          at
>>>> org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) 
>>>>
>>>> [quartz.jar:]
>>>>
>>>> 2018-01-23 13:45:33,671 ERROR
>>>> [org.ovirt.engine.core.vdsbroker.HostMonitoring]
>>>> (DefaultQuartzScheduler_Worker-10) [21574461] Exception:
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
>>>> VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) 
>>>>
>>>> [vdsbroker.jar:]
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:21) 
>>>>
>>>> [vdsbroker.jar:]
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110) 
>>>>
>>>> [vdsbroker.jar:]
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65) 
>>>>
>>>> [vdsbroker.jar:]
>>>>          at
>>>> org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) 
>>>>
>>>> [dal.jar:]
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) 
>>>>
>>>> [vdsbroker.jar:]
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsStats(HostMonitoring.java:472) 
>>>>
>>>> [vdsbroker.jar:]
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:114) 
>>>>
>>>> [vdsbroker.jar:]
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:84) 
>>>>
>>>> [vdsbroker.jar:]
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:227)
>>>> [vdsbroker.jar:]
>>>>          at sun.reflect.GeneratedMethodAccessor75.invoke(Unknown 
>>>> Source)
>>>> [:1.8.0_102]
>>>>          at
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
>>>>
>>>> [rt.jar:1.8.0_102]
>>>>          at java.lang.reflect.Method.invoke(Method.java:498)
>>>> [rt.jar:1.8.0_102]
>>>>          at
>>>> org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) 
>>>>
>>>> [scheduler.jar:]
>>>>          at
>>>> org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) 
>>>>
>>>> [scheduler.jar:]
>>>>          at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
>>>> [quartz.jar:]
>>>>          at
>>>> org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) 
>>>>
>>>> [quartz.jar:]
>>>>
>>>> 2018-01-23 13:45:33,671 ERROR
>>>> [org.ovirt.engine.core.vdsbroker.HostMonitoring]
>>>> (DefaultQuartzScheduler_Worker-87) [] Exception:
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
>>>> VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) 
>>>>
>>>> [vdsbroker.jar:]
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:21) 
>>>>
>>>> [vdsbroker.jar:]
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110) 
>>>>
>>>> [vdsbroker.jar:]
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65) 
>>>>
>>>> [vdsbroker.jar:]
>>>>          at
>>>> org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) 
>>>>
>>>> [dal.jar:]
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) 
>>>>
>>>> [vdsbroker.jar:]
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsStats(HostMonitoring.java:472) 
>>>>
>>>> [vdsbroker.jar:]
>>>>
>>>>
>>>>
>>>>
>>>> Here are the engine logs showing the problem with node d0lppn065; the
>>>> VMs first go to "Unknown", then to "Unknown" plus "not responding":
>>>>
>>>> 2018-01-23 14:48:00,712 ERROR
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (org.ovirt.thread.pool-8-thread-28) [] Correlation ID: null, Call 
>>>> Stack:
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
>>>> org.ovirt.vdsm.jsonrpc.client.ClientConnection
>>>> Exception: Connection failed
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.createNetworkException(VdsBrokerCommand.java:157) 
>>>>
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:120) 
>>>>
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65) 
>>>>
>>>>          at
>>>> org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) 
>>>>
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) 
>>>>
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.VmsStatisticsFetcher.fetch(VmsStatisticsFetcher.java:27) 
>>>>
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.PollVmStatsRefresher.poll(PollVmStatsRefresher.java:35) 
>>>>
>>>>          at sun.reflect.GeneratedMethodAccessor80.invoke(Unknown 
>>>> Source)
>>>>          at
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
>>>>
>>>>          at java.lang.reflect.Method.invoke(Method.java:498)
>>>>          at
>>>> org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) 
>>>>
>>>>          at
>>>> org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) 
>>>>
>>>>          at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
>>>>          at
>>>> org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) 
>>>>
>>>> Caused by: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException:
>>>> Connection failed
>>>>          at
>>>> org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient.connect(ReactorClient.java:155) 
>>>>
>>>>          at
>>>> org.ovirt.vdsm.jsonrpc.client.JsonRpcClient.getClient(JsonRpcClient.java:134) 
>>>>
>>>>          at
>>>> org.ovirt.vdsm.jsonrpc.client.JsonRpcClient.call(JsonRpcClient.java:81)
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.jsonrpc.FutureMap.<init>(FutureMap.java:70) 
>>>>
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.jsonrpc.JsonRpcVdsServer.getAllVmStats(JsonRpcVdsServer.java:331) 
>>>>
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand.executeVdsBrokerCommand(GetAllVmStatsVDSCommand.java:20) 
>>>>
>>>>          at
>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110) 
>>>>
>>>>          ... 12 more
>>>> , Custom Event ID: -1, Message: Host d0lppn065 is non responsive.
>>>> 2018-01-23 14:48:00,713 INFO 
>>>> [org.ovirt.engine.core.bll.VdsEventListener]
>>>> (org.ovirt.thread.pool-8-thread-1) [] ResourceManager::vdsNotResponding
>>>> entered for Host '2797cae7-6886-4898-a5e4-23361ce03a90', '10.32.0.65'
>>>> 2018-01-23 14:48:00,713 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (org.ovirt.thread.pool-8-thread-36) [] Correlation ID: null, Call 
>>>> Stack:
>>>> null, Custom Event ID: -1, Message: VM vtop3 was set to the Unknown 
>>>> status.
>>>>
>>>> ...etc... (sorry about the wraps below)
>>>>
>>>> 2018-01-23 14:59:07,817 INFO 
>>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>>> '30f7af86-c2b9-41c3-b2c5-49f5bbdd0e27'(d0lpvd070) moved from 'Up' -->
>>>> 'NotResponding'
>>>> 2018-01-23 14:59:07,819 INFO
>>>> [org.ovirt.engine.core.vdsbroker.VmsStatisticsFetcher]
>>>> (DefaultQuartzScheduler_Worker-74) [] Fetched 15 VMs from VDS
>>>> '8cb119c5-b7f0-48a3-970a-205d96b2e940'
>>>> 2018-01-23 14:59:07,936 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM d0lpvd070 is not responding.
>>>> 2018-01-23 14:59:07,939 INFO 
>>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>>> 'ebc5bb82-b985-451b-8313-827b5f40eaf3'(d0lpvd039) moved from 'Up' -->
>>>> 'NotResponding'
>>>> 2018-01-23 14:59:08,032 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM d0lpvd039 is not responding.
>>>> 2018-01-23 14:59:08,038 INFO 
>>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>>> '494c4f9e-1616-476a-8f66-a26a96b76e56'(vtop3) moved from 'Up' -->
>>>> 'NotResponding'
>>>> 2018-01-23 14:59:08,134 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM vtop3 is not responding.
>>>> 2018-01-23 14:59:08,136 INFO 
>>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>>> 'eaeaf73c-d9e2-426e-a2f2-7fcf085137b0'(d0lpvw059) moved from 'Up' -->
>>>> 'NotResponding'
>>>> 2018-01-23 14:59:08,237 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM d0lpvw059 is not responding.
>>>> 2018-01-23 14:59:08,239 INFO 
>>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>>> '8308a547-37a1-4163-8170-f89b6dc85ba8'(d0lpvm058) moved from 'Up' -->
>>>> 'NotResponding'
>>>> 2018-01-23 14:59:08,326 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM d0lpvm058 is not responding.
>>>> 2018-01-23 14:59:08,328 INFO 
>>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>>> '3d544926-3326-44e1-8b2a-ec632f51112a'(d0lqva056) moved from 'Up' -->
>>>> 'NotResponding'
>>>> 2018-01-23 14:59:08,400 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM d0lqva056 is not responding.
>>>> 2018-01-23 14:59:08,402 INFO 
>>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>>> '989e5a17-789d-4eba-8a5e-f74846128842'(d0lpva078) moved from 'Up' -->
>>>> 'NotResponding'
>>>> 2018-01-23 14:59:08,472 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM d0lpva078 is not responding.
>>>> 2018-01-23 14:59:08,474 INFO 
>>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>>> '050a71c1-9e65-43c6-bdb2-18eba571e2eb'(d0lpvw077) moved from 'Up' -->
>>>> 'NotResponding'
>>>> 2018-01-23 14:59:08,545 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM d0lpvw077 is not responding.
>>>> 2018-01-23 14:59:08,547 INFO 
>>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>>> 'c3b497fd-6181-4dd1-9acf-8e32f981f769'(d0lpva079) moved from 'Up' -->
>>>> 'NotResponding'
>>>> 2018-01-23 14:59:08,621 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM d0lpva079 is not responding.
>>>> 2018-01-23 14:59:08,623 INFO 
>>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>>> '7cd22b39-feb1-4c6e-8643-ac8fb0578842'(d0lqva034) moved from 'Up' -->
>>>> 'NotResponding'
>>>> 2018-01-23 14:59:08,690 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM d0lqva034 is not responding.
>>>> 2018-01-23 14:59:08,692 INFO 
>>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>>> '2ab9b1d8-d1e8-4071-a47c-294e586d2fb6'(d0lpvd038) moved from 'Up' -->
>>>> 'NotResponding'
>>>> 2018-01-23 14:59:08,763 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM d0lpvd038 is not responding.
>>>> 2018-01-23 14:59:08,768 INFO 
>>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>>> 'ecb4e795-9eeb-4cdc-a356-c1b9b32af5aa'(d0lqva031) moved from 'Up' -->
>>>> 'NotResponding'
>>>> 2018-01-23 14:59:08,836 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM d0lqva031 is not responding.
>>>> 2018-01-23 14:59:08,838 INFO 
>>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>>> '1a361727-1607-43d9-bd22-34d45b386d3e'(d0lqva033) moved from 'Up' -->
>>>> 'NotResponding'
>>>> 2018-01-23 14:59:08,911 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM d0lqva033 is not responding.
>>>> 2018-01-23 14:59:08,913 INFO 
>>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>>> '0cd65f90-719e-429e-a845-f425612d7b14'(vtop4) moved from 'Up' -->
>>>> 'NotResponding'
>>>> 2018-01-23 14:59:08,984 WARN
>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>>> null, Custom Event ID: -1, Message: VM vtop4 is not responding.
>>>>
>>>>>
>>>>> Probably it's time to think about upgrading your environment from 3.6.
>>>>
>>>>
>>>> I know.  But from a production standpoint mid-2016 wasn't that long 
>>>> ago.
>>>> And 4 was just coming out of beta at the time.
>>>>
>>>> We were upgrading from 3.4 to 3.6.  And it took a long time (again,
>>>> because it's all "live").  Trust me, the move to 4.0 was discussed; it
>>>> was just a timing thing.
>>>>
>>>> With that said, I do "hear you"... and certainly it's being discussed.
>>>> We just don't see a "good" migration path... we see a slow path (moving
>>>> nodes out, upgrading, etc.), and, as with all things, nobody can
>>>> guarantee "success", and failure would be a very bad thing.  So going
>>>> from a working 3.6 to a (potentially) totally broken 4.2 isn't going to
>>>> impress anyone here, you know?  If all goes according to our best
>>>> guesses, then great, but if things go bad, and the chance is not
>>>> insignificant, well... I'm just not quite prepared with my résumé, if
>>>> you know what I mean.
>>>>
>>>> Don't get me wrong, our move from 3.4 to 3.6 had some similar risks,
>>>> but we also migrated to a whole new infrastructure, a luxury we will
>>>> not have this time.  And somehow 3.4 to 3.6 doesn't sound as risky as
>>>> 3.6 to 4.2.
>>>
>>> I see your concern. However, keeping your system updated with recent
>>> software is something I would recommend. You could set up a parallel
>>> 4.2 env and move the VMs slowly from 3.6.
>>
>> Understood.  But would people want software that changes so quickly? 
>> This isn't like moving from RH 7.2 to 7.3 in a matter of months; it's 
>> more like moving from major release to major release in a matter of 
>> months, and potentially doing it again in a matter of months.  Granted, 
>> we're running oVirt and not RHV, so maybe we should be on the 
>> Fedora-style upgrade plan.  It's just not conducive to an enterprise 
>> environment (oVirt people, stop laughing).
>>
>>>
>>>>
>>>> Is there a path from oVirt to RHEV?  Every bit of help we get helps us
>>>> in making that decision as well, which I think would be a very good
>>>> thing for both of us.  (I inherited all this oVirt, and I was the "guy"
>>>> who did the 3.4 to 3.6 move with the all-new infrastructure.)
>>>
>>> Yes, you can import your setup to RHEV.
>>
>> Good to know. Because of the fragility (support-wise... I mean our 
>> oVirt has been rock solid, apart from rare glitches like this), we may 
>> follow this path.

