[ovirt-users] ovirt 3.6, we had the ovirt manager go down in a bad way and all VMs for one node marked Unknown and Not Responding while up

Christopher Cox ccox at endlessnow.com
Mon Feb 5 16:02:39 UTC 2018


Forgive the top post.  I guess what I need to know now is whether there 
is a recovery path that doesn't lead to total loss of the VMs that are 
currently in the "Unknown" "Not responding" state.

We are planning a total oVirt shutdown.  I just would like to know if 
we've effectively lost those VMs or not.  Again, the VMs are currently 
"up".  And we use a file backup process, so in theory they can be 
restored, just somewhat painfully, from scratch.

But if somebody knows: if we shut down all the bad VMs and the blade, is 
there some way oVirt can know the VMs are "ok" to start up?  Will 
changing their state directly to "down" in the db stick if the blade is 
down?  That is, will we get to a known state where the VMs can actually 
be started and brought back up cleanly?
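
To make that concrete, here is roughly the kind of forced reset I'm 
asking about.  This is only a sketch: it assumes the 3.6 engine db still 
uses the vm_dynamic table with 0 meaning Down (the same value I was 
poking at earlier in this thread), that the db is named "engine", and 
that the engine service is stopped first so host monitoring can't flip 
the row back:

   # on the engine host, with the blade already powered off
   systemctl stop ovirt-engine     # so HostMonitoring can't overwrite the row
   su - postgres
   psql engine -c "UPDATE vm_dynamic SET status = 0
                    WHERE vm_guid IN (SELECT vm_guid FROM vm_static
                                       WHERE vm_name = 'vtop3');"
   exit
   systemctl start ovirt-engine

If that would stick with the blade powered off, it answers my question; 
if the engine just re-marks the VMs Unknown on startup, we're back to 
square one.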

Right now, we're feeling there's a good chance we will not be able to 
recover these VMs, even though they are currently "up".  I really need 
some way to force oVirt back into a consistent state, even if it means we take 
the whole thing down.
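
If it helps, the specific knob I keep coming back to is the "host has 
been rebooted" confirmation.  Once the blade really is powered off, 
something like this might be all we need (again only a sketch: I'm 
assuming the 3.6 REST API still exposes the host "fence" action with 
fence_type "manual" as the API-side equivalent of "Confirm host has been 
rebooted"; ENGINE_FQDN, HOST_ID and PASSWORD are placeholders):

   curl -k -u admin@internal:PASSWORD \
        -H "Content-Type: application/xml" \
        -d "<action><fence_type>manual</fence_type></action>" \
        https://ENGINE_FQDN/ovirt-engine/api/hosts/HOST_ID/fence

If I understand the flow, that tells the engine the host and everything 
on it is really down, after which the VMs should become startable again.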

Possible?


On 01/25/2018 06:57 PM, Christopher Cox wrote:
> 
> 
> On 01/25/2018 04:57 PM, Douglas Landgraf wrote:
>> On Thu, Jan 25, 2018 at 5:12 PM, Christopher Cox <ccox at endlessnow.com> 
>> wrote:
>>> On 01/25/2018 02:25 PM, Douglas Landgraf wrote:
>>>>
>>>> On Wed, Jan 24, 2018 at 10:18 AM, Christopher Cox <ccox at endlessnow.com>
>>>> wrote:
>>>>>
>>>>> Would restarting vdsm on the node in question help fix this?  
>>>>> Again, all
>>>>> the
>>>>> VMs are up on the node.  Prior attempts to fix this problem have 
>>>>> left the
>>>>> node in a state where I can't issue the "has been rebooted" command 
>>>>> to it;
>>>>> it's confused.
>>>>>
>>>>> So... node is up.  All VMs are up.  Can't issue "has been rebooted" to
>>>>> the
>>>>> node, all VMs show Unknown and not responding but they are up.
>>>>>
>>>>> Changing the status in the ovirt db to 0 works for a second and then it
>>>>> goes
>>>>> immediately back to 8 (which is why I'm wondering if I should restart
>>>>> vdsm
>>>>> on the node).
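>>>>>
>>>>> (To be clear, the restart I have in mind is just the daemon, not the
>>>>> guests.  Something like this on the node, assuming an EL7-based host
>>>>> where vdsm runs as the vdsmd service:
>>>>>
>>>>>    systemctl status vdsmd           # what state does it think it's in?
>>>>>    systemctl restart vdsmd          # libvirt/qemu and the guests stay up
>>>>>    tail -f /var/log/vdsm/vdsm.log   # watch it reconnect to the engine
>>>>>
>>>>> As far as I know, restarting vdsmd doesn't touch the running qemu
>>>>> processes, so the VMs themselves should stay up; I just don't know
>>>>> whether the engine will re-sync their status afterwards.)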
>>>>
>>>>
>>>> It's not recommended to change the db manually.
>>>>
>>>>>
>>>>> Oddly enough, we're running all of this in production.  So, 
>>>>> watching it
>>>>> all
>>>>> go down isn't the best option for us.
>>>>>
>>>>> Any advice is welcome.
>>>>
>>>>
>>>>
>>>> We would need to see the node/engine logs. Have you found any errors in
>>>> the vdsm.log
>>>> (from the nodes) or engine.log? Could you please share the error?
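>>>>
>>>> (Even just the ERROR/WARN lines from around the time of the outage
>>>> would help; for example, assuming the default log locations:
>>>>
>>>>    grep -E 'ERROR|WARN' /var/log/vdsm/vdsm.log        # on each node
>>>>    grep ERROR /var/log/ovirt-engine/engine.log        # on the engine
>>>>
>>>> plus xzgrep against the rotated .xz files if the window has already
>>>> rolled over.)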
>>>
>>>
>>>
>>> In short, the error is that our ovirt manager lost its network (our 
>>> problem) and
>>> crashed hard (hardware issue on the server).  On bring-up, we had some
>>> network changes (that caused the lost network problem), so our LACP 
>>> bond was
>>> down for a bit while we were trying to bring it up (noting the ovirt 
>>> manager
>>> is up while we're reestablishing the network on the switch side).
>>>
>>> In other words, that's the "error", so to speak, that got us to where we 
>>> are.
>>>
>>> Full DEBUG is enabled on the logs... The error messages seem obvious to 
>>> me.
>>> It starts like this (noting the ISO DOMAIN was coming off an NFS mount 
>>> off the
>>> ovirt management server... yes... we know... we do have plans to move 
>>> that).
>>>
>>> So on the hypervisor node itself, from the vdsm.log (vdsm.log.33.xz):
>>>
>>> (hopefully no surprise here)
>>>
>>> Thread-2426633::WARNING::2018-01-23
>>> 13:50:56,672::fileSD::749::Storage.scanDomains::(collectMetaFiles) 
>>> Could not
>>> collect metadata file for domain path
>>> /rhev/data-center/mnt/d0lppc129.skopos.me:_var_lib_exports_iso-20160408002844 
>>>
>>> Traceback (most recent call last):
>>>    File "/usr/share/vdsm/storage/fileSD.py", line 735, in 
>>> collectMetaFiles
>>>      sd.DOMAIN_META_DATA))
>>>    File "/usr/share/vdsm/storage/outOfProcess.py", line 121, in glob
>>>      return self._iop.glob(pattern)
>>>    File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", 
>>> line 536,
>>> in glob
>>>      return self._sendCommand("glob", {"pattern": pattern}, 
>>> self.timeout)
>>>    File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", 
>>> line 421,
>>> in _sendCommand
>>>      raise Timeout(os.strerror(errno.ETIMEDOUT))
>>> Timeout: Connection timed out
>>> Thread-27::ERROR::2018-01-23
>>> 13:50:56,672::sdc::145::Storage.StorageDomainCache::(_findDomain) domain
>>> e5ecae2f-5a06-4743-9a43-e74d83992c35 not found
>>> Traceback (most recent call last):
>>>    File "/usr/share/vdsm/storage/sdc.py", line 143, in _findDomain
>>>      dom = findMethod(sdUUID)
>>>    File "/usr/share/vdsm/storage/nfsSD.py", line 122, in findDomain
>>>      return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
>>>    File "/usr/share/vdsm/storage/nfsSD.py", line 112, in findDomainPath
>>>      raise se.StorageDomainDoesNotExist(sdUUID)
>>> StorageDomainDoesNotExist: Storage domain does not exist:
>>> (u'e5ecae2f-5a06-4743-9a43-e74d83992c35',)
>>> Thread-27::ERROR::2018-01-23
>>> 13:50:56,673::monitor::276::Storage.Monitor::(_monitorDomain) Error
>>> monitoring domain e5ecae2f-5a06-4743-9a43-e74d83992c35
>>> Traceback (most recent call last):
>>>    File "/usr/share/vdsm/storage/monitor.py", line 272, in 
>>> _monitorDomain
>>>      self._performDomainSelftest()
>>>    File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 769, in
>>> wrapper
>>>      value = meth(self, *a, **kw)
>>>    File "/usr/share/vdsm/storage/monitor.py", line 339, in
>>> _performDomainSelftest
>>>      self.domain.selftest()
>>>    File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
>>>      return getattr(self.getRealDomain(), attrName)
>>>    File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
>>>      return self._cache._realProduce(self._sdUUID)
>>>    File "/usr/share/vdsm/storage/sdc.py", line 124, in _realProduce
>>>      domain = self._findDomain(sdUUID)
>>>    File "/usr/share/vdsm/storage/sdc.py", line 143, in _findDomain
>>>      dom = findMethod(sdUUID)
>>>    File "/usr/share/vdsm/storage/nfsSD.py", line 122, in findDomain
>>>      return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
>>>    File "/usr/share/vdsm/storage/nfsSD.py", line 112, in findDomainPath
>>>      raise se.StorageDomainDoesNotExist(sdUUID)
>>> StorageDomainDoesNotExist: Storage domain does not exist:
>>> (u'e5ecae2f-5a06-4743-9a43-e74d83992c35',)
>>>
>>>
>>> Again, all the hypervisor nodes will complain about the NFS 
>>> area for the
>>> ISO DOMAIN now being gone.  Remember, the ovirt manager node held this, 
>>> and its
>>> network has now gone out and the node crashed (note: the ovirt manager 
>>> node (the actual server box) shouldn't crash due to a network outage, but it 
>>> did).
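>>>
>>> A quick sanity check from one of the nodes, using the mount path from
>>> the traceback above, would be something like:
>>>
>>>    showmount -e d0lppc129.skopos.me    # is the ISO export visible again?
>>>    mount | grep iso-20160408002844     # is the stale NFS mount still there?
>>>    ls /rhev/data-center/mnt/d0lppc129.skopos.me:_var_lib_exports_iso-20160408002844
>>>
>>> That should show whether it really is only the ISO domain the nodes are
>>> unhappy about.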
>>
>>
>> I have added VDSM people to this thread to review it. I am assuming
>> that, despite the network changes (during the crash), the storage domain
>> is still available to the nodes.
> 
> Ideally, nothing was lost node-wise (neither LAN nor iSCSI); just the 
> ovirt manager lost its network connection.  So the only thing, 
> storage-wise, that was lost (as I mentioned) was the ISO DOMAIN, which 
> was NFS'd off the ovirt manager.
> 
>>
>>>
>>> So here is the engine collapse as it lost network connectivity 
>>> (before the
>>> server actually crashed hard).
>>>
>>> 2018-01-23 13:45:33,666 ERROR
>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>> (DefaultQuartzScheduler_Worker-87) [] Correlation ID: null, Call Stack:
>>> null, Custom Event ID: -1, Message: VDSM d0lppn067 command failed: 
>>> Heartbeat
>>> exeeded
>>> 2018-01-23 13:45:33,666 ERROR
>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>> (DefaultQuartzScheduler_Worker-10) [21574461] Correlation ID: null, Call
>>> Stack: null, Custom Event ID: -1, Message: VDSM d0lppn072 command 
>>> failed:
>>> Heartbeat exeeded
>>> 2018-01-23 13:45:33,666 ERROR
>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>> (DefaultQuartzScheduler_Worker-37) [4e8ec41d] Correlation ID: null, Call
>>> Stack: null, Custom Event ID: -1, Message: VDSM d0lppn066 command 
>>> failed:
>>> Heartbeat exeeded
>>> 2018-01-23 13:45:33,667 ERROR
>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand]
>>> (DefaultQuartzScheduler_Worker-87) [] Command 
>>> 'GetStatsVDSCommand(HostName =
>>> d0lppn067, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true',
>>> hostId='f99c68c8-b0e8-437b-8cd9-ebaddaaede96',
>>> vds='Host[d0lppn067,f99c68c8-b0e8-437b-8cd9-ebaddaaede96]'})' execution
>>> failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>> 2018-01-23 13:45:33,667 ERROR
>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand]
>>> (DefaultQuartzScheduler_Worker-10) [21574461] Command
>>> 'GetStatsVDSCommand(HostName = d0lppn072,
>>> VdsIdAndVdsVDSCommandParametersBase:{runAsync='true',
>>> hostId='fdc00296-973d-4268-bd79-6dac535974e0',
>>> vds='Host[d0lppn072,fdc00296-973d-4268-bd79-6dac535974e0]'})' execution
>>> failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>> 2018-01-23 13:45:33,667 ERROR
>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand]
>>> (DefaultQuartzScheduler_Worker-37) [4e8ec41d] Command
>>> 'GetStatsVDSCommand(HostName = d0lppn066,
>>> VdsIdAndVdsVDSCommandParametersBase:{runAsync='true',
>>> hostId='14abf559-4b62-4ebd-a345-77fa9e1fa3ae',
>>> vds='Host[d0lppn066,14abf559-4b62-4ebd-a345-77fa9e1fa3ae]'})' execution
>>> failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>> 2018-01-23 13:45:33,669 ERROR
>>> [org.ovirt.engine.core.vdsbroker.HostMonitoring]
>>> (DefaultQuartzScheduler_Worker-87) []  Failed getting vds stats,
>>> vds='d0lppn067'(f99c68c8-b0e8-437b-8cd9-ebaddaaede96):
>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
>>> VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>> 2018-01-23 13:45:33,669 ERROR
>>> [org.ovirt.engine.core.vdsbroker.HostMonitoring]
>>> (DefaultQuartzScheduler_Worker-10) [21574461]  Failed getting vds stats,
>>> vds='d0lppn072'(fdc00296-973d-4268-bd79-6dac535974e0):
>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
>>> VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>> 2018-01-23 13:45:33,669 ERROR
>>> [org.ovirt.engine.core.vdsbroker.HostMonitoring]
>>> (DefaultQuartzScheduler_Worker-37) [4e8ec41d]  Failed getting vds stats,
>>> vds='d0lppn066'(14abf559-4b62-4ebd-a345-77fa9e1fa3ae):
>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
>>> VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>> 2018-01-23 13:45:33,671 ERROR
>>> [org.ovirt.engine.core.vdsbroker.HostMonitoring]
>>> (DefaultQuartzScheduler_Worker-10) [21574461] Failure to refresh Vds 
>>> runtime
>>> info: VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>> 2018-01-23 13:45:33,671 ERROR
>>> [org.ovirt.engine.core.vdsbroker.HostMonitoring]
>>> (DefaultQuartzScheduler_Worker-37) [4e8ec41d] Failure to refresh Vds 
>>> runtime
>>> info: VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>> 2018-01-23 13:45:33,671 ERROR
>>> [org.ovirt.engine.core.vdsbroker.HostMonitoring]
>>> (DefaultQuartzScheduler_Worker-87) [] Failure to refresh Vds runtime 
>>> info:
>>> VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>> 2018-01-23 13:45:33,671 ERROR
>>> [org.ovirt.engine.core.vdsbroker.HostMonitoring]
>>> (DefaultQuartzScheduler_Worker-37) [4e8ec41d] Exception:
>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
>>> VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>>          at
>>> org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) 
>>>
>>> [vdsbroker.jar:]
>>>          at
>>> org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:21) 
>>>
>>> [vdsbroker.jar:]
>>>          at
>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110) 
>>>
>>> [vdsbroker.jar:]
>>>          at
>>> org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65) 
>>>
>>> [vdsbroker.jar:]
>>>          at
>>> org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33)
>>> [dal.jar:]
>>>          at
>>> org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) 
>>>
>>> [vdsbroker.jar:]
>>>          at
>>> org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsStats(HostMonitoring.java:472) 
>>>
>>> [vdsbroker.jar:]
>>>          at
>>> org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:114) 
>>>
>>> [vdsbroker.jar:]
>>>          at
>>> org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:84) 
>>>
>>> [vdsbroker.jar:]
>>>          at
>>> org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:227)
>>> [vdsbroker.jar:]
>>>          at sun.reflect.GeneratedMethodAccessor75.invoke(Unknown Source)
>>> [:1.8.0_102]
>>>          at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
>>>
>>> [rt.jar:1.8.0_102]
>>>          at java.lang.reflect.Method.invoke(Method.java:498)
>>> [rt.jar:1.8.0_102]
>>>          at
>>> org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) 
>>>
>>> [scheduler.jar:]
>>>          at
>>> org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52)
>>> [scheduler.jar:]
>>>          at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
>>> [quartz.jar:]
>>>          at
>>> org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) 
>>>
>>> [quartz.jar:]
>>>
>>> 2018-01-23 13:45:33,671 ERROR
>>> [org.ovirt.engine.core.vdsbroker.HostMonitoring]
>>> (DefaultQuartzScheduler_Worker-10) [21574461] Exception:
>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
>>> VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>>          at
>>> org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) 
>>>
>>> [vdsbroker.jar:]
>>>          at
>>> org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:21) 
>>>
>>> [vdsbroker.jar:]
>>>          at
>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110) 
>>>
>>> [vdsbroker.jar:]
>>>          at
>>> org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65) 
>>>
>>> [vdsbroker.jar:]
>>>          at
>>> org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33)
>>> [dal.jar:]
>>>          at
>>> org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) 
>>>
>>> [vdsbroker.jar:]
>>>          at
>>> org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsStats(HostMonitoring.java:472) 
>>>
>>> [vdsbroker.jar:]
>>>          at
>>> org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:114) 
>>>
>>> [vdsbroker.jar:]
>>>          at
>>> org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:84) 
>>>
>>> [vdsbroker.jar:]
>>>          at
>>> org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:227)
>>> [vdsbroker.jar:]
>>>          at sun.reflect.GeneratedMethodAccessor75.invoke(Unknown Source)
>>> [:1.8.0_102]
>>>          at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
>>>
>>> [rt.jar:1.8.0_102]
>>>          at java.lang.reflect.Method.invoke(Method.java:498)
>>> [rt.jar:1.8.0_102]
>>>          at
>>> org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) 
>>>
>>> [scheduler.jar:]
>>>          at
>>> org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52)
>>> [scheduler.jar:]
>>>          at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
>>> [quartz.jar:]
>>>          at
>>> org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) 
>>>
>>> [quartz.jar:]
>>>
>>> 2018-01-23 13:45:33,671 ERROR
>>> [org.ovirt.engine.core.vdsbroker.HostMonitoring]
>>> (DefaultQuartzScheduler_Worker-87) [] Exception:
>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
>>> VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>>          at
>>> org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) 
>>>
>>> [vdsbroker.jar:]
>>>          at
>>> org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:21) 
>>>
>>> [vdsbroker.jar:]
>>>          at
>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110) 
>>>
>>> [vdsbroker.jar:]
>>>          at
>>> org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65) 
>>>
>>> [vdsbroker.jar:]
>>>          at
>>> org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33)
>>> [dal.jar:]
>>>          at
>>> org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) 
>>>
>>> [vdsbroker.jar:]
>>>          at
>>> org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsStats(HostMonitoring.java:472) 
>>>
>>> [vdsbroker.jar:]
>>>
>>>
>>>
>>>
>>> Here are the engine logs showing the problem with node d0lppn065; the VMs 
>>> first go
>>> to "Unknown", then to "Unknown" plus "not responding":
>>>
>>> 2018-01-23 14:48:00,712 ERROR
>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>> (org.ovirt.thread.pool-8-thread-28) [] Correlation ID: null, Call Stack:
>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
>>> org.ovirt.vdsm.jsonrpc.client.ClientConnection
>>> Exception: Connection failed
>>>          at
>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.createNetworkException(VdsBrokerCommand.java:157) 
>>>
>>>          at
>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:120) 
>>>
>>>          at
>>> org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65) 
>>>
>>>          at
>>> org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33)
>>>          at
>>> org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) 
>>>
>>>          at
>>> org.ovirt.engine.core.vdsbroker.VmsStatisticsFetcher.fetch(VmsStatisticsFetcher.java:27) 
>>>
>>>          at
>>> org.ovirt.engine.core.vdsbroker.PollVmStatsRefresher.poll(PollVmStatsRefresher.java:35) 
>>>
>>>          at sun.reflect.GeneratedMethodAccessor80.invoke(Unknown Source)
>>>          at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
>>>
>>>          at java.lang.reflect.Method.invoke(Method.java:498)
>>>          at
>>> org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) 
>>>
>>>          at
>>> org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52)
>>>          at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
>>>          at
>>> org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) 
>>>
>>> Caused by: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException:
>>> Connection failed
>>>          at
>>> org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient.connect(ReactorClient.java:155) 
>>>
>>>          at
>>> org.ovirt.vdsm.jsonrpc.client.JsonRpcClient.getClient(JsonRpcClient.java:134) 
>>>
>>>          at
>>> org.ovirt.vdsm.jsonrpc.client.JsonRpcClient.call(JsonRpcClient.java:81)
>>>          at
>>> org.ovirt.engine.core.vdsbroker.jsonrpc.FutureMap.<init>(FutureMap.java:70) 
>>>
>>>          at
>>> org.ovirt.engine.core.vdsbroker.jsonrpc.JsonRpcVdsServer.getAllVmStats(JsonRpcVdsServer.java:331) 
>>>
>>>          at
>>> org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand.executeVdsBrokerCommand(GetAllVmStatsVDSCommand.java:20) 
>>>
>>>          at
>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110) 
>>>
>>>          ... 12 more
>>> , Custom Event ID: -1, Message: Host d0lppn065 is non responsive.
>>> 2018-01-23 14:48:00,713 INFO 
>>> [org.ovirt.engine.core.bll.VdsEventListener]
>>> (org.ovirt.thread.pool-8-thread-1) [] ResourceManager::vdsNotResponding
>>> entered for Host '2797cae7-6886-4898-a5e4-23361ce03a90', '10.32.0.65'
>>> 2018-01-23 14:48:00,713 WARN
>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>> (org.ovirt.thread.pool-8-thread-36) [] Correlation ID: null, Call Stack:
>>> null, Custom Event ID: -1, Message: VM vtop3 was set to the Unknown 
>>> status.
>>>
>>> ...etc... (sorry about the wraps below)
>>>
>>> 2018-01-23 14:59:07,817 INFO 
>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>> '30f7af86-c2b9-41c3-b2c5-49f5bbdd0e27'(d0lpvd070) moved from 'Up' -->
>>> 'NotResponding'
>>> 2018-01-23 14:59:07,819 INFO
>>> [org.ovirt.engine.core.vdsbroker.VmsStatisticsFetcher]
>>> (DefaultQuartzScheduler_Worker-74) [] Fetched 15 VMs from VDS
>>> '8cb119c5-b7f0-48a3-970a-205d96b2e940'
>>> 2018-01-23 14:59:07,936 WARN
>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>> null, Custom Event ID: -1, Message: VM d0lpvd070 is not responding.
>>> 2018-01-23 14:59:07,939 INFO 
>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>> 'ebc5bb82-b985-451b-8313-827b5f40eaf3'(d0lpvd039) moved from 'Up' -->
>>> 'NotResponding'
>>> 2018-01-23 14:59:08,032 WARN
>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>> null, Custom Event ID: -1, Message: VM d0lpvd039 is not responding.
>>> 2018-01-23 14:59:08,038 INFO 
>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>> '494c4f9e-1616-476a-8f66-a26a96b76e56'(vtop3) moved from 'Up' -->
>>> 'NotResponding'
>>> 2018-01-23 14:59:08,134 WARN
>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>> null, Custom Event ID: -1, Message: VM vtop3 is not responding.
>>> 2018-01-23 14:59:08,136 INFO 
>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>> 'eaeaf73c-d9e2-426e-a2f2-7fcf085137b0'(d0lpvw059) moved from 'Up' -->
>>> 'NotResponding'
>>> 2018-01-23 14:59:08,237 WARN
>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>> null, Custom Event ID: -1, Message: VM d0lpvw059 is not responding.
>>> 2018-01-23 14:59:08,239 INFO 
>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>> '8308a547-37a1-4163-8170-f89b6dc85ba8'(d0lpvm058) moved from 'Up' -->
>>> 'NotResponding'
>>> 2018-01-23 14:59:08,326 WARN
>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>> null, Custom Event ID: -1, Message: VM d0lpvm058 is not responding.
>>> 2018-01-23 14:59:08,328 INFO 
>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>> '3d544926-3326-44e1-8b2a-ec632f51112a'(d0lqva056) moved from 'Up' -->
>>> 'NotResponding'
>>> 2018-01-23 14:59:08,400 WARN
>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>> null, Custom Event ID: -1, Message: VM d0lqva056 is not responding.
>>> 2018-01-23 14:59:08,402 INFO 
>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>> '989e5a17-789d-4eba-8a5e-f74846128842'(d0lpva078) moved from 'Up' -->
>>> 'NotResponding'
>>> 2018-01-23 14:59:08,472 WARN
>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>> null, Custom Event ID: -1, Message: VM d0lpva078 is not responding.
>>> 2018-01-23 14:59:08,474 INFO 
>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>> '050a71c1-9e65-43c6-bdb2-18eba571e2eb'(d0lpvw077) moved from 'Up' -->
>>> 'NotResponding'
>>> 2018-01-23 14:59:08,545 WARN
>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>> null, Custom Event ID: -1, Message: VM d0lpvw077 is not responding.
>>> 2018-01-23 14:59:08,547 INFO 
>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>> 'c3b497fd-6181-4dd1-9acf-8e32f981f769'(d0lpva079) moved from 'Up' -->
>>> 'NotResponding'
>>> 2018-01-23 14:59:08,621 WARN
>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>> null, Custom Event ID: -1, Message: VM d0lpva079 is not responding.
>>> 2018-01-23 14:59:08,623 INFO 
>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>> '7cd22b39-feb1-4c6e-8643-ac8fb0578842'(d0lqva034) moved from 'Up' -->
>>> 'NotResponding'
>>> 2018-01-23 14:59:08,690 WARN
>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>> null, Custom Event ID: -1, Message: VM d0lqva034 is not responding.
>>> 2018-01-23 14:59:08,692 INFO 
>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>> '2ab9b1d8-d1e8-4071-a47c-294e586d2fb6'(d0lpvd038) moved from 'Up' -->
>>> 'NotResponding'
>>> 2018-01-23 14:59:08,763 WARN
>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>> null, Custom Event ID: -1, Message: VM d0lpvd038 is not responding.
>>> 2018-01-23 14:59:08,768 INFO 
>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>> 'ecb4e795-9eeb-4cdc-a356-c1b9b32af5aa'(d0lqva031) moved from 'Up' -->
>>> 'NotResponding'
>>> 2018-01-23 14:59:08,836 WARN
>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>> null, Custom Event ID: -1, Message: VM d0lqva031 is not responding.
>>> 2018-01-23 14:59:08,838 INFO 
>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>> '1a361727-1607-43d9-bd22-34d45b386d3e'(d0lqva033) moved from 'Up' -->
>>> 'NotResponding'
>>> 2018-01-23 14:59:08,911 WARN
>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>> null, Custom Event ID: -1, Message: VM d0lqva033 is not responding.
>>> 2018-01-23 14:59:08,913 INFO 
>>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>> (DefaultQuartzScheduler_Worker-75) [] VM
>>> '0cd65f90-719e-429e-a845-f425612d7b14'(vtop4) moved from 'Up' -->
>>> 'NotResponding'
>>> 2018-01-23 14:59:08,984 WARN
>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>> (DefaultQuartzScheduler_Worker-75) [] Correlation ID: null, Call Stack:
>>> null, Custom Event ID: -1, Message: VM vtop4 is not responding.
>>>
>>>>
>>>> Probably it's time to think about upgrading your environment from 3.6.
>>>
>>>
>>> I know.  But from a production standpoint, mid-2016 wasn't that long ago.
>>> And 4 was just coming out of beta at the time.
>>>
>>> We were upgrading from 3.4 to 3.6.  And it took a long time (again, 
>>> because
>>> it's all "live").  Trust me, the move to 4.0 was discussed, it was 
>>> just a
>>> timing thing.
>>>
>>> With that said, I do "hear you"... and certainly it's being 
>>> discussed.  We
>>> just don't see a "good" migration path... we see a slow path (moving 
>>> nodes
>>> out, upgrading, etc.), knowing that, as with all things, nobody can
>>> guarantee "success", and failure would be a very bad thing.  So going 
>>> from a
>>> working 3.6 to a totally (potentially) broken 4.2 isn't going to 
>>> impress anyone here,
>>> you know?  If all goes according to our best guesses, then great, but 
>>> when
>>> things go bad, and the chance is not insignificant, well... I'm just not
>>> quite prepared with my résumé, if you know what I mean.
>>>
>>> Don't get me wrong, our move from 3.4 to 3.6 had some similar risks, 
>>> but we
>>> also migrated to a whole new infrastructure, a luxury we will not have 
>>> this
>>> time.  And somehow 3.4 to 3.6 doesn't sound as risky as 3.6 to 4.2.
>>
>> I see your concern. However, keeping your system updated with recent
>> software is something I would recommend. You could set up a parallel
>> 4.2 env and move the VMs slowly from 3.6.
> 
> Understood.  But would people want software that changes so quickly? 
> This isn't like moving from RH 7.2 to 7.3 in a matter of months; it's 
> more like moving from major release to major release in a matter of 
> months, and then potentially doing it again in a matter of months.  
> Granted, we're running oVirt and not RHV, so maybe we should be on the Fedora-style 
> upgrade plan.  Just not conducive to an enterprise environment (oVirt 
> people, stop laughing).
> 
>>
>>>
>>> Is there a path from oVirt to RHEV?  Every bit of help we get helps 
>>> us in
>>> making that decision as well, which I think would be a very good 
>>> thing for
>>> both of us. (I inherited all this oVirt and I was the "guy" doing the 
>>> 3.4 to
>>> 3.6 with the all-new infrastructure).
>>
>> Yes, you can import your setup to RHEV.
> 
> Good to know.  Because of the fragility (support-wise... I mean, our 
> oVirt has been rock solid, apart from rare glitches like this), we may 
> follow this path.
> _______________________________________________
> Users mailing list
> Users at ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users

