Hello list,
I have been slowly bringing up a 9-node cluster for the last few months.
All nodes are identical: dual 2-port 10G NICs, lots of memory and CPU.
Storage is a NetApp filer accessed via NFS on a dedicated 10Gb
dual-switch environment.
Generally everything is working fine, but ever since our last rebuild of
the cluster in preparation for a move into production status, we have
been getting repeated errors showing in the HostedEngine console:
VM foo is not responding.
VM bar is not responding.
VM baz is not responding.
These errors happen on a fairly regular basis, and generally involve
multiple VMs hosted on different nodes. When the errors occur I also
lose external connectivity to the VMs in question, both via their
service IP addresses and via the oVirt console. The actual outages
appear to generally last 15-20 seconds, and then things recover and go
back to normal.
We are also getting the following error, much more frequently:
ETL service sampling has encountered an error. Please consult the
service log for more details.
I have attached snippets from the Engine's engine.log from this morning.
If any other logs are needed to help with diagnosis, I can provide them.
--
Brian Ismay
SR. Systems Administrator
jismay(a)cenic.org
----
engine.log: NOTE: the system clock is in UTC and local time is PDT, so
this occurred at 07:48 AM local time.
2017-08-09 14:48:37,237 INFO
[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(DefaultQuartzScheduler6) [2cea1ef7] VM
'69880324-2d2e-4a70-8071-4ae0f0ae342e'(vm1) moved from 'Up' -->
'NotResponding'
2017-08-09 14:48:37,277 WARN
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(DefaultQuartzScheduler6) [2cea1ef7] Correlation ID: null, Call Stack:
null, Custom Event ID: -1, Message: VM vm1 is not responding.
2017-08-09 14:48:37,277 INFO
[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(DefaultQuartzScheduler6) [2cea1ef7] VM
'4471e3ee-9f69-4903-b68f-c1293aea047f'(vm2) moved from 'Up' -->
'NotResponding'
2017-08-09 14:48:37,282 WARN
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(DefaultQuartzScheduler6) [2cea1ef7] Correlation ID: null, Call Stack:
null, Custom Event ID: -1, Message: VM vm2 is not responding.
2017-08-09 14:48:38,326 INFO
[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(DefaultQuartzScheduler5) [cf129f7] VM
'35fd4afa-12a1-4326-9db5-a86939a01fa8'(vm3) moved from 'Up' -->
'NotResponding'
2017-08-09 14:48:38,360 WARN
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(DefaultQuartzScheduler5) [cf129f7] Correlation ID: null, Call Stack:
null, Custom Event ID: -1, Message: VM vm3 is not responding.
2017-08-09 14:48:38,360 INFO
[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(DefaultQuartzScheduler5) [cf129f7] VM
'd83e9633-3597-4046-95ee-2a166682b85e'(vm4) moved from 'Up' -->
'NotResponding'
2017-08-09 14:48:38,365 WARN
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(DefaultQuartzScheduler5) [cf129f7] Correlation ID: null, Call Stack:
null, Custom Event ID: -1, Message: VM vm4 is not responding.
2017-08-09 14:48:49,075 INFO
[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(DefaultQuartzScheduler8) [3b1149ff] VM
'd41984d0-4418-4991-9af0-25593abac976'(vm5) moved from 'Up' -->
'NotResponding'
2017-08-09 14:48:49,130 WARN
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(DefaultQuartzScheduler8) [3b1149ff] Correlation ID: null, Call Stack:
null, Custom Event ID: -1, Message: VM vm5 is not responding.
2017-08-09 14:48:49,131 INFO
[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(DefaultQuartzScheduler8) [3b1149ff] VM
'ed87b37d-5b79-4105-ba89-29a59361eb4e'(vm6) moved from 'Up' -->
'NotResponding'
2017-08-09 14:48:49,136 WARN
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(DefaultQuartzScheduler8) [3b1149ff] Correlation ID: null, Call Stack:
null, Custom Event ID: -1, Message: VM vm6 is not responding.
2017-08-09 14:48:52,221 INFO
[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(DefaultQuartzScheduler7) [2973c87] VM
'506980f4-6764-4cc6-bb20-c1956d8ed201'(vm7) moved from 'Up' -->
'NotResponding'
2017-08-09 14:48:52,226 WARN
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(DefaultQuartzScheduler7) [2973c87] Correlation ID: null, Call Stack:
null, Custom Event ID: -1, Message: VM vm7 is not responding.
2017-08-09 14:48:52,299 INFO
[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(DefaultQuartzScheduler6) [2cea1ef7] VM
'69880324-2d2e-4a70-8071-4ae0f0ae342e'(vm1) moved from 'NotResponding'
--> 'Up'
2017-08-09 14:48:52,300 INFO
[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(DefaultQuartzScheduler6) [2cea1ef7] VM
'4471e3ee-9f69-4903-b68f-c1293aea047f'(vm2) moved from 'NotResponding'
--> 'Up'
2017-08-09 14:48:53,373 INFO
[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(DefaultQuartzScheduler5) [cf129f7] VM
'638b2aab-e4f7-43e0-a2a8-95c75813e669'(vm8) moved from 'Up' -->
'NotResponding'
2017-08-09 14:48:53,379 WARN
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(DefaultQuartzScheduler5) [cf129f7] Correlation ID: null, Call Stack:
null, Custom Event ID: -1, Message: VM vm8 is not responding.
2017-08-09 14:48:54,380 INFO
[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(DefaultQuartzScheduler6) [2cea1ef7] VM
'35fd4afa-12a1-4326-9db5-a86939a01fa8'(vm3) moved from 'NotResponding'
--> 'Up'
2017-08-09 14:48:54,381 INFO
[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(DefaultQuartzScheduler6) [2cea1ef7] VM
'd83e9633-3597-4046-95ee-2a166682b85e'(vm4) moved from 'NotResponding'
--> 'Up'
2017-08-09 14:49:04,197 INFO
[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(DefaultQuartzScheduler7) [2973c87] VM
'd41984d0-4418-4991-9af0-25593abac976'(vm5) moved from 'NotResponding'
--> 'Up'
2017-08-09 14:49:04,198 INFO
[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(DefaultQuartzScheduler7) [2973c87] VM
'ed87b37d-5b79-4105-ba89-29a59361eb4e'(vm6) moved from 'NotResponding'
--> 'Up'
2017-08-09 14:49:07,293 INFO
[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(DefaultQuartzScheduler8) [3b1149ff] VM
'506980f4-6764-4cc6-bb20-c1956d8ed201'(vm7) moved from 'NotResponding'
--> 'Up'
2017-08-09 14:49:09,388 INFO
[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(DefaultQuartzScheduler7) [2973c87] VM
'638b2aab-e4f7-43e0-a2a8-95c75813e669'(vm8) moved from 'NotResponding'
--> 'Up'
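
For anyone who wants to check the 15-20 second figure against a longer
engine.log, here is a rough Python sketch that pairs each
'Up' --> 'NotResponding' transition with the matching recovery and
reports the gap per VM. The function names (records, outages) are
made up for illustration, and the regular expressions assume the
VmAnalyzer message format shown in the excerpt above, including the
mail-wrapped lines; other engine versions may log this differently.

```python
import re
from datetime import datetime

# Assumption: every log entry starts with "YYYY-MM-DD HH:MM:SS,mmm",
# as in the excerpt above; anything else is a wrapped continuation line.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")

# Assumption: VmAnalyzer transitions look like
#   VM '<uuid>'(<name>) moved from '<old>' --> '<new>'
MOVE = re.compile(
    r"VM\s+'([0-9a-f-]+)'\((\S+?)\)\s+moved from\s+'(\w+)'\s+-->\s+'(\w+)'"
)


def records(lines):
    """Join wrapped lines into one string per timestamped entry."""
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if TS.match(line):
            if current is not None:
                yield current
            current = line
        elif current is not None:
            current += " " + line
    if current is not None:
        yield current


def outages(lines):
    """Return (vm_name, seconds) for each NotResponding -> recovered pair."""
    down = {}      # vm uuid -> time it went NotResponding
    result = []
    for rec in records(lines):
        move = MOVE.search(rec)
        if not move:
            continue  # e.g. the AuditLogDirector WARN entries
        ts = datetime.strptime(TS.match(rec).group(1), "%Y-%m-%d %H:%M:%S,%f")
        vm_id, name, old, new = move.groups()
        if new == "NotResponding":
            down[vm_id] = ts
        elif old == "NotResponding" and vm_id in down:
            result.append((name, (ts - down.pop(vm_id)).total_seconds()))
    return result


# The two vm1 entries from the excerpt, wrapped exactly as mailed.
SAMPLE = """\
2017-08-09 14:48:37,237 INFO
[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(DefaultQuartzScheduler6) [2cea1ef7] VM
'69880324-2d2e-4a70-8071-4ae0f0ae342e'(vm1) moved from 'Up' -->
'NotResponding'
2017-08-09 14:48:52,299 INFO
[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(DefaultQuartzScheduler6) [2cea1ef7] VM
'69880324-2d2e-4a70-8071-4ae0f0ae342e'(vm1) moved from 'NotResponding'
--> 'Up'
"""

for vm_name, seconds in outages(SAMPLE.splitlines()):
    print(f"{vm_name}: not responding for {seconds:.1f}s")
```

To run it over a full log, pass the open file object instead of SAMPLE,
for example outages(open(path)) with path pointing at your engine.log.
Note that this only measures how long the Engine considered a VM
unresponsive, which depends on its polling interval, so it is an upper
bound on what the guests actually experienced rather than an exact
outage length.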