Hi,
Last night we had an incident with a failed host. The engine issued a fence but did not
restart the VMs that had been running on that node on the other operational hosts. I'd
like to know whether this is normal or whether I can tune it somehow.
Here are some relevant logs from the engine:
2018-09-05 03:00:51,496+03 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engine-Thread-827644) [] Host 'v3' is not responding. It will stay in Connecting state for a grace period of 63 seconds and after that an attempt to fence the host will be issued.
2018-09-05 03:01:11,945+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (EE-ManagedThreadFactory-engineScheduled-Thread-57) [] Failed to fetch vms info for host 'v3' - skipping VMs monitoring.
2018-09-05 03:01:48,028+03 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-827679) [] EVENT_ID: VM_SET_TO_UNKNOWN_STATUS(142), VM vm7 was set to the Unknown status.
2018-09-05 03:02:10,033+03 INFO [org.ovirt.engine.core.bll.pm.StopVdsCommand] (EE-ManagedThreadFactory-engine-Thread-827680) [30369e01] Power-Management: STOP of host 'v3' initiated.
2018-09-05 03:02:55,935+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-827680) [3adcac38] EVENT_ID: VM_WAS_SET_DOWN_DUE_TO_HOST_REBOOT_OR_MANUAL_FENCE(143), Vm vm7 was shut down due to v3 host reboot or manual fence
2018-09-05 03:02:56,018+03 INFO [org.ovirt.engine.core.bll.pm.StopVdsCommand] (EE-ManagedThreadFactory-engine-Thread-827680) [ea0f582] Power-Management: STOP host 'v3' succeeded.
2018-09-05 03:08:20,818+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-91) [326878] EVENT_ID: VDS_DETECTED(13), Status of host v3 was set to Up.
2018-09-05 03:08:23,391+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-88) [] VM '3b1262ef-7fff-40af-b85e-9fd01a4f422b'(vm7) was unexpectedly detected as 'Down' on VDS '4970369d-21c2-467d-9247-c73ca2d71b3e'(v3) (expected on 'null')
As you can see, the engine fenced node v3.
However, vm7 and the other VMs that were running on that node were not restarted.
Any tips?
The engine is ovirt-engine-4.2.5.3-1.el7.noarch and the host is vdsm-4.20.35-1.el7.x86_64.
Best regards,
Giannis