On 21.12.2018 14:24, Mike Lykov wrote:
I have a hyperconverged 4.2.7 setup with two deployed hosted-engine VM
images, and I hit a 20-30 second network outage. After some ping-ponging
of engine starts (on host 1, then 2, then 1 again) the engine image is
stuck at
"Probing EDD (edd=off to disable)... _"
as here:
https://bugzilla.redhat.com/show_bug.cgi?id=1569827
Now I am looking at the logs.
Full /var/log archives are here:
https://yadi.sk/d/XZ5jJfQLN6QMlA (HE engine logs) - 36 Mb
https://yadi.sk/d/bZ0TYGxFoHGgIQ (ovirtnode6 logs) - 144 Mb
I have CC'd some personal addresses on this email; if it is not relevant
to you, please ignore it.
The host nodes (CentOS 7.5) are named ovirtnode1, 5 and 6. Timeouts (in
the HA agent) are at their defaults. Sanlock is configured (as far as I
can tell).
The HE is running on ovirtnode6, and a spare HE host is deployed on
ovirtnode1.
There are two network links: ovirtmgmt over "ovirtmgmt: port
1(enp59s0f0)" and the glusterfs storage network over the ib0 interface
(a different subnet).
The messages log on ovirtnode6 during the outage:
---------------------------
Dec 21 12:32:56 ovirtnode6 kernel: bnx2x 0000:3b:00.0 enp59s0f0: NIC
Link is Down
Dec 21 12:32:56 ovirtnode6 kernel: ovirtmgmt: port 1(enp59s0f0) entered
disabled state
Dec 21 12:33:13 ovirtnode6 kernel: bnx2x 0000:3b:00.0 enp59s0f0: NIC
Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
Dec 21 12:33:13 ovirtnode6 kernel: IPv6: ADDRCONF(NETDEV_CHANGE):
enp59s0f0: link becomes ready
Dec 21 12:33:13 ovirtnode6 kernel: ovirtmgmt: port 1(enp59s0f0) entered
forwarding state
Dec 21 12:33:13 ovirtnode6 NetworkManager[1715]: <info>
[1545381193.2204] device (enp59s0f0): carrier: link connected
-----------------------
That is 17 seconds; at 12:33:13 the link is back up. BUT all the events
that lead to the crash follow later:
HA agent log:
------------------------------
MainThread::INFO::2018-12-21
12:32:59,540::states::444::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
Engine vm running on localhost
MainThread::INFO::2018-12-21
12:32:59,662::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineUp (score: 3400)
MainThread::INFO::2018-12-21
12:33:09,797::states::136::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
Penalizing score by 1280 due to gateway status
MainThread::INFO::2018-12-21
12:33:09,798::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineUp (score: 2120)
MainThread::ERROR::2018-12-21
12:33:19,815::states::436::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
Host ovirtnode1.miac (id 1) score is significantly better than local
score, shutting down VM on this host
----------------------------------------------
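The score drop above looks like a flat penalty subtracted from the base
score while the gateway is unreachable (3400 - 1280 = 2120). A minimal
sketch of that logic, using the numbers from the log; note the
"significantly better" threshold of 800 here is my assumption for
illustration, not necessarily the agent's real constant:

```python
# Sketch of the hosted-engine HA scoring seen in the agent log above.
# BASE_SCORE and GATEWAY_PENALTY come from the log (3400 base, 1280
# gateway penalty); the real agent has more penalty sources.

BASE_SCORE = 3400
GATEWAY_PENALTY = 1280  # applied while the default gateway is unreachable

def host_score(gateway_up: bool) -> int:
    """Return the HA score for a host given its gateway status."""
    score = BASE_SCORE
    if not gateway_up:
        score -= GATEWAY_PENALTY
    return score

def should_shut_down_local_vm(local_score: int, best_remote_score: int,
                              threshold: int = 800) -> bool:
    """Stop the local engine VM when a remote host's score is
    'significantly better', i.e. exceeds ours by more than the
    threshold (800 is an assumed value, for illustration only)."""
    return best_remote_score - local_score > threshold

# During the outage: the local host is penalized, the spare host is not.
local = host_score(gateway_up=False)    # 2120, matching the log
remote = host_score(gateway_up=True)    # 3400
# 3400 - 2120 = 1280 > 800, so the agent shuts the local VM down.
```

This would explain why a 17-second link flap is enough to trigger a
migration: one 10-second monitoring cycle with the gateway down already
pushes the local score far below the spare host's.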
syslog messages:
Dec 21 12:33:19 ovirtnode6 journal: ovirt-ha-agent
ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Host
ovirtnode1.miac (id 1) score is significantly better than local score,
shutting down VM on this host
Dec 21 12:33:29 ovirtnode6 journal: ovirt-ha-agent
ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Engine VM
stopped on localhost
Dec 21 12:33:37 ovirtnode6 kernel: ovirtmgmt: port 3(vnet1) entered
disabled state
Dec 21 12:33:37 ovirtnode6 kernel: device vnet1 left promiscuous mode
Dec 21 12:33:37 ovirtnode6 kernel: ovirtmgmt: port 3(vnet1) entered
disabled state
Dec 21 12:33:37 ovirtnode6 NetworkManager[1715]: <info>
[1545381217.1796] device (vnet1): state change: disconnected ->
unmanaged (reason 'unmanaged', sys-iface-state: 'removed')
Dec 21 12:33:37 ovirtnode6 NetworkManager[1715]: <info>
[1545381217.1798] device (vnet1): released from master device ovirtmgmt
Dec 21 12:33:37 ovirtnode6 libvirtd: 2018-12-21 08:33:37.192+0000: 2783:
error : qemuMonitorIO:719 : internal error: ***End of file
from qemu monitor*** <- WHAT IS THIS?
Dec 21 12:33:37 ovirtnode6 kvm: 2 guests now active
Dec 21 12:33:37 ovirtnode6 systemd-machined: Machine qemu-2-HostedEngine
terminated.
Dec 21 12:33:37 ovirtnode6 firewalld[1693]: WARNING: COMMAND_FAILED:
'/usr/sbin/iptables -w2 -w -D libvirt-out -m physdev
--physdev-is-bridged --physdev-out vnet1 -g FP-vnet1' failed: iptables
v1.4.21: goto 'FP-vnet1' is not a chain#012#012Try `iptables -h' or
'iptables --help' for more information.
Dec 21 12:33:55 ovirtnode6 kernel: ovirtmgmt: port 3(vnet1) entered
blocking state
Dec 21 12:33:55 ovirtnode6 kernel: ovirtmgmt: port 3(vnet1) entered
disabled state
Dec 21 12:33:55 ovirtnode6 kernel: device vnet1 entered promiscuous mode
Dec 21 12:33:55 ovirtnode6 kernel: ovirtmgmt: port 3(vnet1) entered
blocking state
Dec 21 12:33:55 ovirtnode6 kernel: ovirtmgmt: port 3(vnet1) entered
forwarding state
Dec 21 12:33:55 ovirtnode6 lldpad: recvfrom(Event interface): No buffer
space available
Dec 21 12:33:55 ovirtnode6 NetworkManager[1715]: <info>
[1545381235.8086] manager: (vnet1): new Tun device
(/org/freedesktop/NetworkManager/Devices/37)
Dec 21 12:33:55 ovirtnode6 NetworkManager[1715]: <info>
[1545381235.8121] device (vnet1): state change: unmanaged -> unavailable
(reason 'connection-assumed', sys-iface-state: 'external')
Dec 21 12:33:55 ovirtnode6 NetworkManager[1715]: <info>
[1545381235.8127] device (vnet1): state change: unavailable ->
disconnected (reason 'none', sys-iface-state: 'external')
---------------------------
*** WHAT IS THIS? *** The link came back up some time ago; why do these
bridge state transitions and iptables errors happen?
And then the machine tries to start again:
-----------------
Dec 21 12:33:56 ovirtnode6 systemd-machined: New machine
qemu-15-HostedEngine.
Dec 21 12:33:56 ovirtnode6 systemd: Started Virtual Machine
qemu-15-HostedEngine.
-------------------
HA agent log on this host (ovirtnode6):
-----------------------------------------
MainThread::INFO::2018-12-21
12:33:49,880::states::510::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
Engine down and local host has best score (3400), attempting to start
engine VM
MainThread::INFO::2018-12-21
12:33:57,884::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineStarting (score: 3400)
MainThread::INFO::2018-12-21
12:34:04,898::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
VM is powering up..
.....
MainThread::INFO::2018-12-21
12:36:24,800::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
VM is powering up..
MainThread::INFO::2018-12-21
12:36:24,921::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineStarting (score: 3400)
---------------------------
HA agent log on ovirtnode1 (the spare HE host where the VM tries to start):
----------------------
MainThread::INFO::2018-12-21
12:33:52,984::states::510::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
Engine down and local host has best score (3400), attempting to start
engine VM
MainThread::INFO::2018-12-21
12:33:56,787::hosted_engine::947::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
Engine VM started on localhost
MainThread::INFO::2018-12-21
12:33:59,923::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition
(EngineStart-EngineStarting) sent? sent
MainThread::INFO::2018-12-21
12:33:59,936::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineStarting (score: 3400)
MainThread::INFO::2018-12-21
12:34:06,950::states::783::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
Another host already took over..
-----------------
WHAT IS THIS? What does "took over" mean here, and which process does
the taking over in this case?
------------------------------------
MainThread::INFO::2018-12-21
12:34:10,240::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineForceStop (score: 3400)
MainThread::INFO::2018-12-21
12:34:10,246::hosted_engine::1006::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
Shutting down vm using `/usr/sbin/hosted-engine --vm-poweroff`
MainThread::INFO::2018-12-21
12:34:10,797::hosted_engine::1011::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
stdout:
MainThread::INFO::2018-12-21
12:34:10,797::hosted_engine::1012::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
stderr:
MainThread::ERROR::2018-12-21
12:34:10,797::hosted_engine::1020::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
Engine VM stopped on localhost
-------------------------------------
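My guess at the EngineStarting -> EngineForceStop sequence on ovirtnode1
is a takeover race: both hosts saw the engine down with the best local
score, both started the VM, and the host that noticed the other copy
already running force-stopped its own. A hypothetical sketch of those
transitions, with state names taken from the log (the real agent's state
machine is richer, and the VM itself is arbitrated via sanlock leases on
shared storage):

```python
# Hypothetical sketch of the takeover race visible in the two agent logs.
# State names mirror the log output; this is my reading, not the actual
# ovirt-hosted-engine-ha implementation.

def next_state(state: str, engine_vm_up_elsewhere: bool,
               local_vm_up: bool) -> str:
    """Return the next agent state for one host."""
    if state == "EngineDown":
        # "Engine down and local host has best score, attempting to start"
        return "EngineStarting"
    if state == "EngineStarting":
        if engine_vm_up_elsewhere:
            # "Another host already took over.."
            return "EngineForceStop"
        return "EngineUp" if local_vm_up else "EngineStarting"
    if state == "EngineForceStop":
        # after `hosted-engine --vm-poweroff`: "Engine VM stopped on localhost"
        return "EngineDown"
    return state

# ovirtnode1's path during the incident:
path = ["EngineDown"]
for up_elsewhere in (False, True):
    path.append(next_state(path[-1], up_elsewhere, local_vm_up=False))
path.append(next_state(path[-1], False, False))
# EngineDown -> EngineStarting -> EngineForceStop -> EngineDown
```

If that reading is right, the "process" doing the takeover is simply the
other host's ovirt-ha-agent winning the start race, and the local
force-stop is the loser cleaning up its duplicate VM.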
I mounted the HE VM partitions that hold its logs; here are the syslog
messages. WHAT IS THIS guest-shutdown? Why? There were no network
problems at that time:
Dec 21 12:28:24 ovirtengine ovsdb-server:
ovs|36386|reconnect|WARN|ssl:[::ffff:127.0.0.1]:58834: connection
dropped (Protocol error)
Dec 21 12:28:24 ovirtengine python: ::ffff:172.16.10.101 - -
[21/Dec/2018 12:28:24] "GET /v2.0/networks HTTP/1.1" 200 -
Dec 21 12:30:44 ovirtengine ovsdb-server:
ovs|00111|reconnect|ERR|ssl:[::ffff:172.16.10.5]:42032: no response to
inactivity probe after 5 seconds, disconnecting
Dec 21 12:30:44 ovirtengine ovsdb-server:
ovs|00112|reconnect|ERR|ssl:[::ffff:172.16.10.1]:45624: no response to
inactivity probe after 5 seconds, disconnecting
Dec 21 12:31:07 ovirtengine qemu-ga: info: guest-shutdown called, mode:
powerdown
Dec 21 12:33:58 ovirtengine kernel: Linux version
3.10.0-862.14.4.el7.x86_64 (mockbuild(a)kbuilder.bsys.centos.org) (gcc
version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC) ) #1 SMP Wed Sep 26
15:12:11 UTC 2018
Dec 21 12:33:58 ovirtengine kernel: Command line:
BOOT_IMAGE=/vmlinuz-3.10.0-862.14.4.el7.x86_64
root=UUID=091e7022-295b-4b3f-96ad-4a7d90a2a9b0 ro crashkernel=auto
rd.lvm.lv=ovirt/swap console=ttyS0 LANG=en_US.UTF-8
.....
Dec 21 12:33:59 ovirtengine kernel: XFS (vda3): Ending clean mount
Dec 21 12:34:01 ovirtengine systemd: Mounted /sysroot.
....
Dec 21 12:34:06 ovirtengine lvm: 6 logical volume(s) in volume group
"ovirt" now active
Dec 21 12:34:06 ovirtengine systemd: Started LVM2 PV scan on device 252:2.
Dec 21 12:34:06 ovirtengine systemd: Found device /dev/mapper/ovirt-audit.
Dec 21 12:34:06 ovirtengine kernel: XFS (dm-5): Ending clean mount
Dec 21 12:34:06 ovirtengine systemd: Mounted /home.
Dec 21 12:34:07 ovirtengine kernel: XFS (dm-3): Ending clean mount
Dec 21 12:34:07 ovirtengine systemd: Mounted /var.
Dec 21 12:34:07 ovirtengine systemd: Starting Load/Save Random Seed...
Dec 21 12:34:07 ovirtengine systemd: Mounting /var/log...
Dec 21 12:34:07 ovirtengine kernel: XFS (dm-2): Mounting V5 Filesystem
Dec 21 12:34:07 ovirtengine systemd: Started Load/Save Random Seed.
Dec 21 12:34:08 ovirtengine kernel: XFS (dm-2): Ending clean mount
Dec 21 12:34:08 ovirtengine systemd: Mounted /var/log.
Dec 21 12:34:08 ovirtengine systemd: Starting Flush Journal to
Persistent Storage...
Dec 21 12:34:08 ovirtengine systemd: Mounting /var/log/audit...
Dec 21 12:34:08 ovirtengine kernel: XFS (dm-1): Mounting V5 Filesystem
Dec 21 12:34:08 ovirtengine systemd: Started Flush Journal to Persistent
Storage.
Dec 21 12:34:08 ovirtengine kernel: XFS (dm-1): Ending clean mount
Dec 21 12:34:08 ovirtengine systemd: Mounted /var/log/audit.
Dec 21 12:34:13 ovirtengine kernel: XFS (dm-4): Ending clean mount
Dec 21 12:34:13 ovirtengine systemd: Mounted /tmp.
Dec 21 12:34:13 ovirtengine systemd: Reached target Local File Systems.
.....
Dec 21 12:34:24 ovirtengine sshd[1324]: Server listening on 0.0.0.0 port
2222.
Dec 21 12:34:24 ovirtengine sshd[1324]: Server listening on :: port 2222.
Dec 21 12:34:25 ovirtengine rsyslogd: [origin software="rsyslogd"
swVersion="8.24.0" x-pid="1334"
x-info="http://www.rsyslog.com"] start
Dec 21 12:34:25 ovirtengine systemd: Started System Logging Service.
......
Dec 21 12:34:25 ovirtengine aliasesdb: BDB0196 Encrypted checksum: no
encryption key specified
Dec 21 12:34:25 ovirtengine aliasesdb: BDB0196 Encrypted checksum: no
encryption key specified
Dec 21 12:34:25 ovirtengine kernel: XFS (vda3): Internal error
XFS_WANT_CORRUPTED_GOTO at line 1664 of file fs/xfs/libxfs/xfs_alloc.c.
Caller xfs_free_extent+0xaa/0x140 [xfs]
Dec 21 12:34:25 ovirtengine kernel: CPU: 1 PID: 1379 Comm: postalias Not
tainted 3.10.0-862.14.4.el7.x86_64 #1
Dec 21 12:34:25 ovirtengine kernel: Hardware name: oVirt oVirt Node,
BIOS 1.11.0-2.el7 04/01/2014
Dec 21 12:34:25 ovirtengine kernel: XFS (vda3):
xfs_do_force_shutdown(0x8) called from line 236 of file
fs/xfs/libxfs/xfs_defer.c. Return address = 0xffffffffc02709cb
Dec 21 12:34:27 ovirtengine kernel: XFS (vda3): Corruption of in-memory
data detected. Shutting down filesystem
Dec 21 12:34:27 ovirtengine kernel: XFS (vda3): Please umount the
filesystem and rectify the problem(s)
......
Dec 21 12:34:27 ovirtengine ovirt-websocket-proxy.py: ImportError: No
module named websocketproxy
Dec 21 12:34:27 ovirtengine journal: 2018-12-21 12:34:27,189+0400
ovirt-engine-dwhd: ERROR run:554 Error: list index out of range
Dec 21 12:34:27 ovirtengine systemd: ovirt-websocket-proxy.service: main
process exited, code=killed, status=7/BUS
Dec 21 12:34:27 ovirtengine systemd: Failed to start oVirt Engine
websockets proxy.
Dec 21 12:34:27 ovirtengine systemd: Unit ovirt-websocket-proxy.service
entered failed state.
Dec 21 12:34:27 ovirtengine systemd: ovirt-websocket-proxy.service failed.
Dec 21 12:34:27 ovirtengine systemd: ovirt-engine-dwhd.service: main
process exited, code=exited, status=1/FAILURE
......
----------------------------