Host nodes (CentOS 7.5) are named ovirtnode1, ovirtnode5, and ovirtnode6.
Timeouts in the HA agent are at their defaults. Sanlock is configured
(I think). The HE runs on ovirtnode6, and a spare HE host is deployed on
ovirtnode1.
The VM was fixed (it seems) by the guestfish/xfs_repair method, which
requires zeroing the XFS metadata log and relies heavily on luck.
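For reference, the repair went roughly like this (a sketch rather than my
exact session; the disk image path is a placeholder, and forcelogzero
throws away unreplayed metadata log entries, which is where the luck comes
in):
-----------------
# Open the HE VM disk read-write in the libguestfs appliance
# (the image path is hypothetical):
guestfish --rw -a /path/to/hosted-engine-disk.img

><fs> run
# The guest's vda3 shows up as /dev/sda3 inside the appliance;
# zero the XFS metadata log and repair (pending log entries are lost):
><fs> xfs-repair /dev/sda3 forcelogzero:true
><fs> exit
-----------------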
1. Why, when the VM cannot boot due to corruption, does it show NOTHING at
all on the console?
I can get to the GRUB menu (if I move fast enough), but if I continue the
boot I see a blinking cursor for many minutes and nothing more. The GRUB
options contain no splash/quiet parameters.
(The exception is the EDD message, which is meaningless; with edd=off I get
only a black console.)
Where are the kernel boot logs / console output? Does it at least try to
load the initrd?
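One clue: the HE VM's kernel command line captured below contains
console=ttyS0, so the guest's boot messages go to the serial console rather
than the VGA one, which may be why I see only a black screen. A sketch of
how to watch that serial console from the hosting node (assuming the
standard tools):
-----------------
# Attach to the HE VM's serial console from the host it runs on:
hosted-engine --console

# or via libvirt (may prompt for the vdsm libvirt credentials):
virsh -c qemu:///system console HostedEngine
-----------------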
2. How can I set timeouts so that the ha-agent does NOT try to restart the
HE after 1-2 unsuccessful pings and a 10-second outage?
For the HE VM, stability (not crashing / not breaking the filesystem) is
more important than availability: I can live with it being unavailable for
10-15 seconds, but not with a broken VM.
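As a stopgap for planned network work I can put the cluster into global
maintenance, which stops the agents from making any restart/migration
decisions at all (a sketch, using the standard hosted-engine CLI):
-----------------
# Freeze all HA decisions cluster-wide before touching the network:
hosted-engine --set-maintenance --mode=global

# ... do the network maintenance ...

# Resume normal HA monitoring afterwards:
hosted-engine --set-maintenance --mode=none
-----------------
But that does not help with unplanned outages like the one below, which is
why I am asking about timeouts.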
---------------------------
Dec 21 12:32:56 ovirtnode6 kernel: bnx2x 0000:3b:00.0 enp59s0f0: NIC
Link is Down
Dec 21 12:32:56 ovirtnode6 kernel: ovirtmgmt: port 1(enp59s0f0) entered
disabled state
Dec 21 12:33:13 ovirtnode6 kernel: bnx2x 0000:3b:00.0 enp59s0f0: NIC
Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
Dec 21 12:33:13 ovirtnode6 kernel: IPv6: ADDRCONF(NETDEV_CHANGE):
enp59s0f0: link becomes ready
Dec 21 12:33:13 ovirtnode6 kernel: ovirtmgmt: port 1(enp59s0f0) entered
forwarding state
Dec 21 12:33:13 ovirtnode6 NetworkManager[1715]: <info>
[1545381193.2204] device (enp59s0f0): carrier: link connected
-----------------------
That is a 17-second outage; at 12:33:13 the link was back. BUT all the
events leading to the crash follow later:
HA agent log:
------------------------------
MainThread::INFO::2018-12-21
12:32:59,540::states::444::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
Engine vm running on localhost
MainThread::INFO::2018-12-21
12:32:59,662::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineUp (score: 3400)
MainThread::INFO::2018-12-21
12:33:09,797::states::136::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
Penalizing score by 1280 due to gateway status
MainThread::INFO::2018-12-21
12:33:09,798::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineUp (score: 2120)
MainThread::ERROR::2018-12-21
12:33:19,815::states::436::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
Host ovirtnode1.miac (id 1) score is significantly better than local
score, shutting down VM on this host
----------------------------------------------
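(For the arithmetic: the base score is 3400 and the gateway penalty is
1280, so the local score drops to 3400 - 1280 = 2120 while ovirtnode1 stays
at 3400; that 1280-point gap is apparently enough for the agent to treat
the remote score as "significantly better".)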
syslog messages:
Dec 21 12:33:19 ovirtnode6 journal: ovirt-ha-agent
ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Host
ovirtnode1.miac (id 1) score is significantly better than local score,
shutting down VM on this host
Dec 21 12:33:29 ovirtnode6 journal: ovirt-ha-agent
ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Engine VM
stopped on localhost
Dec 21 12:33:37 ovirtnode6 kernel: ovirtmgmt: port 3(vnet1) entered
disabled state
Dec 21 12:33:37 ovirtnode6 kernel: device vnet1 left promiscuous mode
Dec 21 12:33:37 ovirtnode6 kernel: ovirtmgmt: port 3(vnet1) entered
disabled state
Dec 21 12:33:37 ovirtnode6 NetworkManager[1715]: <info>
[1545381217.1796] device (vnet1): state change: disconnected ->
unmanaged (reason 'unmanaged', sys-iface-state: 'removed')
Dec 21 12:33:37 ovirtnode6 NetworkManager[1715]: <info>
[1545381217.1798] device (vnet1): released from master device ovirtmgmt
Dec 21 12:33:37 ovirtnode6 libvirtd: 2018-12-21 08:33:37.192+0000: 2783:
***error : qemuMonitorIO:719 : internal error: End of file from qemu
monitor*** - WHAT IS THIS?
Dec 21 12:33:37 ovirtnode6 kvm: 2 guests now active
Dec 21 12:33:37 ovirtnode6 systemd-machined: Machine qemu-2-HostedEngine
terminated.
Dec 21 12:33:37 ovirtnode6 firewalld[1693]: WARNING: COMMAND_FAILED:
'/usr/sbin/iptables -w2 -w -D libvirt-out -m physdev
--physdev-is-bridged --physdev-out vnet1 -g FP-vnet1' failed: iptables
v1.4.21: goto 'FP-vnet1' is not a chain#012#012Try `iptables -h' or
'iptables --help' for more information.
Dec 21 12:33:55 ovirtnode6 kernel: ovirtmgmt: port 3(vnet1) entered
blocking state
Dec 21 12:33:55 ovirtnode6 kernel: ovirtmgmt: port 3(vnet1) entered
disabled state
Dec 21 12:33:55 ovirtnode6 kernel: device vnet1 entered promiscuous mode
Dec 21 12:33:55 ovirtnode6 kernel: ovirtmgmt: port 3(vnet1) entered
blocking state
Dec 21 12:33:55 ovirtnode6 kernel: ovirtmgmt: port 3(vnet1) entered
forwarding state
Dec 21 12:33:55 ovirtnode6 lldpad: recvfrom(Event interface): No buffer
space available
Dec 21 12:33:55 ovirtnode6 NetworkManager[1715]: <info>
[1545381235.8086] manager: (vnet1): new Tun device
(/org/freedesktop/NetworkManager/Devices/37)
Dec 21 12:33:55 ovirtnode6 NetworkManager[1715]: <info>
[1545381235.8121] device (vnet1): state change: unmanaged -> unavailable
(reason 'connection-assumed', sys-iface-state: 'external')
Dec 21 12:33:55 ovirtnode6 NetworkManager[1715]: <info>
[1545381235.8127] device (vnet1): state change: unavailable ->
disconnected (reason 'none', sys-iface-state: 'external')
---------------------------
*** WHAT IS THIS ***? The link had been back up for some time already, so
why do these bridge state transitions and iptables errors happen?
And then the machine tries to start again:
-----------------
Dec 21 12:33:56 ovirtnode6 systemd-machined: New machine
qemu-15-HostedEngine.
Dec 21 12:33:56 ovirtnode6 systemd: Started Virtual Machine
qemu-15-HostedEngine.
-------------------
HA agent log on this host (ovirtnode6):
-----------------------------------------
MainThread::INFO::2018-12-21
12:33:49,880::states::510::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
Engine down and local host has best score (3400), attempting to start
engine VM
MainThread::INFO::2018-12-21
12:33:57,884::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineStarting (score: 3400)
MainThread::INFO::2018-12-21
12:34:04,898::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
VM is powering up..
.....
MainThread::INFO::2018-12-21
12:36:24,800::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
VM is powering up..
MainThread::INFO::2018-12-21
12:36:24,921::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineStarting (score: 3400)
---------------------------
HA agent log on ovirtnode1 (the spare HE host where the VM tries to start):
----------------------
MainThread::INFO::2018-12-21
12:33:52,984::states::510::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
Engine down and local host has best score (3400), attempting to start
engine VM
MainThread::INFO::2018-12-21
12:33:56,787::hosted_engine::947::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
Engine VM started on localhost
MainThread::INFO::2018-12-21
12:33:59,923::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition
(EngineStart-EngineStarting) sent? sent
MainThread::INFO::2018-12-21
12:33:59,936::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineStarting (score: 3400)
MainThread::INFO::2018-12-21
12:34:06,950::states::783::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
Another host already took over..
-----------------
WHAT IS THIS? What does "took over" mean here, and what does the process do
in this case?
------------------------------------
MainThread::INFO::2018-12-21
12:34:10,240::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineForceStop (score: 3400)
MainThread::INFO::2018-12-21
12:34:10,246::hosted_engine::1006::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
Shutting down vm using `/usr/sbin/hosted-engine --vm-poweroff`
MainThread::INFO::2018-12-21
12:34:10,797::hosted_engine::1011::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
stdout:
MainThread::INFO::2018-12-21
12:34:10,797::hosted_engine::1012::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
stderr:
MainThread::ERROR::2018-12-21
12:34:10,797::hosted_engine::1020::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
Engine VM stopped on localhost
-------------------------------------
I mounted the HE VM's partitions to read its logs. WHAT IS THIS
guest-shutdown? Why? There were no network problems at that time.
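Reading those logs offline looked roughly like this (a sketch; the image
path is a placeholder, and I am assuming the log LV is named ovirt/log -
adjust to whatever lvs reports):
-----------------
# Mount the guest's /var/log read-only on the host (paths hypothetical):
guestmount --ro -a /path/to/hosted-engine-disk.img -m /dev/ovirt/log /mnt/he-log
less /mnt/he-log/messages
guestunmount /mnt/he-log
-----------------
The syslog from the mounted /var/log: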
Dec 21 12:28:24 ovirtengine ovsdb-server:
ovs|36386|reconnect|WARN|ssl:[::ffff:127.0.0.1]:58834: connection
dropped (Protocol error)
Dec 21 12:28:24 ovirtengine python: ::ffff:172.16.10.101 - -
[21/Dec/2018 12:28:24] "GET /v2.0/networks HTTP/1.1" 200 -
Dec 21 12:30:44 ovirtengine ovsdb-server:
ovs|00111|reconnect|ERR|ssl:[::ffff:172.16.10.5]:42032: no response to
inactivity probe after 5 seconds, disconnecting
Dec 21 12:30:44 ovirtengine ovsdb-server:
ovs|00112|reconnect|ERR|ssl:[::ffff:172.16.10.1]:45624: no response to
inactivity probe after 5 seconds, disconnecting
Dec 21 12:31:07 ovirtengine qemu-ga: info: guest-shutdown called, mode:
powerdown
Dec 21 12:33:58 ovirtengine kernel: Linux version
3.10.0-862.14.4.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc
version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC) ) #1 SMP Wed Sep 26
15:12:11 UTC 2018
Dec 21 12:33:58 ovirtengine kernel: Command line:
BOOT_IMAGE=/vmlinuz-3.10.0-862.14.4.el7.x86_64
root=UUID=091e7022-295b-4b3f-96ad-4a7d90a2a9b0 ro crashkernel=auto
rd.lvm.lv=ovirt/swap console=ttyS0 LANG=en_US.UTF-8
.....
Dec 21 12:33:59 ovirtengine kernel: XFS (vda3): Ending clean mount
Dec 21 12:34:01 ovirtengine systemd: Mounted /sysroot.
....
Dec 21 12:34:06 ovirtengine lvm: 6 logical volume(s) in volume group
"ovirt" now active
Dec 21 12:34:06 ovirtengine systemd: Started LVM2 PV scan on device 252:2.
Dec 21 12:34:06 ovirtengine systemd: Found device /dev/mapper/ovirt-audit.
Dec 21 12:34:06 ovirtengine kernel: XFS (dm-5): Ending clean mount
Dec 21 12:34:06 ovirtengine systemd: Mounted /home.
Dec 21 12:34:07 ovirtengine kernel: XFS (dm-3): Ending clean mount
Dec 21 12:34:07 ovirtengine systemd: Mounted /var.
Dec 21 12:34:07 ovirtengine systemd: Starting Load/Save Random Seed...
Dec 21 12:34:07 ovirtengine systemd: Mounting /var/log...
Dec 21 12:34:07 ovirtengine kernel: XFS (dm-2): Mounting V5 Filesystem
Dec 21 12:34:07 ovirtengine systemd: Started Load/Save Random Seed.
Dec 21 12:34:08 ovirtengine kernel: XFS (dm-2): Ending clean mount
Dec 21 12:34:08 ovirtengine systemd: Mounted /var/log.
Dec 21 12:34:08 ovirtengine systemd: Starting Flush Journal to
Persistent Storage...
Dec 21 12:34:08 ovirtengine systemd: Mounting /var/log/audit...
Dec 21 12:34:08 ovirtengine kernel: XFS (dm-1): Mounting V5 Filesystem
Dec 21 12:34:08 ovirtengine systemd: Started Flush Journal to Persistent
Storage.
Dec 21 12:34:08 ovirtengine kernel: XFS (dm-1): Ending clean mount
Dec 21 12:34:08 ovirtengine systemd: Mounted /var/log/audit.
Dec 21 12:34:13 ovirtengine kernel: XFS (dm-4): Ending clean mount
Dec 21 12:34:13 ovirtengine systemd: Mounted /tmp.
Dec 21 12:34:13 ovirtengine systemd: Reached target Local File Systems.
.....
Dec 21 12:34:24 ovirtengine sshd[1324]: Server listening on 0.0.0.0 port
2222.
Dec 21 12:34:24 ovirtengine sshd[1324]: Server listening on :: port 2222.
Dec 21 12:34:25 ovirtengine rsyslogd: [origin software="rsyslogd"
swVersion="8.24.0" x-pid="1334"
x-info="http://www.rsyslog.com"] start
Dec 21 12:34:25 ovirtengine systemd: Started System Logging Service.
......
Dec 21 12:34:25 ovirtengine aliasesdb: BDB0196 Encrypted checksum: no
encryption key specified
Dec 21 12:34:25 ovirtengine aliasesdb: BDB0196 Encrypted checksum: no
encryption key specified
Dec 21 12:34:25 ovirtengine kernel: XFS (vda3): Internal error
XFS_WANT_CORRUPTED_GOTO at line 1664 of file fs/xfs/libxfs/xfs_alloc.c.
Caller xfs_free_extent+0xaa/0x140 [xfs]
Dec 21 12:34:25 ovirtengine kernel: CPU: 1 PID: 1379 Comm: postalias Not
tainted 3.10.0-862.14.4.el7.x86_64 #1
Dec 21 12:34:25 ovirtengine kernel: Hardware name: oVirt oVirt Node,
BIOS 1.11.0-2.el7 04/01/2014
Dec 21 12:34:25 ovirtengine kernel: XFS (vda3):
xfs_do_force_shutdown(0x8) called from line 236 of file
fs/xfs/libxfs/xfs_defer.c. Return address = 0xffffffffc02709cb
Dec 21 12:34:27 ovirtengine kernel: XFS (vda3): Corruption of in-memory
data detected. Shutting down filesystem
Dec 21 12:34:27 ovirtengine kernel: XFS (vda3): Please umount the
filesystem and rectify the problem(s)
......
Dec 21 12:34:27 ovirtengine ovirt-websocket-proxy.py: ImportError: No
module named websocketproxy
Dec 21 12:34:27 ovirtengine journal: 2018-12-21 12:34:27,189+0400
ovirt-engine-dwhd: ERROR run:554 Error: list index out of range
Dec 21 12:34:27 ovirtengine systemd: ovirt-websocket-proxy.service: main
process exited, code=killed, status=7/BUS
Dec 21 12:34:27 ovirtengine systemd: Failed to start oVirt Engine
websockets proxy.
Dec 21 12:34:27 ovirtengine systemd: Unit ovirt-websocket-proxy.service
entered failed state.
Dec 21 12:34:27 ovirtengine systemd: ovirt-websocket-proxy.service failed.
Dec 21 12:34:27 ovirtengine systemd: ovirt-engine-dwhd.service: main
process exited, code=exited, status=1/FAILURE
......
----------------------------