On Tue, 17 Nov 2020 at 16:01, Anton Louw <Anton.Louw@voxtelecom.co.za> wrote:


Hi Sandro,

 

Have you perhaps seen anything in the SOS report that could shed some light on the issues?


Sadly, no. I see it's oVirt Node 4.3.8; I suggest upgrading to at least 4.3.10, and considering an upgrade of the whole datacenter to 4.4.3.
I had the feeling that the watchdog triggered the reboot, but I couldn't find any evidence of that.
I also don't see anything suspicious in the logs.


 

 

Thanks

 


Anton Louw
Cloud Engineer: Storage and Virtualization at Vox

T:  087 805 0000 | D: 087 805 1572
M: N/A
E: anton.louw@voxtelecom.co.za
A: Rutherford Estate, 1 Scott Street, Waverley, Johannesburg
www.vox.co.za


From: Anton Louw
Sent: 16 November 2020 07:30
To: Sandro Bonazzola <sbonazzo@redhat.com>; Arik Hadas <ahadas@redhat.com>; Dominik Holler <dholler@redhat.com>
Cc: users@ovirt.org; Johan Koen <Johan.Koen@voxtelecom.co.za>
Subject: RE: [ovirt-users] oVirt Node Crash

 

I have also attached the SOS report as requested

 

From: Anton Louw
Sent: 16 November 2020 06:54
To: Sandro Bonazzola <sbonazzo@redhat.com>; Arik Hadas <ahadas@redhat.com>; Dominik Holler <dholler@redhat.com>
Cc: users@ovirt.org; Johan Koen <Johan.Koen@voxtelecom.co.za>
Subject: RE: [ovirt-users] oVirt Node Crash

 

Hi Sandro,

 

Thanks for the response. I logged on to oVirt this morning, and I see the node is in an “Unassigned” state. I can ping it but cannot SSH to it, so something is making the host unresponsive.

 

On Saturday after I sent the mail, I opened a console to the node, and I saw the below entries before logging in:

 

audit:backlog limit exceeded

 

I then tried the solution of increasing the buffer size in the audit.rules file in /etc/audit/rules.d/, as shown below, but it did not resolve the issue.

 

## First rule - delete all
-D

## Increase the buffers to survive stress events.
## Make this bigger for busy systems
-b 8192

## Set failure mode to syslog
-f 1
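
As a side note for anyone trying the same change: rules under /etc/audit/rules.d/ normally only take effect once they are regenerated and loaded (for example with `augenrules --load`, or after a reboot), and `auditctl -s` reports the limit the kernel is actually using. The sketch below is a minimal, hypothetical Python helper that parses that status output to check whether `-b 8192` is in effect and how full the backlog is. The `SAMPLE` text is made up for illustration, and the parser assumes the one `key value` pair per line layout printed by recent audit versions; older versions print a different format.

```python
def parse_audit_status(text):
    """Parse `auditctl -s` style 'key value' lines into a dict of ints.

    Lines that do not consist of exactly one key and one integer are skipped.
    """
    status = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].lstrip("-").isdigit():
            status[parts[0]] = int(parts[1])
    return status


def backlog_pressure(status):
    """Return backlog / backlog_limit, or None if the limit is unknown."""
    limit = status.get("backlog_limit", 0)
    if limit <= 0:
        return None
    return status.get("backlog", 0) / limit


# Hypothetical sample output, not taken from the affected host.
SAMPLE = """enabled 1
failure 1
pid 1021
rate_limit 0
backlog_limit 8192
lost 0
backlog 0
"""

if __name__ == "__main__":
    status = parse_audit_status(SAMPLE)
    print("backlog_limit in effect:", status.get("backlog_limit"))
    print("events lost so far:", status.get("lost"))
    print("backlog fill ratio:", backlog_pressure(status))
```

If `backlog_limit` still shows the old value, the new rule file was probably not loaded; a steadily growing `lost` counter would suggest the limit is still too small or that auditd itself is stalled.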

 

Is it possible to upgrade the node to 4.4 while the engine is still on 4.3?

 

Thanks

 

From: Sandro Bonazzola <sbonazzo@redhat.com>
Sent: 13 November 2020 18:39
To: Anton Louw <Anton.Louw@voxtelecom.co.za>; Arik Hadas <ahadas@redhat.com>; Dominik Holler <dholler@redhat.com>
Cc: users@ovirt.org; Johan Koen <Johan.Koen@voxtelecom.co.za>
Subject: Re: [ovirt-users] oVirt Node Crash

 

 

 

On Fri, 13 Nov 2020 at 17:37, Sandro Bonazzola <sbonazzo@redhat.com> wrote:

 

 

On Fri, 13 Nov 2020 at 13:38, Anton Louw via Users <users@ovirt.org> wrote:

 

Hi Everybody,

 

I have built a new host which has been running fine for the last couple of days. I noticed today that the host crashed, but it is not giving me a reason as to why.

 

It happened at 13:45 today, but I have also included the logs from some time before that.

 

Is there something I am missing here?

 

Not related to the crash, but I see in the logs that 5 out of 20 guests have qemu guest agent not responding.

 

You also seem to have issues with some firewalld rules (maybe +Dominik Holler would like to have a look).

 

I don't see anything explaining why the host got rebooted.

 

Still on the guest agent, I find the following lines a bit alarming:

Nov 13 13:29:34 jb2-node03 libvirtd: 2020-11-13 11:29:34.294+0000: 12603: error : qemuDomainAgentAvailable:9144 : Guest agent is not responding: QEMU guest agent is not connected
Nov 13 13:29:34 jb2-node03 vdsm[13843]: ERROR Shutdown by QEMU Guest Agent failed#012Traceback (most recent call last):#012  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5304, in qemuGuestAgentShutdown#012    self._dom.shutdownFlags(libvirt.VIR_DOMAIN_SHUTDOWN_GUEST_AGENT)#012  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 100, in f#012    ret = attr(*args, **kwargs)#012  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper#012    ret = f(*args, **kwargs)#012  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 94, in wrapper#012    return func(inst, *args, **kwargs)#012  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 2517, in shutdownFlags#012    if ret == -1: raise libvirtError ('virDomainShutdownFlags() failed', dom=self)#012libvirtError: Guest agent is not responding: QEMU guest agent is not connected
Nov 13 13:29:42 jb2-node03 kernel: vlan0077: port 11(vnet15) entered disabled state
Nov 13 13:29:42 jb2-node03 kernel: device vnet15 left promiscuous mode
Nov 13 13:29:42 jb2-node03 kernel: vlan0077: port 11(vnet15) entered disabled state
Nov 13 13:29:42 jb2-node03 NetworkManager[6027]: <info>  [1605266982.6539] device (vnet15): state change: disconnected -> unmanaged (reason 'unmanaged', sys-iface-state: 'removed')
Nov 13 13:29:42 jb2-node03 NetworkManager[6027]: <info>  [1605266982.6550] device (vnet15): released from master device vlan0077
Nov 13 13:29:42 jb2-node03 libvirtd: 2020-11-13 11:29:42.669+0000: 12557: error : qemuMonitorIO:718 : internal error: End of file from qemu monitor
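
For readability: the `#012` sequences in the vdsm line above are how rsyslog typically escapes control characters (`#` followed by a three-digit octal code, so `#012` is a newline). A minimal sketch for expanding them back into a multi-line traceback follows; the function name is hypothetical, and the approach assumes the message contains no literal `#` followed by three octal digits of its own.

```python
import re

# Matches rsyslog-style control character escapes such as "#012" (newline).
_ESCAPE = re.compile(r"#([0-7]{3})")


def expand_syslog_escapes(line):
    """Replace each '#ooo' octal escape with the character it encodes.

    Caveat: a literal '#' followed by three octal digits in the original
    message would be converted as well.
    """
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 8)), line)


if __name__ == "__main__":
    # Shortened excerpt of the vdsm line quoted above.
    excerpt = (
        "ERROR Shutdown by QEMU Guest Agent failed"
        "#012Traceback (most recent call last):"
        "#012libvirtError: Guest agent is not responding: "
        "QEMU guest agent is not connected"
    )
    print(expand_syslog_escapes(excerpt))
```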

 

+Arik Hadas any clue?

 

About the crash: can you please provide a full sos report from the host? The log you provided is not enough to understand what caused the reported crash.

 

Also, given that python2 is used here, I assume you're on 4.3 or older. I would recommend upgrading to 4.4 as soon as practical.

 

 

 

 

 

 

Thanks

 




 

_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-leave@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/XMRUDMRBYZKUJQXVPPAEAJIP7N3JPRLY/


 
