Hi Tivon,

I think the most interesting one to look at is /var/log/messages; however, it's probably best to simply archive the whole /var/log directory.
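Something like this quick Python sketch will pack the whole directory into one timestamped tarball you can attach. It is just an example; the output path under /root is only an assumption, so adjust it to wherever you have space:

#!/usr/bin/env python3
# Rough sketch: archive the whole /var/log tree into a timestamped tarball.
# Run as root on the host; OUTPUT_DIR is an assumption, point it anywhere with space.
import socket
import tarfile
import time

LOG_DIR = "/var/log"
OUTPUT_DIR = "/root"

def archive_logs():
    stamp = time.strftime("%Y%m%d-%H%M%S")
    name = f"{OUTPUT_DIR}/logs-{socket.gethostname()}-{stamp}.tar.gz"
    with tarfile.open(name, "w:gz") as tar:
        # add() recurses into the directory, so this grabs every log in one go
        tar.add(LOG_DIR, arcname="var-log")
    return name

if __name__ == "__main__":
    print("Wrote", archive_logs())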

Thanks in advance,

On Thu, Jul 15, 2021 at 1:36 PM Tivon Häberlein <tivon.haeberlein@secges.de> wrote:

Hi Lev,

thanks for your reply.
I'll gladly grab the logs in the next couple of days (got to go back to the DC to swap the cards back).

Can you give me a list of logs I should grab so I don't miss any?

-- 
Best regards
Tivon Häberlein
On 15.07.2021 01:25, Lev Veyde wrote:
Hi Tivon,

I personally think it's worth reproducing the issue and getting the logs, even though it really does sound like a driver/kernel issue.
That may help us understand why it happens, and maybe even lead to a driver/kernel fix.

Thanks in advance,

On Thu, Jul 15, 2021 at 12:38 AM Tivon Häberlein <tivon.haeberlein@secges.de> wrote:
Hi Nathaniel,

thanks for your time here, and sorry for the late reply.

Even though my NICs didn't use the e1000e driver, I grabbed a Broadcom NIC from the stash and gave it a try.
I'm happy to report that the interfaces don't seem to reset with the Broadcom NIC.
This strongly suggests there is some driver issue with the Intel NICs I have been using.

I still can't get the host into an operational state because of "Failed to connect Host n3 to Storage Pool cl1", even though the NFS share is mounted properly, but I think that's a separate issue.

If you want, I can reproduce the issue and grab all the logs, so the community might end up with a fix other than "get a Broadcom NIC".
To be honest though, I suspect that if we start digging this will just end up on the list of weird driver issues in CentOS.
-- 
Best regards
Tivon Häberlein


On 13.07.2021 01:07, Nathaniel Roach via Users wrote:


On 12/7/21 11:44 pm, Nathaniel Roach via Users wrote:

Do you get anything in the logs at all? For something like this I would expect messages from the kernel to show up in syslog.

It really does sound like the e1000e issue, but it will probably have a different fix. I first encountered it on a router when I was pushing more than 100 Mbps in and then back out of the same NIC; otherwise it wouldn't happen at all. That would explain why it's not an issue in maintenance mode and why downloading an image works fine.
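If grepping by hand gets tedious, a quick sketch like this pulls the interesting lines out of /var/log/messages. The patterns are only my guesses at what the Intel drivers and the NFS client typically log, so extend them with whatever your logs actually show:

#!/usr/bin/env python3
# Quick sketch: print syslog lines that look like NIC resets, hangs or NFS timeouts.
# The patterns below are examples of typical driver/kernel messages; adjust as needed.
import re
import sys

LOGFILE = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"

PATTERNS = re.compile(
    r"(Reset adapter|Detected Hardware Unit Hang|NETDEV WATCHDOG|"
    r"link is not ready|not responding)",
    re.IGNORECASE,
)

with open(LOGFILE, errors="replace") as fh:
    for line in fh:
        if PATTERNS.search(line):
            print(line.rstrip())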

On 12/7/21 7:57 am, Tivon Häberlein wrote:

Hi Strahil,

the server uses Intel NICs with ixgbe and igb kernel drivers.
I did upgrade the firmware to the latest available version (through the Dell Lifecycle Controller).
I also tried replacing the network card itself, but without success.

As this issue did not arise when running Debian 10, or even oVirt Node before adding it to the cluster, I don't think it's hardware related. For my testing I mounted my oVirt datastore manually on the fresh install of oVirt Node (using the ISO) and then copied a large ISO file to the local disk. This saturates the NIC at the full 1 Gbit/s I have available there for a good 5 to 10 minutes.
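For reference, the copy test is nothing fancier than this rough sketch; the source and destination paths are placeholders for my NFS mount and a local target:

#!/usr/bin/env python3
# Rough sketch of the copy test: read a big ISO from the NFS mount, write it locally,
# and print throughput so I can see the moment the NIC stalls or resets.
# SRC and DST are placeholders for my actual paths.
import time

SRC = "/mnt/nfs-datastore/big.iso"
DST = "/var/tmp/big.iso"
CHUNK = 4 * 1024 * 1024  # 4 MiB per read

copied = 0
start = time.time()
with open(SRC, "rb") as src, open(DST, "wb") as dst:
    while True:
        buf = src.read(CHUNK)
        if not buf:
            break
        dst.write(buf)
        copied += len(buf)
        rate = copied / 1e6 / max(time.time() - start, 1e-3)
        print(f"\r{copied / 1e6:9.0f} MB  {rate:6.1f} MB/s", end="", flush=True)
print()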
Also, administration through Cockpit works perfectly before adding the host to the cluster.

As soon as I add the node to the cluster, the trouble starts:
1. oVirt reports that the install has failed on this host
2. the node logs adapter resets (in the kernel log) on some interfaces (even ones that aren't UP)

Having read your message again, are you able to capture these log events before the node gets fenced (or just disable fencing for the time being)? See the sketch right after your list for one way to keep capturing them.

3. the engine loses connection to the host and declares it "Unresponsive"
4. the node becomes unmanageable through Cockpit or SSH because the connection is lost repeatedly
5. the fencing agent reboots the node (if fencing is enabled)
6. the node comes back up and gets added to the cluster again (oVirt reports the node as UP)
7. repeat from step 2
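A rough sketch of what I mean by capturing them as they happen: keep streaming the kernel log into a file that survives the reboot (basically what dmesg --follow does). The output path is just a placeholder; point it at persistent storage or a remote mount, and run it as root.

#!/usr/bin/env python3
# Sketch: continuously append kernel messages to a file, similar to 'dmesg --follow'.
# OUT is a placeholder; put it somewhere that survives the fence/reboot.
OUT = "/var/tmp/kmsg-capture.log"

with open("/dev/kmsg", errors="replace") as kmsg, open(OUT, "a", buffering=1) as out:
    for record in kmsg:  # blocks until the kernel emits the next record
        out.write(record)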

It seems that this behavior stops when I put the node into maintenance mode. Then I can even mount the datastore manually and transfer large ISOs without the connection dropping.

This is all very strange and I don't understand what causes this.

Thank you.

-- 
Best regards
Tivon Häberlein
On 11.07.2021 13:51, Strahil Nikolov wrote:
Are you sure it's not a HW issue ?
Try updating the server to the latest firmware and test again. At least it won't hurt.

Best Regards,
Strahil Nikolov

On Sat, Jul 10, 2021 at 14:45, Tivon Häberlein

Hi,

I've been trying to get oVirt Node 4.4.6 up and running on my Dell R620 hosts, but I am facing a strange issue where seemingly all network adapters get reset at random times after the install.
The interfaces reset as soon as any real traffic flows through them.
The logs also show NFS timeouts.

This only happens after I have installed the host through the oVirt engine, and only while the host is connected to the engine. When the host is in maintenance mode it also seems to work fine.

The host and its networks work fine when it's on its own (I tested right after installing from the ISO, and also after removing the host from the cluster).

I can't figure out why this is happening. Am I missing something?
I've been stuck on this for the last couple of weeks; a bit of help would be much appreciated.

Thank you!

My cluster looks like this:
Engine: oVirt 4.4.6 - CentOS Linux release 8.3.2011
host1: oVirt 4.4 repository on CentOS Linux release 8.4.2105
host2: oVirt 4.4 repository on CentOS Linux release 8.4.2105
host3 (this is the one I'm trying to install): oVirt node 4.4.6

-- 
Best regards
Tivon Häberlein

--

Nathaniel Roach










--

Lev Veyde

Senior Software Engineer, RHCE | RHCVA | MCITP

Red Hat Israel

lev@redhat.com | lveyde@redhat.com