
What is your bandwidth threshold for the network used for VM migration? Can you set a 90 Mbit/s threshold (yes, less than 100 Mbit/s) and try to migrate a small (1 GB RAM) VM? Do you see disconnects? If not, raise the threshold a little and check again.
Best Regards,
Strahil Nikolov
On Aug 23, 2019 23:19, "Curtis E. Combs Jr." <ej.albany@gmail.com> wrote:
It took a while for my servers to come back on the network this time. I think it's due to ovirt continuing to try to migrate the VMs around like I requested. The 3 servers' names are "swm-01, swm-02 and swm-03". Eventually (about 2-3 minutes ago) they all came back online.
So I disabled and stopped the lldpad service.
Nope. Started some more migrations and swm-02 and swm-03 disappeared again. No ping, SSH hung, same as before - almost as soon as the migration started.
If you all have any ideas what switch-level setting might be enabled, let me know, 'cause I'm stumped. I can add it to the ticket that's requesting the port configurations. I've already added the port numbers and switch name that I got from CDP.
Thanks again, I really appreciate the help!
cecjr
On Fri, Aug 23, 2019 at 3:28 PM Dominik Holler <dholler@redhat.com> wrote:
On Fri, Aug 23, 2019 at 9:19 PM Dominik Holler <dholler@redhat.com> wrote:
On Fri, Aug 23, 2019 at 8:03 PM Curtis E. Combs Jr. <ej.albany@gmail.com> wrote:
This little cluster isn't in production or anything like that yet.
So, I went ahead and used your ethtool commands to disable pause frames on both interfaces of each server. I then chose a few VMs to migrate around at random.
swm-02 and swm-03 both went out again. Unreachable. Can't ping, can't ssh, and the SSH session that I had open was unresponsive.
Any other ideas?
Sorry, no. It looks like two different NICs with different drivers and firmware go down together. This is a strong indication that the root cause is related to the switch. Maybe you can get some information about the switch config with 'lldptool get-tlv -n -i em1'.
Another guess: after the optional 'lldptool get-tlv -n -i em1', run 'systemctl stop lldpad' and try to migrate again.
On Fri, Aug 23, 2019 at 1:50 PM Dominik Holler <dholler@redhat.com> wrote:
On Fri, Aug 23, 2019 at 6:45 PM Curtis E. Combs Jr. <ej.albany@gmail.com> wrote:
Unfortunately, I can't check on the switch. Trust me, I've tried. These servers are in a Co-Lo and I've put 5 tickets in asking about the port configuration. They just get ignored, but that's par for the course for IT here. Only about 2 out of 10 of our tickets get any response, and usually the response doesn't help. Then the system they use auto-closes the ticket. That was why I was suspecting STP before.
I can do ethtool. I do have root on these servers, though. Are you trying to get me to turn off link-speed auto-negotiation? Would you like me to try that?
It is just a suspicion that pause frames are the cause. Let's start on a NIC which is not used for ovirtmgmt, I guess em1. Does 'ethtool -S em1 | grep pause' show something? Does 'ethtool em1 | grep pause' indicate support for pause? The current config is shown by 'ethtool -a em1'. '-A autoneg' "Specifies whether pause autonegotiation should be enabled." according to the ethtool doc. Assuming flow control is enabled by default, I would try to disable it via 'ethtool -A em1 autoneg off rx off tx off', check that it is applied via 'ethtool -a em1', and check whether the behavior under load changes.
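To repeat the "check that it is applied" step across several hosts and interfaces, something like the following might help. It is only a sketch in Python, not anything oVirt or ethtool ships: the function names are made up for illustration, and the parser assumes 'ethtool -a' prints lines of the form "Autonegotiate: on", "RX: on", "TX: on", which may differ between ethtool versions.

```python
import subprocess


def parse_pause_settings(ethtool_output: str) -> dict:
    """Parse the text printed by `ethtool -a <iface>` into {name: bool}.

    Assumes "Name: on|off" lines; anything else (e.g. the header) is skipped.
    """
    settings = {}
    for line in ethtool_output.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip().lower()
        if sep and value in ("on", "off"):
            settings[key.strip()] = value == "on"
    return settings


def flow_control_disabled(settings: dict) -> bool:
    """True only if Autonegotiate, RX and TX are all present and off."""
    return all(settings.get(k) is False for k in ("Autonegotiate", "RX", "TX"))


def disable_command(iface: str) -> list:
    """Build the disable command from the thread; returned, not executed."""
    return ["ethtool", "-A", iface, "autoneg", "off", "rx", "off", "tx", "off"]


def read_pause_settings(iface: str) -> dict:
    """Run `ethtool -a <iface>` (needs ethtool installed) and parse it."""
    result = subprocess.run(
        ["ethtool", "-a", iface], capture_output=True, text=True, check=True
    )
    return parse_pause_settings(result.stdout)
```

On a host, calling flow_control_disabled(read_pause_settings("em1")) after running the disable command as root would show whether the NIC actually accepted the change; some drivers may keep reporting a setting as on.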
On Fri, Aug 23, 2019 at 12:24 PM Dominik Holler <dholler@redhat.com> wrote:
On Fri, Aug 23, 2019 at 5:49 PM Curtis E. Combs Jr. <ej.albany@gmail.com> wrote:
Sure! Right now, I only have a 5