Well, this proves that the issue is not the bandwidth usage, but something else.

My personal opinion is that you should change the colocation - if that is an option at all ...

Best Regards,
Strahil Nikolov
On Saturday, August 24, 2019, 22:16:18 GMT+3, Curtis E. Combs Jr. <ej.albany@gmail.com> wrote:


I applied a 90 Mbit/s QoS rate limit, with shares set to 10, to both
interfaces of two of the hosts. The hosts' names are swm-01 and swm-02.

Creating a small VM from a Cinder template and running it gave me a test VM.

When I migrated it from swm-01 to swm-02, swm-01 immediately stopped
responding to pings and SSH, and the oVirt interface marked it as
"NonResponsive" soon after the VM finished migrating. The migration
did complete, but I'm unsure whether it was a clean migration.

Thank you, Strahil.

On Sat, Aug 24, 2019 at 12:39 PM Strahil <hunter86_bg@yahoo.com> wrote:
>
> What is your bandwidth threshold for the network used for VM migration?
> Can you set a 90 Mbit/s threshold (yes, less than 100 Mbit/s) and try to migrate a small (1 GB RAM) VM?
>
> Do you see disconnects?
>
> If not, raise the threshold a little and check again.
>
> Best Regards,
> Strahil Nikolov
>
> On Aug 23, 2019 23:19, "Curtis E. Combs Jr." <ej.albany@gmail.com> wrote:
> >
> > It took a while for my servers to come back on the network this time.
> > I think it's due to oVirt continuing to try to migrate the VMs around
> > like I requested. The 3 servers' names are "swm-01, swm-02 and
> > swm-03". Eventually (about 2-3 minutes ago) they all came back online.
> >
> > So I disabled and stopped the lldpad service.
> >
> > Nope. Started some more migrations and swm-02 and swm-03 disappeared
> > again. No ping, SSH hung, same as before - almost as soon as the
> > migration started.
> >
> > If you all have any ideas what switch-level setting might be enabled,
> > let me know, because I'm stumped. I can add it to the ticket that's
> > requesting the port configurations. I've already added the port
> > numbers and switch name that I got from CDP.
> >
> > Thanks again, I really appreciate the help!
> > cecjr
> >
> >
> >
> > On Fri, Aug 23, 2019 at 3:28 PM Dominik Holler <dholler@redhat.com> wrote:
> > >
> > >
> > >
> > > On Fri, Aug 23, 2019 at 9:19 PM Dominik Holler <dholler@redhat.com> wrote:
> > >>
> > >>
> > >>
> > >> On Fri, Aug 23, 2019 at 8:03 PM Curtis E. Combs Jr. <ej.albany@gmail.com> wrote:
> > >>>
> > >>> This little cluster isn't in production or anything like that yet.
> > >>>
> > >>> So, I went ahead and used your ethtool commands to disable pause
> > >>> frames on both interfaces of each server. I then chose a few VMs to
> > >>> migrate around at random.
> > >>>
> > >>> swm-02 and swm-03 both went out again. Unreachable. Can't ping, can't
> > >>> ssh, and the SSH session that I had open was unresponsive.
> > >>>
> > >>> Any other ideas?
> > >>>
> > >>
> > >> Sorry, no. It looks like two different NICs with different drivers and firmware go down together.
> > >> This is a strong indication that the root cause is related to the switch.
> > >> Maybe you can get some information about the switch config by
> > >> 'lldptool get-tlv -n -i em1'
> > >>
> > >
> > > Another guess:
> > > After the optional 'lldptool get-tlv -n -i em1', run
> > > 'systemctl stop lldpad'
> > > and then try to migrate again.
> > >
> > >
> > >>
> > >>
> > >>>
> > >>> On Fri, Aug 23, 2019 at 1:50 PM Dominik Holler <dholler@redhat.com> wrote:
> > >>> >
> > >>> >
> > >>> >
> > >>> > On Fri, Aug 23, 2019 at 6:45 PM Curtis E. Combs Jr. <ej.albany@gmail.com> wrote:
> > >>> >>
> > >>> >> Unfortunately, I can't check on the switch. Trust me, I've tried.
> > >>> >> These servers are in a Co-Lo and I've put 5 tickets in asking about
> > >>> >> the port configuration. They just get ignored - but that's par for the
> > >>> >> course for IT here. Only about 2 out of 10 of our tickets get any
> > >>> >> response and usually the response doesn't help. Then the system they
> > >>> >> use auto-closes the ticket. That was why I was suspecting STP before.
> > >>> >>
> > >>> >> I can do ethtool. I do have root on these servers, though. Are you
> > >>> >> trying to get me to turn off link-speed auto-negotiation? Would you
> > >>> >> like me to try that?
> > >>> >>
> > >>> >
> > >>> > It is just a suspicion that pause frames are the cause.
> > >>> > Let's start on a NIC which is not used for ovirtmgmt, I guess em1.
> > >>> > Does 'ethtool -S em1 | grep pause' show something?
> > >>> > Does 'ethtool em1 | grep pause' indicate support for pause?
> > >>> > The current config is shown by 'ethtool -a em1'.
> > >>> > '-A autoneg' "Specifies whether pause autonegotiation should be enabled." according to the ethtool doc.
> > >>> > Assuming flow control is enabled by default, I would try to disable it via
> > >>> > 'ethtool -A em1 autoneg off rx off tx off'
> > >>> > and check if it is applied via
> > >>> > 'ethtool -a em1'
> > >>> > and check if the behavior under load changes.
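The ethtool sequence above can be sketched as a small shell helper. This is only a sketch under stated assumptions: pause_all_off and disable_pause are hypothetical helper names (not part of ethtool), em1 is the interface guessed in the message, and the parser assumes the usual "Autonegotiate:", "RX:" and "TX:" lines in 'ethtool -a' output.

```shell
#!/bin/sh
# Sketch only: pause_all_off and disable_pause are hypothetical helpers.

# Reads `ethtool -a <iface>` output on stdin and succeeds only when
# Autonegotiate, RX and TX are all reported as "off".
pause_all_off() {
    awk '
        /^(Autonegotiate|RX|TX):/ { seen++; if ($2 != "off") bad = 1 }
        END { if (seen == 3 && !bad) exit 0; exit 1 }
    '
}

# Disables flow control on one interface (needs root), then verifies the
# result from `ethtool -a`. The exit status of `ethtool -A` is not used,
# because it may be non-zero when nothing had to be changed.
disable_pause() {
    iface="$1"
    ethtool -A "$iface" autoneg off rx off tx off
    ethtool -a "$iface" | pause_all_off
}
```

Run as root, for example: disable_pause em1 && echo "flow control is off on em1". Then repeat the migration and check whether the behavior under load changes. Settings applied with 'ethtool -A' normally do not survive a reboot, which keeps this experiment easy to undo.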
> > >>> >
> > >>> >
> > >>> >
> > >>> >>
> > >>> >> On Fri, Aug 23, 2019 at 12:24 PM Dominik Holler <dholler@redhat.com> wrote:
> > >>> >> >
> > >>> >> >
> > >>> >> >
> > >>> >> > On Fri, Aug 23, 2019 at 5:49 PM Curtis E. Combs Jr. <ej.albany@gmail.com> wrote:
> > >>> >> >>
> > >>> >> >> Sure! Right now, I only have a 5