Dan,
Thanks for your interest and assistance.
----- Original Message -----
From: "Dan Kenigsberg" <danken(a)redhat.com>
To: "Lewis Shobbrook" <lew(a)redgrid.net>
Cc: users(a)ovirt.org
Sent: Friday, October 3, 2014 7:00:54 AM
Subject: Re: [ovirt-users] Falsely detected network device status
On Thu, Oct 02, 2014 at 12:14:40AM +1000, Lewis Shobbrook wrote:
> Hi All,
>
> I've encountered a network device that is being falsely detected as
> down.
> On adding a new network to a 3.4 hosted engine, one of the two
> hosts is displaying a component interface as down with the
> webadmin console.
Which network did you add? On top of which nics? could you share the
setupNetwork command from vdsm.log?
Here's the output relating to setupNetworks below (from vdsm.log on the working node).
Thread-14::DEBUG::2014-09-30 14:53:29,487::BindingXMLRPC::1067::vds::(wrapper) client
[172.17.171.247]::call setupNetworks with ({'EVD_DMZ': {'nic':
'eth1', 'STP': 'no', 'bridged': 'true'}}, {},
{'connectivityCheck': 'true', 'connectivityTimeout': 120}) {}
flowID [1be04f8a]
Thread-13::DEBUG::2014-09-30 14:53:29,488::BindingXMLRPC::1067::vds::(wrapper) client
[172.17.171.247]::call ping with () {} flowID [1be04f8a]
Thread-13::DEBUG::2014-09-30 14:53:29,490::BindingXMLRPC::1074::vds::(wrapper) return ping
with {'status': {'message': 'Done', 'code': 0}}
Thread-13::DEBUG::2014-09-30 14:53:29,999::BindingXMLRPC::1067::vds::(wrapper) client
[172.17.171.247]::call ping with () {} flowID [1be04f8a]
Thread-13::DEBUG::2014-09-30 14:53:30,005::BindingXMLRPC::1074::vds::(wrapper) return ping
with {'status': {'message': 'Done', 'code': 0}}
Thread-13::DEBUG::2014-09-30 14:53:30,508::BindingXMLRPC::1067::vds::(wrapper) client
[172.17.171.247]::call ping with () {} flowID [1be04f8a]
Thread-13::DEBUG::2014-09-30 14:53:30,509::BindingXMLRPC::1074::vds::(wrapper) return ping
with {'status': {'message': 'Done', 'code': 0}}
Thread-13::DEBUG::2014-09-30 14:53:31,012::BindingXMLRPC::1067::vds::(wrapper) client
[172.17.171.247]::call ping with () {} flowID [1be04f8a]
Thread-13::DEBUG::2014-09-30 14:53:31,012::BindingXMLRPC::1074::vds::(wrapper) return ping
with {'status': {'message': 'Done', 'code': 0}}
Thread-13::DEBUG::2014-09-30 14:53:31,515::BindingXMLRPC::1067::vds::(wrapper) client
[172.17.171.247]::call ping with () {} flowID [1be04f8a]
Thread-13::DEBUG::2014-09-30 14:53:31,516::BindingXMLRPC::1074::vds::(wrapper) return ping
with {'status': {'message': 'Done', 'code': 0}}
Thread-14::DEBUG::2014-09-30 14:53:31,697::BindingXMLRPC::1074::vds::(wrapper) return
setupNetworks with {'status': {'message': 'Done', 'code':
0}}
Thread-5598871::DEBUG::2014-09-30 14:53:31,816::BindingXMLRPC::1067::vds::(wrapper) client
[172.17.170.100]::call getCapabilities with () {}
Thread-5598871::DEBUG::2014-09-30 14:53:31,818::utils::642::root::(execCmd) '/sbin/ip
route show to 0.0.0.0/0 table all' (cwd None)
Thread-5598871::DEBUG::2014-09-30 14:53:31,842::utils::662::root::(execCmd) SUCCESS:
<err> = ''; <rc> = 0
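For reference, the setupNetworks call captured at the top of that log takes three positional arguments (networks, bondings, options). Decoded into Python, the payload from the log is as follows; the rollback meaning of connectivityCheck is my understanding of vdsm's behaviour, not something shown in the log itself:

```python
# The three arguments vdsm's setupNetworks received, as decoded from the
# log line above (argument names per the vdsm API: networks, bondings, options).
networks = {'EVD_DMZ': {'nic': 'eth1',       # attach the bridge to eth1
                        'STP': 'no',         # spanning tree disabled
                        'bridged': 'true'}}  # create a Linux bridge
bondings = {}                                # no bonds touched
options = {'connectivityCheck': 'true',      # roll back if engine connectivity is lost
           'connectivityTimeout': 120}       # ... within 120 seconds
```

The repeated ping calls that follow in the log are the engine exercising that connectivity check before the 120-second window expires.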
On our problem node, surprisingly, there is no setupNetworks entry at all, but I thought
I'd mention this entry, which seems odd:
Thread-5032587::INFO::2014-09-30 14:53:43,559::netinfo::500::root::(_getNetInfo) Obtaining
info for net virbr0.
Traceback (most recent call last):
File "/usr/lib64/python2.6/site-packages/vdsm/netinfo.py", line 479, in
_getNetInfo
data.update({'ports': ports(iface), 'stp': bridge_stp_state(iface),
File "/usr/lib64/python2.6/site-packages/vdsm/netinfo.py", line 197, in ports
return os.listdir('/sys/class/net/' + bridge + '/brif')
OSError: [Errno 2] No such file or directory: '/sys/class/net/virbr0/brif'
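That traceback reads like a race: virbr0 (libvirt's default bridge) was still listed under /sys/class/net when netinfo started, but its brif directory was gone by the time ports() ran. A defensive sketch of that lookup, as an illustration of the race rather than vdsm's actual fix:

```python
import errno
import os

def bridge_ports(bridge):
    """List the interfaces enslaved to a bridge, tolerating the bridge
    being removed between enumeration and this lookup."""
    try:
        return os.listdir('/sys/class/net/' + bridge + '/brif')
    except OSError as e:
        if e.errno == errno.ENOENT:
            return []  # bridge vanished under us: report no ports
        raise
```

With a guard like that, a disappearing virbr0 would yield an empty port list instead of killing _getNetInfo.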
Roughly 4 seconds later we start to see the VM migrations, with the node set inactive;
apparently unrelated to the connectivityTimeout set in setupNetworks.
> Ethtool confirms the link status as up and connectivity through the
> corresponding interfaces of the network has been proven.
> Shortly after adding the network I noted that the node with the
> falsely detected iface status, was set Non-Operational.
> All running VM's evacuated.
>
> The engine.log showing...
> INFO
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler_Worker-33) [2daa7788] Correlation ID:
> null, Call Stack: null, Custom Event ID: -1, Message: Host
> hosted_engine_2 moved to Non-Operational state because interfaces
> 'eth1' are down but are needed by networks 'EVD_DMZ' in the
> current cluster
Do you have /var/log/vdsm/connectivity.log (it's a recent addition)?
It should have traces of eth1 going down. Does it? When did it
happen? What else happened at that time on the host?
No connectivity log unfortunately.
The strange thing is that the interface is up and stays up, but the correct state is not
detected or updated.
The other, working node is of identical hardware, connected to the same switch.
I've reactivated the problem node by removing the default cluster setting of
"require all" for the troubled logical network.
Given the misadventure here, I'm wondering if "require all" is the safest
default.
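As a cross-check next time it happens, the kernel's own view of the link can be read straight from sysfs, independently of vdsm's reporting. A small sketch; the net_root parameter is my addition so the function can be exercised without a live /sys:

```python
import os

def iface_operstate(iface, net_root='/sys/class/net'):
    """Return the kernel's operational state for an interface
    ('up', 'down', 'unknown', ...), or None if the interface is absent.
    net_root is parameterised so the lookup can be tested offline."""
    path = os.path.join(net_root, iface, 'operstate')
    try:
        with open(path) as f:
            return f.read().strip()
    except IOError:
        return None
```

On the problem node, iface_operstate('eth1') returning 'up' while the webadmin shows the NIC as down would confirm the fault lies in the reporting path rather than in the link itself.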
>
> 2 other interfaces for ovirtmgmt and another VM network on this
> same host have long functioned perfectly well.
>
> When assigning the network to the interface I see the following in
> messages...
>
> Oct 1 20:57:39 lx004 kernel: ADDRCONF(NETDEV_UP): eth1: link is
> not ready
> Oct 1 20:57:39 lx004 kernel: 8021q: adding VLAN 0 to HW filter on
> device eth1
> Oct 1 20:57:39 lx004 kernel: device eth1 entered promiscuous mode
> Oct 1 20:57:40 lx004 kernel: igb: eth1 NIC Link is Up 100 Mbps
> Full Duplex, Flow Control: RX
> Oct 1 20:57:40 lx004 kernel: ADDRCONF(NETDEV_CHANGE): eth1: link
> becomes ready
> Oct 1 20:57:40 lx004 kernel: EVD_DMZ: port 1(eth1) entering
> forwarding state
>
> Looks as it should to me, brctl also shows the bridge is as it
> should be.
>
> # brctl show
> bridge name bridge id STP enabled interfaces
> ;vdsmdummy; 8000.000000000000 no
> EVD_DMZ 8000.0025901abc49 no eth1
> PXE 8000.0007e9a5a197 no eth3
> ovirtmgmt 8000.0025901abc48 no eth0
>
> Not able to see anything relevant in vdsm.log
>
> vdsClient -s 0 getVdsStats
> does not list the new network, but instead lists an older deleted
> one.
Could you elaborate on that? What's reported there exactly, and what
should be? Is getVdsCaps up-to-date?
Yes, getVdsCaps is up to date and lists the new network, but getVdsStats lists an old,
deleted network (ALS_SCO); only on our problem node.
Here's a snippet from the stats output on the troubled node...
anonHugePages = '11114'
cpuIdle = '96.55'
cpuLoad = '0.65'
cpuSys = '0.96'
cpuSysVdsmd = '1.87'
cpuUser = '2.49'
cpuUserVdsmd = '6.09'
dateTime = '2014-10-04T12:58:49 GMT'
diskStats = {'/tmp': {'free': '44526'},
'/var/log': {'free': '44526'},
'/var/log/core': {'free': '44526'},
'/var/run/vdsm/': {'free': '44526'}}
elapsedTime = '6469523'
generationID = '9e1d568d-7070-4c2d-8450-e0e56ccba533'
haScore = 2400
haStats = {'active': True,
'configured': True,
'globalMaintenance': False,
'localMaintenance': False,
'score': 2400}
ksmCpu = 0
ksmPages = 100
ksmState = False
memAvailable = 21479
memCommitted = 0
memFree = 21735
memShared = 0
memUsed = '53'
momStatus = 'inactive'
netConfigDirty = 'False'
network = {';vdsmdummy;': {'name': ';vdsmdummy;',
'rxDropped': '0',
'rxErrors': '0',
'rxRate': '0.0',
'speed': '1000',
'state': 'down',
'txDropped': '0',
'txErrors': '0',
'txRate': '0.0'},
'ALS_SCO': {'name': 'ALS_SCO',
'rxDropped': '0',
'rxErrors': '0',
'rxRate': '0.0',
'speed': '1000',
'state': 'up',
'txDropped': '0',
'txErrors': '0',
'txRate': '0.0'},
'PXE': {'name': 'PXE',
...
The working node, meanwhile, lists the stats for the new network as expected.
...
network = {';vdsmdummy;': {'name': ';vdsmdummy;',
'rxDropped': '0',
'rxErrors': '0',
'rxRate': '0.0',
'speed': '1000',
'state': 'down',
'txDropped': '0',
'txErrors': '0',
'txRate': '0.0'},
'EVD_DMZ': {'name': 'EVD_DMZ',
'rxDropped': '0',
'rxErrors': '0',
'rxRate': '0.0',
'speed': '1000',
'state': 'up',
'txDropped': '0',
'txErrors': '0',
'txRate': '0.0'},
'PXE': {'name': 'PXE',
...
Do you think perhaps the stats update could be falling over the same obstacle as the
interface status update for eth1?
There is an orphaned VM that was left in a locked state following a failed attempt to
commit a snapshot change some time ago.
I'm wondering if this might be causing an exception.
I've seen details of SQL cleanups employed against VMs in similar states, but
have not yet had time to attempt such surgery.
Due to a critical failure on a legacy system, this proof-of-concept oVirt deployment was
promoted from semi-active to active service with the addition of this new network.
The added pain of this development is that I now lack the same freedom to break things in
the name of science.
Can't help myself but smile at the comedy of it.
Cheers,
Lew