On Wed, Sep 5, 2018 at 1:05 PM Miguel Duarte de Mora Barroso <
mdbarroso(a)redhat.com> wrote:
> Hi Gianluca,
> I really don't think it should.
Hi Miguel,
thanks for your feedback.
Actually, my doubts and my question originate from a particular failure I
detected. Let me describe the environment in more detail.
I have two hypervisors, hv1 and hv2, with oVirt 4.2.5. They are placed in
two different racks, rack1 and rack2.
I have a virtual cluster used for testing/scalability purposes, composed
of 4 pacemaker/corosync nodes running CentOS 7.4.
Two nodes (cl1 and cl2) of this virtual cluster are VMs running on hv1,
and two nodes (cl3 and cl4) are VMs running on hv2.
hv1 is in rack1 and hv2 is in rack2.
They simulate a possible future scenario of a physical stretched cluster
with 2 nodes in datacenter1 and 2 nodes in datacenter2.
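For reference, the corosync membership is a plain 4-node nodelist along
these lines (a sketch: the node IDs match the numbers in the TOTEM
messages below; the cl1/cl2 addresses are inferred from the membership
ring IDs, the others are placeholders):

nodelist {
    node {
        ring0_addr: 172.16.1.68    # cl1
        nodeid: 1
    }
    node {
        ring0_addr: 172.16.1.69    # cl2
        nodeid: 2
    }
    node {
        ring0_addr: 172.16.1.63    # cl3 (inferred)
        nodeid: 3
    }
    node {
        ring0_addr: 172.16.1.x     # cl4 (placeholder)
        nodeid: 4
    }
}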
Due to a network problem, rack1 was isolated for about one minute.
Here is what was logged in /var/log/messages on the cl* nodes:
- cl1
Aug 31 14:53:33 cl1 corosync[1291]: [TOTEM ] A processor failed, forming new configuration.
Aug 31 14:53:36 cl1 corosync[1291]: [TOTEM ] A new membership (172.16.1.68:436) was formed. Members left: 4 2 3
Aug 31 14:53:36 cl1 corosync[1291]: [TOTEM ] Failed to receive the leave message. failed: 4 2 3
- cl2
Aug 31 14:53:33 cl2 corosync[32749]: [TOTEM ] A processor failed, forming new configuration.
Aug 31 14:53:36 cl2 corosync[32749]: [TOTEM ] A new membership (172.16.1.69:436) was formed. Members left: 4 1 3
Aug 31 14:53:36 cl2 corosync[32749]: [TOTEM ] Failed to receive the leave message. failed: 4 1 3
- cl3
Aug 31 14:53:33 cl3 corosync[1282]: [TOTEM ] A processor failed, forming new configuration.
Aug 31 14:54:10 cl3 corosync[1282]: [TOTEM ] A new membership (172.16.1.63:432) was formed. Members left: 1 2
Aug 31 14:54:10 cl3 corosync[1282]: [TOTEM ] Failed to receive the leave message. failed: 1 2
- cl4
Aug 31 14:53:33 cl4 corosync[1295]: [TOTEM ] A processor failed, forming new configuration.
Aug 31 14:54:10 cl4 corosync[1295]: [TOTEM ] A new membership (172.16.1.63:432) was formed. Members left: 1 2
Aug 31 14:54:10 cl4 corosync[1295]: [TOTEM ] Failed to receive the leave message. failed: 1 2
The intra-cluster network of this virtual cluster is on OVN, and the
isolation of rack1 caused the virtual nodes on hv1 to lose contact with
all the other nodes, including the other VM running inside the same
hypervisor.
So cl1 lost 2, 3 and 4, and cl2 lost 1, 3 and 4, while cl3 and cl4 only
lost 1 and 2.
I expected that, even with hv1 isolated, the VMs cl1 and cl2 would still
be able to see each other over their OVN-based vNICs, since traffic
between two VMs on the same hypervisor should be switched locally on
br-int and never leave the host.
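If it helps reproduce the analysis, my understanding is that local
delivery between the two VM ports can be checked on the host with an
OpenFlow trace on br-int; a sketch, using the vnet names from the
'ovs-ofctl show br-int' output further below (the VM MAC addresses here
are assumptions; oVirt vNICs normally use the 00:1a:4a prefix, so adjust
dl_src/dl_dst to the real ones):

# Trace a unicast frame entering br-int on cl1's port (vnet1) towards
# cl2's MAC; the output shows which flows match and whether the frame
# is delivered locally to vnet3 or dropped.
ovs-appctl ofproto/trace br-int \
    in_port=vnet1,dl_src=00:1a:4a:16:01:07,dl_dst=00:1a:4a:16:01:08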
Just for reference, the host hv1 (real hostname ov200) logged the
messages below (the storage domains are on iSCSI, so they were
inaccessible during the rack1 isolation):
Aug 31 14:53:04 ov200 ovn-controller: ovs|26823|reconnect|ERR|ssl:10.4.192.49:6642: no response to inactivity probe after 5 seconds, disconnecting
Aug 31 14:53:11 ov200 kernel: connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4562746767, last ping 4562751768, now 4562756784
Aug 31 14:53:11 ov200 kernel: connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4562746768, last ping 4562751770, now 4562756784
Aug 31 14:53:11 ov200 kernel: connection1:0: detected conn error (1022)
Aug 31 14:53:11 ov200 kernel: connection2:0: detected conn error (1022)
Aug 31 14:53:11 ov200 iscsid: Kernel reported iSCSI connection 1:0 error (1022 - Invalid or unknown error code) state (3)
Aug 31 14:53:11 ov200 iscsid: Kernel reported iSCSI connection 2:0 error (1022 - Invalid or unknown error code) state (3)
...
Aug 31 14:54:04 ov200 multipathd: 8:32: reinstated
Aug 31 14:54:04 ov200 kernel: device-mapper: multipath: Reinstating path 8:32.
Aug 31 14:54:04 ov200 multipathd: 36090a0d88034667163b315f8c906b0ac: remaining active paths: 2
> Could you provide the output of 'ovs-ofctl dump-flows br-int' *before*
> and *after* the engine is shut down?
Unfortunately not.
If it can help, here is the output on hv1 in the current situation, with
everything OK and cl1 and cl2 running on hv1:
https://drive.google.com/file/d/1gLtpkKFCBXV46lXJYsMlbonp853EqLun/view?us...
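For a next occurrence I can capture comparable snapshots; a minimal
sketch (the file names are just examples):

# On hv1, before shutting down the engine VM:
ovs-ofctl dump-flows br-int > /tmp/flows-before.txt
ovs-vsctl show > /tmp/vsctl-before.txt
ovs-ofctl show br-int > /tmp/ofctl-before.txt

# After the engine has been down for a minute or so:
ovs-ofctl dump-flows br-int > /tmp/flows-after.txt
diff -u /tmp/flows-before.txt /tmp/flows-after.txt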
My question regarding the engine originated because, as a side effect of
the rack1 isolation, the engine (a VM in another environment, configured
as the OVN provider) was also unreachable for about one minute during the
problem. And I saw this as the first line of the ov200 log above:
Aug 31 14:53:04 ov200 ovn-controller: ovs|26823|reconnect|ERR|ssl:10.4.192.49:6642: no response to inactivity probe after 5 seconds, disconnecting
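As far as I understand, that message only means that ovn-controller lost
its SSL session to the OVN southbound database (port 6642 on the engine)
and reconnected once the engine came back. If the 5-second probe turns
out to be too aggressive for short engine outages, ovn-controller(8)
documents a per-host tunable for it, in milliseconds; for example (30000
is an arbitrary value):

# Run on each hypervisor: raise the inactivity probe towards the
# southbound DB to 30 s.
ovs-vsctl set Open_vSwitch . external_ids:ovn-remote-probe-interval=30000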
> Also the outputs of 'ovs-vsctl show' and 'ovs-ofctl show br-int', also
> before and after the engine shutdown.
Here they are now, with everything OK and cl1 and cl2 running on hv1:
# ovs-vsctl show
0c8ccaa3-b215-4860-8102-0ea7a24ebcaf
    Bridge br-int
        fail_mode: secure
        Port "ovn-8eea86-0"
            Interface "ovn-8eea86-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip="10.4.192.48"}
        Port br-int
            Interface br-int
                type: internal
        Port "vnet3"
            Interface "vnet3"
        Port "vnet1"
            Interface "vnet1"
    ovs_version: "2.9.0"
# ovs-ofctl show br-int
OFPT_FEATURES_REPLY (xid=0x2): dpid:0000ce296715474c
n_tables:254, n_buffers:0
capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst
 1(vnet1): addr:fe:1a:4a:16:01:07
     config:     0
     state:      0
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 2(vnet3): addr:fe:1a:4a:16:01:08
     config:     0
     state:      0
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 5(ovn-8eea86-0): addr:92:30:37:41:00:43
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 LOCAL(br-int): addr:ce:29:67:15:47:4c
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0
All of the above was collected on the host where the VMs are running.
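If useful, I can also check from the engine side whether hv1's chassis
re-registered cleanly after the outage; as far as I know, the registered
chassis and their Geneve tunnel endpoints can be listed on the OVN
central side (the engine, in this setup) with:

# Both hypervisors should appear as chassis with a geneve encap.
ovn-sbctl show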
> Another question: is the OVN network you created an overlay, or is it
> attached to a physical network?
I think it is an overlay, because the switch type of the cluster is
"Linux Bridge", so the OVN network is not attached to a physical network;
the Geneve tunnel port (ovn-8eea86-0, remote_ip 10.4.192.48) in the
'ovs-vsctl show' output above is consistent with that.
> Regards,
> Miguel
Thanks in advance for your time,
Gianluca