vm with disks on multiple glusterfs domains fails if a gluster host goes down

It seems that a VM with 3 disks (the boot disk in the engine domain, another disk in domain vol1 and a third in domain vol3) became unresponsive when one gluster host went down.

To explain the situation a bit: I have 3 GlusterFS hosts, g1, g2 and g3, serving 3 volumes. Each host has 3 bricks:
- g1: vol1, vol2, vol3 (arbiter)
- g2: vol1, vol2 (arbiter), vol3
- g3: vol1 (arbiter), vol2, vol3

libgfapi is enabled. I put one host into maintenance to update its BIOS, and the VM that had disks in two domains became unresponsive. Is this normal?

In the qemu logs, the domain configuration shows host1 as the primary server for vol1 and host2 as the primary for vol3, with the other two hosts as backup-volfile-servers. It seems it always tries to connect to the server that is down and not to one of the alternative hosts. Is this a libgfapi/libvirt problem?

Here are some libvirt logs showing what it tries to do:

[2018-09-10 19:43:42.876114] T [socket.c:3133:socket_connect] 0-vol1-client-2: connecting 0x55ed673525c0, state=2 gen=0 sock=-1
[2018-09-10 19:43:42.876124] T [name.c:243:af_inet_client_get_remote_sockaddr] 0-vol1-client-2: option remote-port missing in volume vol1-client-2. Defaulting to 24007
[2018-09-10 19:43:42.878566] D [socket.c:3051:socket_fix_ssl_opts] 0-vol1-client-2: disabling SSL for portmapper connection
[2018-09-10 19:43:42.878770] T [socket.c:834:__socket_nodelay] 0-vol1-client-2: NODELAY enabled for socket 30
[2018-09-10 19:43:42.878780] T [socket.c:920:__socket_keepalive] 0-vol1-client-2: Keep-alive enabled for socket: 30, (idle: 20, interval: 2, max-probes: 9, timeout: 0)
[2018-09-10 19:43:42.878830] T [rpc-clnt.c:406:rpc_clnt_reconnect] 0-vol3-client-1: attempting reconnect
[2018-09-10 19:43:42.878846] T [socket.c:3133:socket_connect] 0-vol3-client-1: connecting 0x55ed673546c0, state=2 gen=0 sock=-1
[2018-09-10 19:43:42.878856] T [name.c:243:af_inet_client_get_remote_sockaddr] 0-vol3-client-1: option remote-port missing in volume vol3-client-1. Defaulting to 24007
[2018-09-10 19:43:42.881229] D [socket.c:3051:socket_fix_ssl_opts] 0-vol3-client-1: disabling SSL for portmapper connection
[2018-09-10 19:43:42.881255] T [socket.c:834:__socket_nodelay] 0-vol3-client-1: NODELAY enabled for socket 38
[2018-09-10 19:43:42.881264] T [socket.c:920:__socket_keepalive] 0-vol3-client-1: Keep-alive enabled for socket: 38, (idle: 20, interval: 2, max-probes: 9, timeout: 0)
[2018-09-10 19:43:45.569298] T [socket.c:724:__socket_disconnect] 0-vol3-client-1: disconnecting 0x55ed673546c0, state=2 gen=0 sock=38
[2018-09-10 19:43:45.569308] T [socket.c:724:__socket_disconnect] 0-vol1-client-2: disconnecting 0x55ed673525c0, state=2 gen=0 sock=30
[2018-09-10 19:43:45.570000] T [socket.c:728:__socket_disconnect] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fddbadade9b] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x4ea0)[0x7fdda7bbfea0] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x530a)[0x7fdda7bc030a] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x9a08)[0x7fdda7bc4a08] (--> /lib64/libglusterfs.so.0(+0x883c4)[0x7fddbae093c4] ))))) 0-vol3-client-1: tearing down socket connection
[2018-09-10 19:43:45.570020] D [socket.c:686:__socket_shutdown] 0-vol3-client-1: shutdown() returned -1. Transport endpoint is not connected
[2018-09-10 19:43:45.570038] D [socket.c:733:__socket_disconnect] 0-vol3-client-1: __socket_teardown_connection () failed: Transport endpoint is not connected
[2018-09-10 19:43:45.570043] D [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-09-10 19:43:45.570907] T [socket.c:728:__socket_disconnect] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fddbadade9b] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x4ea0)[0x7fdda7bbfea0] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x530a)[0x7fdda7bc030a] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x9a08)[0x7fdda7bc4a08] (--> /lib64/libglusterfs.so.0(+0x883c4)[0x7fddbae093c4] ))))) 0-vol1-client-2: tearing down socket connection
[2018-09-10 19:43:45.570928] D [socket.c:686:__socket_shutdown] 0-vol1-client-2: shutdown() returned -1. Transport endpoint is not connected
[2018-09-10 19:43:45.570936] D [socket.c:733:__socket_disconnect] 0-vol1-client-2: __socket_teardown_connection () failed: Transport endpoint is not connected
[2018-09-10 19:43:45.570940] D [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-09-10 19:43:45.570960] D [rpc-clnt-ping.c:99:rpc_clnt_remove_ping_timer_locked] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fddbadade9b] (--> /lib64/libgfrpc.so.0(rpc_clnt_remove_ping_timer_locked+0x8b)[0x7fddbab7828b] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x5f)[0x7fddbab7460f] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7fddbab75130] (--> /lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fddbab70ea3] ))))) 0-: 10.xxx.xxx.130:24007: ping timer event already removed
[2018-09-10 19:43:45.571098] D [rpc-clnt-ping.c:99:rpc_clnt_remove_ping_timer_locked] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fddbadade9b] (--> /lib64/libgfrpc.so.0(rpc_clnt_remove_ping_timer_locked+0x8b)[0x7fddbab7828b] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x5f)[0x7fddbab7460f] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7fddbab75130] (--> /lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fddbab70ea3] ))))) 0-: 10.xxx.xxx.130:24007: ping timer event already removed
[2018-09-10 19:43:45.878885] T [rpc-clnt.c:406:rpc_clnt_reconnect] 0-vol1-client-2: attempting reconnect
[2018-09-10 19:43:45.881546] T [socket.c:834:__socket_nodelay] 0-vol1-client-2: NODELAY enabled for socket 38
[2018-09-10 19:43:45.881555] T [socket.c:920:__socket_keepalive] 0-vol1-client-2: Keep-alive enabled for socket: 38, (idle: 20, interval: 2, max-probes: 9, timeout: 0)
[2018-09-10 19:43:45.883839] D [socket.c:3051:socket_fix_ssl_opts] 0-vol3-client-1: disabling SSL for portmapper connection
[2018-09-10 19:43:45.883878] T [socket.c:834:__socket_nodelay] 0-vol3-client-1: NODELAY enabled for socket 30
[2018-09-10 19:43:45.883886] T [socket.c:920:__socket_keepalive] 0-vol3-client-1: Keep-alive enabled for socket: 30, (idle: 20, interval: 2, max-probes: 9, timeout: 0)
[2018-09-10 19:43:48.575316] T [socket.c:724:__socket_disconnect] 0-vol3-client-1: disconnecting 0x55ed673546c0, state=2 gen=0 sock=30
[2018-09-10 19:43:48.575329] T [socket.c:724:__socket_disconnect] 0-vol1-client-2: disconnecting 0x55ed673525c0, state=2 gen=0 sock=38
[2018-09-10 19:43:48.576022] T [socket.c:728:__socket_disconnect] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fddbadade9b] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x4ea0)[0x7fdda7bbfea0] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x530a)[0x7fdda7bc030a] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x9a08)[0x7fdda7bc4a08] (--> /lib64/libglusterfs.so.0(+0x883c4)[0x7fddbae093c4] ))))) 0-vol3-client-1: tearing down socket connection
[2018-09-10 19:43:48.576045] D [socket.c:686:__socket_shutdown] 0-vol3-client-1: shutdown() returned -1. Transport endpoint is not connected
[2018-09-10 19:43:48.576054] D [socket.c:733:__socket_disconnect] 0-vol3-client-1: __socket_teardown_connection () failed: Transport endpoint is not connected
[2018-09-10 19:43:48.576059] D [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-09-10 19:43:48.576079] T [socket.c:728:__socket_disconnect] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fddbadade9b] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x4ea0)[0x7fdda7bbfea0] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x530a)[0x7fdda7bc030a] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x9a08)[0x7fdda7bc4a08] (--> /lib64/libglusterfs.so.0(+0x883c4)[0x7fddbae093c4] ))))) 0-vol1-client-2: tearing down socket connection
[2018-09-10 19:43:48.576099] D [socket.c:686:__socket_shutdown] 0-vol1-client-2: shutdown() returned -1. Transport endpoint is not connected
[2018-09-10 19:43:48.576106] D [socket.c:733:__socket_disconnect] 0-vol1-client-2: __socket_teardown_connection () failed: Transport endpoint is not connected
[2018-09-10 19:43:48.576111] D [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-09-10 19:43:48.576879] D [rpc-clnt-ping.c:99:rpc_clnt_remove_ping_timer_locked] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fddbadade9b] (--> /lib64/libgfrpc.so.0(rpc_clnt_remove_ping_timer_locked+0x8b)[0x7fddbab7828b] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x5f)[0x7fddbab7460f] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7fddbab75130] (--> /lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fddbab70ea3] ))))) 0-: 10.xxx.xxx.130:24007: ping timer event already removed
[2018-09-10 19:43:48.576958] D [rpc-clnt-ping.c:99:rpc_clnt_remove_ping_timer_locked] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fddbadade9b] (--> /lib64/libgfrpc.so.0(rpc_clnt_remove_ping_timer_locked+0x8b)[0x7fddbab7828b] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x5f)[0x7fddbab7460f] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7fddbab75130] (--> /lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fddbab70ea3] ))))) 0-: 10.xxx.xxx.130:24007: ping timer event already removed
[2018-09-10 19:43:48.881651] T [rpc-clnt.c:406:rpc_clnt_reconnect] 0-vol1-client-2: attempting reconnect
[2018-09-10 19:43:48.881667] T [socket.c:3133:socket_connect] 0-vol1-client-2: connecting 0x55ed673525c0, state=2 gen=0 sock=-1
[2018-09-10 19:43:48.881689] T [name.c:243:af_inet_client_get_remote_sockaddr] 0-vol1-client-2: option remote-port missing in volume vol1-client-2. Defaulting to 24007
[2018-09-10 19:43:48.884056] T [rpc-clnt.c:406:rpc_clnt_reconnect] 0-vol3-client-1: attempting reconnect
[2018-09-10 19:43:48.884072] T [socket.c:3133:socket_connect] 0-vol3-client-1: connecting 0x55ed673546c0, state=2 gen=0 sock=-1
[2018-09-10 19:43:48.884084] T [name.c:243:af_inet_client_get_remote_sockaddr] 0-vol3-client-1: option remote-port missing in volume vol3-client-1. Defaulting to 24007
[2018-09-10 19:43:48.884190] D [socket.c:3051:socket_fix_ssl_opts] 0-vol1-client-2: disabling SSL for portmapper connection
[2018-09-10 19:43:48.886524] T [socket.c:834:__socket_nodelay] 0-vol3-client-1: NODELAY enabled for socket 30
[2018-09-10 19:43:48.886532] T [socket.c:920:__socket_keepalive] 0-vol3-client-1: Keep-alive enabled for socket: 30, (idle: 20, interval: 2, max-probes: 9, timeout: 0)
[2018-09-10 19:43:51.581293] T [socket.c:724:__socket_disconnect] 0-vol3-client-1: disconnecting 0x55ed673546c0, state=2 gen=0 sock=30
[2018-09-10 19:43:51.581293] T [socket.c:724:__socket_disconnect] 0-vol1-client-2: disconnecting 0x55ed673525c0, state=2 gen=0 sock=38
[2018-09-10 19:43:51.582009] T [socket.c:728:__socket_disconnect] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fddbadade9b] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x4ea0)[0x7fdda7bbfea0] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x530a)[0x7fdda7bc030a] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x9a08)[0x7fdda7bc4a08] (--> /lib64/libglusterfs.so.0(+0x883c4)[0x7fddbae093c4] ))))) 0-vol1-client-2: tearing down socket connection
[2018-09-10 19:43:51.582030] D [socket.c:686:__socket_shutdown] 0-vol1-client-2: shutdown() returned -1. Transport endpoint is not connected
[2018-09-10 19:43:51.582036] D [socket.c:733:__socket_disconnect] 0-vol1-client-2: __socket_teardown_connection () failed: Transport endpoint is not connected
[2018-09-10 19:43:51.582040] D [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-09-10 19:43:51.582084] T [socket.c:728:__socket_disconnect] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fddbadade9b] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x4ea0)[0x7fdda7bbfea0] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x530a)[0x7fdda7bc030a] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x9a08)[0x7fdda7bc4a08] (--> /lib64/libglusterfs.so.0(+0x883c4)[0x7fddbae093c4] ))))) 0-vol3-client-1: tearing down socket connection
[2018-09-10 19:43:51.582105] D [socket.c:686:__socket_shutdown] 0-vol3-client-1: shutdown() returned -1. Transport endpoint is not connected
[2018-09-10 19:43:51.582111] D [socket.c:733:__socket_disconnect] 0-vol3-client-1: __socket_teardown_connection () failed: Transport endpoint is not connected
[2018-09-10 19:43:51.582116] D [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-09-10 19:43:51.582812] D [rpc-clnt-ping.c:99:rpc_clnt_remove_ping_timer_locked] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fddbadade9b] (--> /lib64/libgfrpc.so.0(rpc_clnt_remove_ping_timer_locked+0x8b)[0x7fddbab7828b] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x5f)[0x7fddbab7460f] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7fddbab75130] (--> /lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fddbab70ea3] ))))) 0-: 10.xxx.xxx.130:24007: ping timer event already removed
[2018-09-10 19:43:51.582865] D [rpc-clnt-ping.c:99:rpc_clnt_remove_ping_timer_locked] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fddbadade9b] (--> /lib64/libgfrpc.so.0(rpc_clnt_remove_ping_timer_locked+0x8b)[0x7fddbab7828b] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x5f)[0x7fddbab7460f] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7fddbab75130] (--> /lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fddbab70ea3] ))))) 0-: 10.xxx.xxx.130:24007: ping timer event already removed
[2018-09-10 19:43:51.884349] T [rpc-clnt.c:406:rpc_clnt_reconnect] 0-vol1-client-2: attempting reconnect
[2018-09-10 19:43:51.884367] T [socket.c:3133:socket_connect] 0-vol1-client-2: connecting 0x55ed673525c0, state=2 gen=0 sock=-1
[2018-09-10 19:43:51.884376] T [name.c:243:af_inet_client_get_remote_sockaddr] 0-vol1-client-2: option remote-port missing in volume vol1-client-2. Defaulting to 24007
[2018-09-10 19:43:51.886644] T [rpc-clnt.c:406:rpc_clnt_reconnect] 0-vol3-client-1: attempting reconnect
[2018-09-10 19:43:51.886659] T [socket.c:3133:socket_connect] 0-vol3-client-1: connecting 0x55ed673546c0, state=2 gen=0 sock=-1
[2018-09-10 19:43:51.886669] T [name.c:243:af_inet_client_get_remote_sockaddr] 0-vol3-client-1: option remote-port missing in volume vol3-client-1. Defaulting to 24007
[2018-09-10 19:43:51.887251] D [socket.c:3051:socket_fix_ssl_opts] 0-vol1-client-2: disabling SSL for portmapper connection
[2018-09-10 19:43:51.887281] T [socket.c:834:__socket_nodelay] 0-vol1-client-2: NODELAY enabled for socket 38
[2018-09-10 19:43:51.887290] T [socket.c:920:__socket_keepalive] 0-vol1-client-2: Keep-alive enabled for socket: 38, (idle: 20, interval: 2, max-probes: 9, timeout: 0)
[2018-09-10 19:43:51.889141] D [socket.c:3051:socket_fix_ssl_opts] 0-vol3-client-1: disabling SSL for portmapper connection
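The claim that the client keeps retrying rather than failing over can be checked mechanically by grouping the "attempting reconnect" entries per client translator and measuring the gap between them. The following is a minimal Python sketch, not part of any Gluster or oVirt tooling: it assumes one log entry per line in the format shown above, and `reconnect_intervals`, `LINE_RE` and `SAMPLE` are hypothetical names. Entries carrying a `(--> ...)` backtrace do not match the pattern and are skipped.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches plain gluster client log entries such as
#   [2018-09-10 19:43:45.878885] T [rpc-clnt.c:406:rpc_clnt_reconnect] 0-vol1-client-2: attempting reconnect
# Entries with a "(--> ...)" backtrace after the source tag do not match.
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6})\] "
    r"(?P<level>[A-Z]) \[(?P<src>[^\]]+)\] "
    r"(?P<xlator>[\w.-]+): (?P<msg>.*)$"
)


def reconnect_intervals(lines):
    """Map each client translator to the seconds between its reconnect attempts."""
    attempts = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("msg") == "attempting reconnect":
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
            attempts[m.group("xlator")].append(ts)
    # Gap between consecutive attempts, rounded to milliseconds.
    return {
        xlator: [round((b - a).total_seconds(), 3) for a, b in zip(times, times[1:])]
        for xlator, times in attempts.items()
    }


# A few reconnect entries copied from the log above.
SAMPLE = """\
[2018-09-10 19:43:45.878885] T [rpc-clnt.c:406:rpc_clnt_reconnect] 0-vol1-client-2: attempting reconnect
[2018-09-10 19:43:48.881651] T [rpc-clnt.c:406:rpc_clnt_reconnect] 0-vol1-client-2: attempting reconnect
[2018-09-10 19:43:48.884056] T [rpc-clnt.c:406:rpc_clnt_reconnect] 0-vol3-client-1: attempting reconnect
[2018-09-10 19:43:51.884349] T [rpc-clnt.c:406:rpc_clnt_reconnect] 0-vol1-client-2: attempting reconnect
[2018-09-10 19:43:51.886644] T [rpc-clnt.c:406:rpc_clnt_reconnect] 0-vol3-client-1: attempting reconnect
"""

if __name__ == "__main__":
    print(reconnect_intervals(SAMPLE.splitlines()))
    # {'0-vol1-client-2': [3.003, 3.003], '0-vol3-client-1': [3.003]}
```

On this sample both translators retry roughly every 3 seconds, which matches the 19:43:42, :45, :48 and :51 cadence in the full log; the ping-timer entries in the same log name 10.xxx.xxx.130:24007 every time.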

On Tue, Sep 11, 2018 at 2:13 AM, <g.vasilopoulos@uoc.gr> wrote:
[...]
Yes, this is a gfapi + libvirt issue. See https://bugzilla.redhat.com/show_bug.cgi?id=1484660 for details.
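For comparison with the symptom reported in this thread, the fallback the original poster expected can be sketched in Python: try the primary volfile server, then each backup in order, and use the first one that answers. This is a hedged illustration, not how libgfapi, qemu or libvirt implement it. `tcp_reachable` and `pick_volfile_server` are hypothetical helpers, the host names g1, g2 and g3 come from the original post, and 24007 is the default port the log falls back to. Note that backup-volfile-servers only tells the client where else it may fetch the volume configuration from; it does not by itself change which bricks the client connects to afterwards.

```python
import socket
from typing import Callable, Iterable, Optional, Tuple

Server = Tuple[str, int]  # (host, port)


def tcp_reachable(server: Server, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection(server, timeout=timeout):
            return True
    except OSError:  # refused, timed out, or name resolution failed
        return False


def pick_volfile_server(
    primary: Server,
    backups: Iterable[Server],
    reachable: Callable[[Server], bool] = tcp_reachable,
) -> Optional[Server]:
    """Try the primary first, then each backup in order; return the first reachable one."""
    for server in [primary, *backups]:
        if reachable(server):
            return server
    return None  # no volfile server answered


if __name__ == "__main__":
    # Hypothetical layout for vol1: g1 is primary, g2 and g3 are backups.
    down = {("g1", 24007)}  # simulate g1 being in maintenance
    chosen = pick_volfile_server(
        ("g1", 24007),
        [("g2", 24007), ("g3", 24007)],
        reachable=lambda s: s not in down,
    )
    print(chosen)  # ('g2', 24007)
```

Passing `reachable` as a parameter keeps the selection logic testable without a live cluster; the `__main__` block simulates g1 being in maintenance instead of opening real connections.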
[...]
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-leave@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/LZ6HYGEQOPARSLOE64MJUZBML4XOLB5L/

On Tue, Sep 11, 2018 at 1:52 PM, Sahina Bose <sabose@redhat.com> wrote:
On Tue, Sep 11, 2018 at 2:13 AM, <g.vasilopoulos@uoc.gr> wrote:
[...]
Yes, this is a gfapi + libvirt issue. See https://bugzilla.redhat.com/show_bug.cgi?id=1484660 for details.
Sorry, wrong bug. See https://bugzilla.redhat.com/show_bug.cgi?id=1484227
Here are some libvirt logs showing what it tries to do:

[2018-09-10 19:43:42.876114] T [socket.c:3133:socket_connect] 0-vol1-client-2: connecting 0x55ed673525c0, state=2 gen=0 sock=-1
[2018-09-10 19:43:42.876124] T [name.c:243:af_inet_client_get_remote_sockaddr] 0-vol1-client-2: option remote-port missing in volume vol1-client-2. Defaulting to 24007
[2018-09-10 19:43:42.878566] D [socket.c:3051:socket_fix_ssl_opts] 0-vol1-client-2: disabling SSL for portmapper connection
[2018-09-10 19:43:42.878770] T [socket.c:834:__socket_nodelay] 0-vol1-client-2: NODELAY enabled for socket 30
[2018-09-10 19:43:42.878780] T [socket.c:920:__socket_keepalive] 0-vol1-client-2: Keep-alive enabled for socket: 30, (idle: 20, interval: 2, max-probes: 9, timeout: 0)
[2018-09-10 19:43:42.878830] T [rpc-clnt.c:406:rpc_clnt_reconnect] 0-vol3-client-1: attempting reconnect
[2018-09-10 19:43:42.878846] T [socket.c:3133:socket_connect] 0-vol3-client-1: connecting 0x55ed673546c0, state=2 gen=0 sock=-1
[2018-09-10 19:43:42.878856] T [name.c:243:af_inet_client_get_remote_sockaddr] 0-vol3-client-1: option remote-port missing in volume vol3-client-1. Defaulting to 24007
[2018-09-10 19:43:42.881229] D [socket.c:3051:socket_fix_ssl_opts] 0-vol3-client-1: disabling SSL for portmapper connection
[2018-09-10 19:43:42.881255] T [socket.c:834:__socket_nodelay] 0-vol3-client-1: NODELAY enabled for socket 38
[2018-09-10 19:43:42.881264] T [socket.c:920:__socket_keepalive] 0-vol3-client-1: Keep-alive enabled for socket: 38, (idle: 20, interval: 2, max-probes: 9, timeout: 0)
[2018-09-10 19:43:45.569298] T [socket.c:724:__socket_disconnect] 0-vol3-client-1: disconnecting 0x55ed673546c0, state=2 gen=0 sock=38
[2018-09-10 19:43:45.569308] T [socket.c:724:__socket_disconnect] 0-vol1-client-2: disconnecting 0x55ed673525c0, state=2 gen=0 sock=30
[2018-09-10 19:43:45.570000] T [socket.c:728:__socket_disconnect] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fddbadade9b] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x4ea0)[0x7fdda7bbfea0] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x530a)[0x7fdda7bc030a] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x9a08)[0x7fdda7bc4a08] (--> /lib64/libglusterfs.so.0(+0x883c4)[0x7fddbae093c4] ))))) 0-vol3-client-1: tearing down socket connection
[2018-09-10 19:43:45.570020] D [socket.c:686:__socket_shutdown] 0-vol3-client-1: shutdown() returned -1. Transport endpoint is not connected
[2018-09-10 19:43:45.570038] D [socket.c:733:__socket_disconnect] 0-vol3-client-1: __socket_teardown_connection () failed: Transport endpoint is not connected
[2018-09-10 19:43:45.570043] D [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-09-10 19:43:45.570907] T [socket.c:728:__socket_disconnect] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fddbadade9b] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x4ea0)[0x7fdda7bbfea0] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x530a)[0x7fdda7bc030a] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x9a08)[0x7fdda7bc4a08] (--> /lib64/libglusterfs.so.0(+0x883c4)[0x7fddbae093c4] ))))) 0-vol1-client-2: tearing down socket connection
[2018-09-10 19:43:45.570928] D [socket.c:686:__socket_shutdown] 0-vol1-client-2: shutdown() returned -1. Transport endpoint is not connected
[2018-09-10 19:43:45.570936] D [socket.c:733:__socket_disconnect] 0-vol1-client-2: __socket_teardown_connection () failed: Transport endpoint is not connected
[2018-09-10 19:43:45.570940] D [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-09-10 19:43:45.570960] D [rpc-clnt-ping.c:99:rpc_clnt_remove_ping_timer_locked] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fddbadade9b] (--> /lib64/libgfrpc.so.0(rpc_clnt_remove_ping_timer_locked+0x8b)[0x7fddbab7828b] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x5f)[0x7fddbab7460f] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7fddbab75130] (--> /lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fddbab70ea3] ))))) 0-: 10.xxx.xxx.130:24007: ping timer event already removed
[2018-09-10 19:43:45.571098] D [rpc-clnt-ping.c:99:rpc_clnt_remove_ping_timer_locked] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fddbadade9b] (--> /lib64/libgfrpc.so.0(rpc_clnt_remove_ping_timer_locked+0x8b)[0x7fddbab7828b] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x5f)[0x7fddbab7460f] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7fddbab75130] (--> /lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fddbab70ea3] ))))) 0-: 10.xxx.xxx.130:24007: ping timer event already removed
[2018-09-10 19:43:45.878885] T [rpc-clnt.c:406:rpc_clnt_reconnect] 0-vol1-client-2: attempting reconnect
[2018-09-10 19:43:45.881546] T [socket.c:834:__socket_nodelay] 0-vol1-client-2: NODELAY enabled for socket 38
[2018-09-10 19:43:45.881555] T [socket.c:920:__socket_keepalive] 0-vol1-client-2: Keep-alive enabled for socket: 38, (idle: 20, interval: 2, max-probes: 9, timeout: 0)
[2018-09-10 19:43:45.883839] D [socket.c:3051:socket_fix_ssl_opts] 0-vol3-client-1: disabling SSL for portmapper connection
[2018-09-10 19:43:45.883878] T [socket.c:834:__socket_nodelay] 0-vol3-client-1: NODELAY enabled for socket 30
[2018-09-10 19:43:45.883886] T [socket.c:920:__socket_keepalive] 0-vol3-client-1: Keep-alive enabled for socket: 30, (idle: 20, interval: 2, max-probes: 9, timeout: 0)
[2018-09-10 19:43:48.575316] T [socket.c:724:__socket_disconnect] 0-vol3-client-1: disconnecting 0x55ed673546c0, state=2 gen=0 sock=30
[2018-09-10 19:43:48.575329] T [socket.c:724:__socket_disconnect] 0-vol1-client-2: disconnecting 0x55ed673525c0, state=2 gen=0 sock=38
[2018-09-10 19:43:48.576022] T [socket.c:728:__socket_disconnect] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fddbadade9b] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x4ea0)[0x7fdda7bbfea0] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x530a)[0x7fdda7bc030a] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x9a08)[0x7fdda7bc4a08] (--> /lib64/libglusterfs.so.0(+0x883c4)[0x7fddbae093c4] ))))) 0-vol3-client-1: tearing down socket connection
[2018-09-10 19:43:48.576045] D [socket.c:686:__socket_shutdown] 0-vol3-client-1: shutdown() returned -1. Transport endpoint is not connected
[2018-09-10 19:43:48.576054] D [socket.c:733:__socket_disconnect] 0-vol3-client-1: __socket_teardown_connection () failed: Transport endpoint is not connected
[2018-09-10 19:43:48.576059] D [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-09-10 19:43:48.576079] T [socket.c:728:__socket_disconnect] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fddbadade9b] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x4ea0)[0x7fdda7bbfea0] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x530a)[0x7fdda7bc030a] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x9a08)[0x7fdda7bc4a08] (--> /lib64/libglusterfs.so.0(+0x883c4)[0x7fddbae093c4] ))))) 0-vol1-client-2: tearing down socket connection
[2018-09-10 19:43:48.576099] D [socket.c:686:__socket_shutdown] 0-vol1-client-2: shutdown() returned -1. Transport endpoint is not connected
[2018-09-10 19:43:48.576106] D [socket.c:733:__socket_disconnect] 0-vol1-client-2: __socket_teardown_connection () failed: Transport endpoint is not connected
[2018-09-10 19:43:48.576111] D [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-09-10 19:43:48.576879] D [rpc-clnt-ping.c:99:rpc_clnt_remove_ping_timer_locked] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fddbadade9b] (--> /lib64/libgfrpc.so.0(rpc_clnt_remove_ping_timer_locked+0x8b)[0x7fddbab7828b] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x5f)[0x7fddbab7460f] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7fddbab75130] (--> /lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fddbab70ea3] ))))) 0-: 10.xxx.xxx.130:24007: ping timer event already removed
[2018-09-10 19:43:48.576958] D [rpc-clnt-ping.c:99:rpc_clnt_remove_ping_timer_locked] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fddbadade9b] (--> /lib64/libgfrpc.so.0(rpc_clnt_remove_ping_timer_locked+0x8b)[0x7fddbab7828b] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x5f)[0x7fddbab7460f] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7fddbab75130] (--> /lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fddbab70ea3] ))))) 0-: 10.xxx.xxx.130:24007: ping timer event already removed
[2018-09-10 19:43:48.881651] T [rpc-clnt.c:406:rpc_clnt_reconnect] 0-vol1-client-2: attempting reconnect
[2018-09-10 19:43:48.881667] T [socket.c:3133:socket_connect] 0-vol1-client-2: connecting 0x55ed673525c0, state=2 gen=0 sock=-1
[2018-09-10 19:43:48.881689] T [name.c:243:af_inet_client_get_remote_sockaddr] 0-vol1-client-2: option remote-port missing in volume vol1-client-2. Defaulting to 24007
[2018-09-10 19:43:48.884056] T [rpc-clnt.c:406:rpc_clnt_reconnect] 0-vol3-client-1: attempting reconnect
[2018-09-10 19:43:48.884072] T [socket.c:3133:socket_connect] 0-vol3-client-1: connecting 0x55ed673546c0, state=2 gen=0 sock=-1
[2018-09-10 19:43:48.884084] T [name.c:243:af_inet_client_get_remote_sockaddr] 0-vol3-client-1: option remote-port missing in volume vol3-client-1. Defaulting to 24007
[2018-09-10 19:43:48.884190] D [socket.c:3051:socket_fix_ssl_opts] 0-vol1-client-2: disabling SSL for portmapper connection
[2018-09-10 19:43:48.886524] T [socket.c:834:__socket_nodelay] 0-vol3-client-1: NODELAY enabled for socket 30
[2018-09-10 19:43:48.886532] T [socket.c:920:__socket_keepalive] 0-vol3-client-1: Keep-alive enabled for socket: 30, (idle: 20, interval: 2, max-probes: 9, timeout: 0)
[2018-09-10 19:43:51.581293] T [socket.c:724:__socket_disconnect] 0-vol3-client-1: disconnecting 0x55ed673546c0, state=2 gen=0 sock=30
[2018-09-10 19:43:51.581293] T [socket.c:724:__socket_disconnect] 0-vol1-client-2: disconnecting 0x55ed673525c0, state=2 gen=0 sock=38
[2018-09-10 19:43:51.582009] T [socket.c:728:__socket_disconnect] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fddbadade9b] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x4ea0)[0x7fdda7bbfea0] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x530a)[0x7fdda7bc030a] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x9a08)[0x7fdda7bc4a08] (--> /lib64/libglusterfs.so.0(+0x883c4)[0x7fddbae093c4] ))))) 0-vol1-client-2: tearing down socket connection
[2018-09-10 19:43:51.582030] D [socket.c:686:__socket_shutdown] 0-vol1-client-2: shutdown() returned -1. Transport endpoint is not connected
[2018-09-10 19:43:51.582036] D [socket.c:733:__socket_disconnect] 0-vol1-client-2: __socket_teardown_connection () failed: Transport endpoint is not connected
[2018-09-10 19:43:51.582040] D [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-09-10 19:43:51.582084] T [socket.c:728:__socket_disconnect] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fddbadade9b] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x4ea0)[0x7fdda7bbfea0] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x530a)[0x7fdda7bc030a] (--> /usr/lib64/glusterfs/3.12.13/rpc-transport/socket.so(+0x9a08)[0x7fdda7bc4a08] (--> /lib64/libglusterfs.so.0(+0x883c4)[0x7fddbae093c4] ))))) 0-vol3-client-1: tearing down socket connection
[2018-09-10 19:43:51.582105] D [socket.c:686:__socket_shutdown] 0-vol3-client-1: shutdown() returned -1. Transport endpoint is not connected
[2018-09-10 19:43:51.582111] D [socket.c:733:__socket_disconnect] 0-vol3-client-1: __socket_teardown_connection () failed: Transport endpoint is not connected
[2018-09-10 19:43:51.582116] D [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-09-10 19:43:51.582812] D [rpc-clnt-ping.c:99:rpc_clnt_remove_ping_timer_locked] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fddbadade9b] (--> /lib64/libgfrpc.so.0(rpc_clnt_remove_ping_timer_locked+0x8b)[0x7fddbab7828b] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x5f)[0x7fddbab7460f] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7fddbab75130] (--> /lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fddbab70ea3] ))))) 0-: 10.xxx.xxx.130:24007: ping timer event already removed
[2018-09-10 19:43:51.582865] D [rpc-clnt-ping.c:99:rpc_clnt_remove_ping_timer_locked] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fddbadade9b] (--> /lib64/libgfrpc.so.0(rpc_clnt_remove_ping_timer_locked+0x8b)[0x7fddbab7828b] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x5f)[0x7fddbab7460f] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7fddbab75130] (--> /lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fddbab70ea3] ))))) 0-: 10.xxx.xxx.130:24007: ping timer event already removed
[2018-09-10 19:43:51.884349] T [rpc-clnt.c:406:rpc_clnt_reconnect] 0-vol1-client-2: attempting reconnect
[2018-09-10 19:43:51.884367] T [socket.c:3133:socket_connect] 0-vol1-client-2: connecting 0x55ed673525c0, state=2 gen=0 sock=-1
[2018-09-10 19:43:51.884376] T [name.c:243:af_inet_client_get_remote_sockaddr] 0-vol1-client-2: option remote-port missing in volume vol1-client-2. Defaulting to 24007
[2018-09-10 19:43:51.886644] T [rpc-clnt.c:406:rpc_clnt_reconnect] 0-vol3-client-1: attempting reconnect
[2018-09-10 19:43:51.886659] T [socket.c:3133:socket_connect] 0-vol3-client-1: connecting 0x55ed673546c0, state=2 gen=0 sock=-1
[2018-09-10 19:43:51.886669] T [name.c:243:af_inet_client_get_remote_sockaddr] 0-vol3-client-1: option remote-port missing in volume vol3-client-1. Defaulting to 24007
[2018-09-10 19:43:51.887251] D [socket.c:3051:socket_fix_ssl_opts] 0-vol1-client-2: disabling SSL for portmapper connection
[2018-09-10 19:43:51.887281] T [socket.c:834:__socket_nodelay] 0-vol1-client-2: NODELAY enabled for socket 38
[2018-09-10 19:43:51.887290] T [socket.c:920:__socket_keepalive] 0-vol1-client-2: Keep-alive enabled for socket: 38, (idle: 20, interval: 2, max-probes: 9, timeout: 0)
[2018-09-10 19:43:51.889141] D [socket.c:3051:socket_fix_ssl_opts] 0-vol3-client-1: disabling SSL for portmapper connection
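To see the retry pattern in a trace like the one above more clearly, the following Python sketch parses entries in the format shown (one entry per line) and reports, for each client translator, the gap between consecutive "attempting reconnect" entries. The regular expression is written for the lines quoted here and is an assumption about the log format, not an official Gluster parser. The helper keepalive_detection_seconds is likewise a hypothetical illustration of the usual Linux TCP keepalive arithmetic (idle time plus interval times probe count) applied to the values logged by __socket_keepalive; it only applies to an established connection that goes silent, not to the failed connection attempts shown here.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed entry format, for example:
# [2018-09-10 19:43:45.878885] T [rpc-clnt.c:406:rpc_clnt_reconnect] 0-vol1-client-2: attempting reconnect
# Backtrace entries ("(--> ...") do not match and are skipped.
ENTRY = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6})\] "
    r"(?P<level>[A-Z]) \[(?P<src>[^\]]+)\] (?P<xlator>\S+): (?P<msg>.*)"
)


def reconnect_intervals(lines):
    """Map each client xlator to the seconds between its reconnect attempts."""
    attempts = defaultdict(list)
    for line in lines:
        m = ENTRY.match(line.strip())
        if m and m.group("msg").startswith("attempting reconnect"):
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
            attempts[m.group("xlator")].append(ts)
    # Pair each attempt with the next one and take the time difference.
    return {
        xlator: [round((b - a).total_seconds(), 3) for a, b in zip(times, times[1:])]
        for xlator, times in attempts.items()
    }


def keepalive_detection_seconds(idle, interval, probes):
    """Approximate time for TCP keepalive to declare a silent peer dead."""
    return idle + interval * probes
```

Fed the three vol1-client-2 reconnect entries above, it reports gaps of about 3.003 s, i.e. the client translator keeps retrying the same unreachable brick address every few seconds. With the logged keepalive values (idle 20, interval 2, max-probes 9), the keepalive bound works out to 20 + 2 * 9 = 38 s.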

I know about this bug, but this is not what I experienced. The virtual machine was up and running; I did not try to start it.

OK, I think I figured out what is happening. I am currently running some redundancy tests on oVirt + GlusterFS replica 2 + arbiter. The problem shows up under a small-block (4k) fio random-write test like this one:

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=32 --size=5G --readwrite=randwrite

I have RAID 6 on all three gluster servers with a capacitor-backed RAID cache, but it seems that write-back cache is not a good option for VMs: when a gluster node went down, the VM became non-responsive and never recovered, and at some point the whole virtual disk became a mess and had to be deleted. Changing the RAID controller cache to write-through and setting the gluster volume option network.ping-timeout to 10 seconds seems to improve things.

Still, large shard block sizes (128MB and above) sometimes put the VM into a paused-like state such as this:

kernel:NMI watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [fio:3232]

Small shard block sizes, while slower, handle the situation a lot better: almost no pausing of the VM and consistent performance. After quite a lot of testing, the only "safe" option seems to be a shard block size of 8MB with all disks on one volume. Larger shard block sizes, and disks spread over several volumes, seem to leave the VM non-responsive and the disks trashed. This setup gives me the maximum reliability I can get with libgfapi.
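The shard-size trade-off can be made concrete with a little arithmetic. The sketch below counts how many shard files back the 5G fio test file for the two shard block sizes mentioned above. It assumes fio's 5G means 5 GiB and that the whole file is allocated; the helper name shard_count is illustrative only. It does not explain why smaller shards behaved better here, which remains an empirical observation from these tests.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def shard_count(file_size, shard_block_size):
    """Number of shards needed to hold file_size bytes (ceiling division)."""
    return -(-file_size // shard_block_size)


if __name__ == "__main__":
    test_file = 5 * GIB  # assumption: fio's --size=5G is 5 GiB
    for size_mib in (8, 128):
        n = shard_count(test_file, size_mib * MIB)
        print(f"{size_mib} MiB shards -> {n} shards")
```

With 8 MiB shards the 5 GiB file is spread over 640 shard files instead of 40. Assuming self-heal works shard by shard, each shard that has to be healed after a brick returns covers 16 times less data.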
Participants (2):
- g.vasilopoulos@uoc.gr
- Sahina Bose