Hi Andrea,
My guess is that while node2 was in maintenance, node3's brick(s) died, or there were some pending heals.
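A quick way to rule out pending heals before (and after) putting a node into maintenance is a loop like this (just a sketch; every brick should report zero pending entries):

for i in $(gluster volume list); do gluster volume heal $i info summary ; done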
For backup, you can use anything that works for KVM, but the hard part is to get the configuration of each VM. If the VM is running, you can use 'virsh dumpxml domain' to get the configuration of the running VM, but this won't work for VMs that are off.
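For example, a minimal sketch that saves the configuration of every running VM on a host (the /backup destination is only an illustration):

for vm in $(virsh -r list --name); do
  virsh -r dumpxml "$vm" > "/backup/$vm.xml"   # configuration only; the disk images still live on the storage domain
done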
Why firewalld was not started again after the reboot - my guess is a rare bug that is hard to reproduce.
Best Regards,
Strahil Nikolov
On Mar 26, 2019 17:10, Andrea Milan <commramius(a)tiscali.it> wrote:
Hi Sahina, Strahil
thank you for the information, I managed to start the heal and restore both the hosted engine and the VMs.
These are the logs on all nodes:
[2019-03-26 08:30:58.462329] I [MSGID: 104045] [glfs-master.c:91:notify] 0-gfapi: New graph 676c6e6f-6465-3032-2e61-736370642e6c (0) coming up
[2019-03-26 08:30:58.462364] I [MSGID: 114020] [client.c:2356:notify] 0-asc-client-0: parent translators are ready, attempting connect on transport
[2019-03-26 08:30:58.464374] I [MSGID: 114020] [client.c:2356:notify] 0-asc-client-1: parent translators are ready, attempting connect on transport
[2019-03-26 08:30:58.464898] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-asc-client-0: changing port to 49438 (from 0)
[2019-03-26 08:30:58.466148] I [MSGID: 114020] [client.c:2356:notify] 0-asc-client-3: parent translators are ready, attempting connect on transport
[2019-03-26 08:30:58.468028] E [socket.c:2309:socket_connect_finish] 0-asc-client-0: connection to 192.170.254.3:49438 failed (Nessun instradamento per l'host)
[2019-03-26 08:30:58.468054] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-asc-client-1: changing port to 49441 (from 0)
[2019-03-26 08:30:58.470040] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-asc-client-3: changing port to 49421 (from 0)
[2019-03-26 08:30:58.471345] I [MSGID: 114057] [client-handshake.c:1440:select_server_supported_programs] 0-asc-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2019-03-26 08:30:58.472642] I [MSGID: 114046] [client-handshake.c:1216:client_setvolume_cbk] 0-asc-client-1: Connected to asc-client-1, attached to remote volume '/bricks/asc/brick'.
[2019-03-26 08:30:58.472659] I [MSGID: 114047] [client-handshake.c:1227:client_setvolume_cbk] 0-asc-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2019-03-26 08:30:58.472714] I [MSGID: 108005] [afr-common.c:4387:afr_notify] 0-asc-replicate-0: Subvolume 'asc-client-1' came back up; going online.
[2019-03-26 08:30:58.472731] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-asc-client-1: Server lk version = 1
[2019-03-26 08:30:58.473112] E [socket.c:2309:socket_connect_finish] 0-asc-client-3: connection to 192.170.254.6:49421 failed (Nessun instradamento per l'host)
[2019-03-26 08:30:58.473152] W [MSGID: 108001] [afr-common.c:4467:afr_notify] 0-asc-replicate-0: Client-quorum is not met
[2019-03-26 08:30:58.477699] I [MSGID: 108031] [afr-common.c:2157:afr_local_discovery_cbk] 0-asc-replicate-0: selecting local read_child asc-client-1
[2019-03-26 08:30:58.477804] I [MSGID: 104041] [glfs-resolve.c:885:__glfs_active_subvol] 0-asc: switched to graph 676c6e6f-6465-3032-2e61-736370642e6c (0)
I analyzed the individual nodes and realized that the firewalld service had been stopped on all of them.
Once firewalld was re-enabled, the heal started automatically, and "gluster volume heal VOLNAME info" immediately showed the correct connections.
The recovery of the individual bricks started immediately.
When that finished, I was able to correctly detect and start the hosted engine.
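In practice the fix on each node was essentially the following (assuming firewalld is managed by systemd; VOLNAME stands for each volume):

systemctl start firewalld          # the service had been found stopped on every node
gluster volume heal VOLNAME info   # every brick then reports "Status: Connected"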
I wanted to tell you about the sequence that led to the block:
1) Node03 put into maintenance via hosted-engine.
2) Maintenance performed and the node restarted.
3) Node03 set back to active.
4) Automatic heal monitored with the oVirt Manager.
5) Heal completed correctly.
6) Node02 put into maintenance.
7) During the shutdown of Node02 some VMs went into Pause, the oVirt Manager reported that Node01 was blocked, and the hosted engine stopped immediately.
8) After restarting Node02, I saw that gluster still had its peers connected but there was no healing between the nodes.
I had to shut everything down, and the situation that resulted was the one described in my previous emails.
Questions:
- Why did putting Node02 into maintenance block Node01?
- Why did restarting the system not restart the firewalld service? Is it also managed by vdsm?
- What is the correct way to back up virtual machines to an external machine? We use oVirt 4.1.
- Can backups be made outside of oVirt, e.g. with standard qemu-kvm?
Thanks for everything.
Best regards
Andrea Milan
On 25.03.2019 11:53, Sahina Bose wrote:
>
> You will first need to restore connectivity between the gluster peers
> for heal to work. So restart glusterd on all hosts as Strahil
> mentioned, and check if "gluster peer status" returns the other nodes
> as connected. If not, please check the glusterd log to see what's
> causing the issue. Share the logs if we need to look at it, along with
> the version info.
>
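A minimal sketch of those checks on a single node (assuming glusterd is managed by systemd):

systemctl restart glusterd   # restart the gluster management daemon
gluster peer status          # the other nodes should show "Peer in Cluster (Connected)"
gluster volume status        # the bricks and their ports should be listed as online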
> On Sun, Mar 24, 2019 at 1:08 AM Strahil <hunter86_bg(a)yahoo.com> wrote:
>>
>> Hi Andrea,
>> The cluster volumes might have sharding enabled, and thus files larger than the shard size can be recovered only via the cluster.
>> You can try to restart gluster on all nodes and force a heal:
>> 1. Kill the gluster processes:
>> systemctl stop glusterd
>> /usr/share/glusterfs/scripts/stop-all-gluster-processes.sh
>> 2. Start gluster:
>> systemctl start glusterd
>> 3. Force the heal:
>> for i in $(gluster volume list); do gluster volume heal $i full ; done
>> sleep 300
>> for i in $(gluster volume list); do gluster volume heal $i info summary ; done
>> Best Regards,
>> Strahil Nikolov
>>
>> On Mar 23, 2019 13:51, commramius(a)tiscali.it wrote:
>>> During maintenance of a machine the hosted engine crashed.
>>> At that point there was no more chance of managing anything.
>>>
>>> The VMs went into pause and were no longer manageable.
>>> I restarted the machine, but at one point all the bricks were no longer reachable.
>>>
>>> Now I am in a situation where the hosted engine is no longer loaded.
>>>
>>> Gluster sees the peers connected and the services running for the various bricks, but it fails to heal. The messages that I find for each machine are the following:
>>>
>>> # gluster volume heal engine info
>>> Brick 192.170.254.3:/bricks/engine/brick
>>> .
>>> .
>>> .
>>> Status: Connected
>>> Number of entries: 190
>>>
>>> Brick 192.170.254.4:/bricks/engine/brick
>>> Status: Il socket di destinazione non è connesso
>>> Number of entries: -
>>>
>>> Brick 192.170.254.6:/bricks/engine/brick
>>> Status: Il socket di destinazione non è connesso
>>> Number of entries: -
>>>
>>> This is the case for all the bricks (some have no heal to do because the machines on them were turned off).
>>>
>>> In practice each brick sees only localhost as connected.
>>>
>>> How can I restore the machines?
>>> Is there a way to read the data from the physical machine and export it so that it can be reused?
>>> Unfortunately we need to access that data.
>>>
>>> Can someone help me?
>>>
>>> Thanks
>>> Andrea