(sorry for double post earlier)
On 25/11/2018 at 18:19, Nir Soffer wrote:
Hello,
Our cluster uses NFS storage (on each node for data, and on a NAS for
export).
Looks like you invented your own hyperconverged solution.
Well, when we looked at the network requirements for Gluster to work, we
initially thought this was the more reasonable solution. That may be
true for the VMs, but it is deeply wrong for the engine in any case.
Regarding the engine, we thought (once again) that the backup would
allow us to rebuild in case of problems. Wrong again: the backup won't
work with a new installation, and even if it looks like it works for
things that haven't changed, it doesn't.
The NFS server on the node will always be accessible on the node, but if
it is
not accessible from other hosts, the entire DC may go down.
Also, you probably don't have any redundancy, so the failure of a single
node causes downtime and data loss.
We have RAID and regular backups, but that's not good
enough, of course.
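For what it's worth, a pre-flight check like the sketch below would have made the single point of failure obvious earlier. All hostnames and export paths in it are placeholders, not our real ones; the real check is left as a comment.

```shell
#!/bin/sh
# Sketch: confirm every host can reach every NFS export before relying on
# it as a data domain. Hostnames and exports below are placeholders.
hosts="node1 node2 node3"
exports="node1:/data node2:/data nas:/export"
checks=0
for h in $hosts; do
  for e in $exports; do
    # A real check would be something like:
    #   ssh "$h" "showmount -e ${e%%:*}" >/dev/null || echo "FAIL: $e from $h"
    echo "would check $e from $h"
    checks=$((checks + 1))
  done
done
echo "$checks reachability checks planned"
```

If any host cannot see any export, that export should not back a data domain.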
Yes, we will try to rebuild our cluster with that, at least for the engine.
We currently have a huge problem on our cluster, and exporting VMs as
OVA takes a lot of time.
Is the huge problem the slow export to OVA, or is there another problem?
At this time, most of our data is covered by a backup or an OVA, but
nearly all VMs are down and can't be restarted (I explain why at the end
of the mail). Some OVA exports failed, probably (but not only) because
of a lack of free space on the storage. So we have to rebuild a clean
cluster without destroying the data we still have on our disks. (The
import domains function should do the trick, I hope.)
I checked on the NFS storage, and it looks like there are files that
might be OVAs in the images directory,
I don't know the OVA export code, but I'm sure it does not save into the
images directory. It probably creates a temporary volume for preparing
and storing the exported OVA file.
Arik, how do you suggest debugging the slow export to OVA?
That might not be a bug; some of them are hundreds of GB, so it can be
"normal". Anyway, the files don't have the .ova extension (but the sizes
match the VMs), and there are OVF files in the master/vms directory.
OVF files should be stored in OVF_STORE disks. Maybe you are seeing
files created by an old version?
Well, they were files with the .ovf extension on NFS, but I might be
wrong; it's this type of path:
c3c17a66-52e8-42dc-9c09-3e667e4c7290/master/vms/0265bf6b-7bbb-44b1-87f6-7cb5552d12c2/0265bf6b-7bbb-44b1-87f6-7cb5552d12c2.ovf
but that may only be on export domains.
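For reference, this is my understanding of the layout "import domain" expects on an export domain. The sketch below just rebuilds the quoted path in a throwaway temp directory (the UUIDs are the ones from the path above) so the lookup command is safe to demonstrate; it does not touch real storage.

```shell
#!/bin/sh
# Sketch of an export domain layout, rebuilt in a temp dir (UUIDs taken
# from the path quoted above) so the lookup can be shown safely.
ROOT=$(mktemp -d)
DOM="$ROOT/c3c17a66-52e8-42dc-9c09-3e667e4c7290"
VM="0265bf6b-7bbb-44b1-87f6-7cb5552d12c2"
mkdir -p "$DOM/master/vms/$VM" "$DOM/images"
: > "$DOM/master/vms/$VM/$VM.ovf"

# On a real export domain, this lists one OVF descriptor per exported VM:
found=$(find "$DOM/master/vms" -name '*.ovf' | wc -l)
echo "$found OVF descriptor(s) found"
rm -rf "$ROOT"
```

Disk images themselves live under images/, one directory per disk.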
A little after my mail, and as you mention below, I heard about the
"import domain" function for storage domains, which makes me hope my
mail was pointless. I'll try it in a few hours with a real VM inside.
Can someone tell me if, when I install a new engine,
there's a way to get these VMs back inside the new engine (with the
import tools, for example)?
What do you mean by "get these VMs back"?
If you mean importing all VMs and disks on storage to a new engine,
yes, it should work. This is the basis for oVirt DR support.
Yes, thank you.
PS: the documentation should say to NEVER use a backup of an engine when
it is in an NFS storage domain on a node. It looks like it's working,
but all the data is out of sync with reality.
Do you mean hosted engine stored on NFS storage domain served by
one of the nodes?
Yes
Can you give more details on this problem?
ovirt version 4.2.7
I'll try to make it short, but it's a week's worth of stress and wrong
decisions.
We built our cluster with a few nodes, but all our storage is on the
nodes (the reason we chose NFS). And we put our engine on one of these
nodes in an NFS share. We had regular backups. One day I saw that the
status of this node was degraded (in nodectl check), and it recommended
running lvs to check.
[ A small detail, if needed: the node has three disks merged in a
hardware RAID 5. The node installation used the standard oVirt
partitioning except for one thing: we shrank the / partition (no size
problem, it's more than 100 GB) to make a separate XFS partition to
store the VM data; this partition holds the shares for the engine, data
(VMs) and ISO (export is on a NAS). ]
When I checked with lvs, the data partition was 99.97% used (!) while df
said 55% (spoiler alert: df was right, but who cares).
A few days later, it wasn't 99.97% but 99.99% (after a log collector
run, love the irony), and the whole node crashed, with the engine on it,
of course.
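For anyone hitting the same thing: my understanding (an assumption, I never confirmed it on our nodes) is that on an LVM thin pool, lvs data% counts every block ever written to the thin LV while df counts live file data, so deleted files keep the pool "full" until the blocks are discarded. A rough sketch with the figures from above:

```shell
#!/bin/sh
# Sketch: why lvs and df can disagree on a thin LV (assumption: the data
# partition sits on an LVM thin pool). lvs data% counts every block ever
# written; df counts live file data, so deleted files keep the pool full
# until the blocks are trimmed.
pool_pct=99.97   # what lvs reported
fs_pct=55        # what df reported
gap=$(awk -v p="$pool_pct" -v f="$fs_pct" 'BEGIN { printf "%.2f", p - f }')
echo "~${gap}% of the pool may be held by deleted-but-untrimmed blocks"

# On the node itself (do not run blindly on a degraded pool):
#   lvs -o lv_name,data_percent,pool_lv   # thin pool view
#   df -h /mountpoint                     # filesystem view
#   fstrim -v /mountpoint                 # hand freed blocks back to the pool
```

If that guess is right, a periodic fstrim (or mounting with the discard option) would keep the two numbers close.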
I restarted the cluster on another node, without too much trouble. Then
I looked at how to repair the node where the engine was stored.
It seemed there was no real solution to clean up the LVs (if that's even
what we should have done), so I decided to rebuild the whole node and
reinject the backup into it (it seems it's impossible to move the engine
once it's stored).
Well, I did that, and it has been hell since. It looks like the old
engine, but there's no control over the VMs, and a few things point to
non-existent parts (like the engine storage and the engine VM). Mostly I
get tons of "command Get Host Capabilities failed: General SSLEngine
problem" errors on the rebuilt node, but from what I see, VM management
is completely gone on all nodes. Worst of all, I only realised this
after some of the VMs crashed, without being able to restart them.
(I could try virsh, but it might take more time than doing things
cleanly.)
If I had rebuilt the cluster without the backup, I think I might have
been able to rebuild it quickly, but I might be wrong again.
One last thing: it seems the "lvs vs df" discrepancy is now happening on
other nodes too. Since we can't fully glusterify all VMs (some of them
are more than 0.5 TB, and the network won't be able to keep up), we're
thinking of keeping them on NFS, but on a partition outside of LVM,
created after the oVirt node installation (since we don't know exactly
what the problem is or how to solve it). If the problem is interesting
to some of you, I still have live examples (but maybe not for long; we
have to rebuild this thing quickly). And of course I will welcome any
advice or tests you might have.
Finally, I'm really sorry, of course, both for the question and for not
doing things exactly by the book (we have to work with the hardware we
have), and more than anything for missing some fairly obvious functions
(like import domain, of course). If I have time later, I'll try to write
an article about noob mistakes not to make on oVirt :)
Anyway, thank you for your answer.
Please also specify the oVirt version you use.
Nir
regards,
Alexis Grillon
Pôle Humanités Numériques, Outils, Méthodes et Analyse de Données
Maison européenne des sciences de l'homme et de la société
MESHS - Lille Nord de France / CNRS
tel. +33 (0)3 20 12 58 57 | alexis.grillon(a)meshs.fr
www.meshs.fr | 2, rue des Canonniers 59000 Lille
GPG fingerprint AC37 4C4B 6308 975B 77D4 772F 214F 1E97 6C08 CD11