I forgot to mention that the LVM config has to be modified in order to 'inform' the local
LVM stack to rely on clvmd/dlm for locking purposes.
Yet, this brings another layer of complexity which I prefer to avoid, so I use HA-LVM
on my Pacemaker clusters.
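
For reference, a minimal sketch of that change, assuming the legacy clvmd stack (settings and paths may differ on newer LVM versions that use lvmlockd instead):

    # /etc/lvm/lvm.conf (assumed default path) - switch from local to clustered locking
    global {
        locking_type = 3    # 3 = clustered locking via clvmd (1 = local file-based locking)
    }
    # then start the cluster locking daemons, e.g.:
    # systemctl start dlm clvmd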
@Martin,
Check the link from Benny and if possible check if the 2 cases are related.
Best Regards,
Strahil Nikolov

On Jul 24, 2019 11:07, Benny Zlotnik <bzlotnik(a)redhat.com> wrote:
We have seen something similar in the past and patches were posted to deal with this
issue, but it's still in progress[1]
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1553133
On Mon, Jul 22, 2019 at 8:07 PM Strahil <hunter86_bg(a)yahoo.com> wrote:
>
> I have a theory... but without any proof it will remain just a theory.
>
> The storage volumes are just VGs over shared storage. The SPM host is supposed to be
> the only one working with the LVM metadata, but I have observed that when someone
> executes a simple LVM command (for example lvs, vgs or pvs) on one host while another
> operation is running on another host, your metadata can get corrupted, due to the lack of clvmd.
>
> As a protection, I can suggest that you try the following solution (a rough shell sketch of steps 3-5 follows after the list):
> 1. Create new iSCSI lun
> 2. Share it to all nodes and create the storage domain. Set it to maintenance.
> 3. Start dlm & clvmd services on all hosts
> 4. Convert the VG of your shared storage domain to carry the 'clustered' flag:
>    vgchange -c y mynewVG
> 5. Check the lvs of that VG.
> 6. Activate the storage domain.
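>
> A rough sketch of steps 3-5 (service names are assumed and may differ by distribution):
>    # 3. start the cluster locking daemons on every host
>    systemctl start dlm clvmd
>    # 4. mark the storage domain's VG as clustered
>    vgchange -c y mynewVG
>    # 5. verify the LVs of that VG are still visible
>    lvs mynewVG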
>
> Of course, test it on a test cluster before implementing it in production.
> This is one of the approaches used in Linux HA clusters to avoid LVM metadata corruption.
>
> Best Regards,
> Strahil Nikolov
>
> On Jul 22, 2019 15:46, Martijn Grendelman <Martijn.Grendelman(a)isaac.nl> wrote:
>>
>> Hi,
>>
>> On 22-7-2019 at 14:30, Strahil wrote:
>>>
>>> If you can give directions (some kind of history), the devs might try to
>>> reproduce this type of issue.
>>>
>>> If it is reproducible - a fix can be provided.
>>>
>>> Based on my experience, when something as widely used as Linux LVM gets broken, the
>>> case is very hard to reproduce.
>>
>>
>> Yes, I'd think so too, especially since this activity (online moving of disk
>> images) is done all the time, mostly without problems. In this case, there was a lot of
>> activity on all storage domains, because I'm moving all my storage (> 10TB in 185
>> disk images) to a new storage platform. During the online move of one of the images, the
>> metadata checksum became corrupted and the storage domain went offline.
>>
>> Of course, I could dig up the engine logs and vdsm logs of when it happened, but
>> that would be some work and I'm not very confident that the actual cause would be in
>> there.
>>
>> If any oVirt devs are interested in the logs, I'll provide them, but
>> otherwise I think I'll just see it as an incident and move on.
>>
>> Best regards,
>> Martijn.
>>
>>
>>
>>
>> On Jul 22, 2019 10:17, Martijn Grendelman <Martijn.Grendelman(a)isaac.nl> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Thanks for the tips! I didn't know about 'pvmove', thanks.
>>>>
>>>> In the meantime, I managed to get it fixed by restoring the VG metadata
>>>> on the iSCSI server, i.e. on the underlying Zvol directly, rather than via the iSCSI session
>>>> on the oVirt host. That allowed me to perform the restore without bringing all VMs down,
>>>> which was important to me, because if I had to shut down VMs, I was sure I wouldn't be
>>>> able to restart them before the storage domain was back online.
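>>>>
>>>> For reference, a rough sketch of that kind of restore (the VG name and archive path are placeholders):
>>>>    # list the available metadata backups/archives for the VG
>>>>    vgcfgrestore --list myVG
>>>>    # restore the metadata from a chosen archive file
>>>>    vgcfgrestore -f /etc/lvm/archive/myVG_00042-1234567890.vg myVG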
>>>>
>>>> Of course this is more of a Linux problem than an oVirt problem, but oVirt
>>>> did cause it ;-)
>>>>
>>>> Thanks,
>>>> Martijn.
>>>>
>>>>
>>>>
>>>> On 19-7-2019 at 19:06, Strahil Nikolov wrote:
>>>>>
>>>>> Hi Martin,
>>>>>
>>>>> First, check what went wrong with the VG - it could be something
>>>>> simple.
>>>>> vgcfgbackup -f <file> VGname will create a file which you can use to compare the
>>>>> current metadata with a previous version.
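>>>>>
>>>>> A rough sketch of that comparison (paths are placeholders; /etc/lvm/backup is the usual default backup location):
>>>>>    # dump the current metadata of the VG to a file
>>>>>    vgcfgbackup -f /tmp/VGname.now VGname
>>>>>    # compare it against the last automatic backup
>>>>>    diff -u /etc/lvm/backup/VGname /tmp/VGname.now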
>>>>>
>>>>> If you have Linux boxes - you can add disks from another storage an