Hi Nir

 

Yes, the metadata was corrupted but the VMs were running OK. The master storage domain's allocation increased significantly overnight, it ran out of space, and the domain went completely offline. The cluster stayed online and the VMs kept running, but the affected storage domain remained offline. I tried to increase the storage domain size, but oVirt would not allow me to extend the storage.

 

Due to time constraints, I restored the storage domain from a Compellent snapshot. However, we need to prevent this from happening again when the master storage domain fills up. Currently, we have the following parameters set on the 5 TB storage domain.

 

ID: 0e1f2a5d-a548-476c-94bd-3ab3fe239926

Size: 5119 GiB

Available: 2361 GiB

Used: 2758 GiB

Allocated: 3104 GiB

Over Allocation Ratio: 14%

Images: 13

Warning Low Space Indicator: 10% (511 GiB)

Critical Space Action Blocker: 5 GiB
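
As an interim safeguard, the numbers above can be checked periodically and compared against the warning threshold before the domain fills up. A minimal sketch follows; the values are hard-coded from the UI figures above, and in practice they would be fetched from the oVirt REST API (GET /ovirt-engine/api/storagedomains/<id>) -- the alerting action is left as an assumption:

```shell
# Free-space threshold check for the storage domain, using the figures above.
# In a real cron job, size/available would come from the oVirt REST API.
size_gib=5119
avail_gib=2361
warn_pct=10   # matches the "Warning Low Space Indicator" setting

pct_free=$(( avail_gib * 100 / size_gib ))
if [ "$pct_free" -le "$warn_pct" ]; then
    echo "WARNING: storage domain below ${warn_pct}% free (${pct_free}% free)"
else
    echo "OK: ${pct_free}% free"
fi
```

With the current figures this prints "OK: 46% free"; once the free percentage drops to the 10% warning level, the WARNING branch fires and the echo can be replaced with a mail or monitoring hook.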

 

Please advise what actions we need to implement to prevent this from occurring again in the future.

 

Thanks

 

From: Nir Soffer <nsoffer@redhat.com>
Sent: 10 June 2019 22:07
To: David Teigland <teigland@redhat.com>
Cc: Aminur Rahman <aminur.rahman@iongroup.com>; users <users@ovirt.org>
Subject: Re: [ovirt-users] Failed to activate Storage Domain --- ovirt 4.2

 

On Mon, Jun 10, 2019 at 11:22 PM David Teigland <teigland@redhat.com> wrote:

On Mon, Jun 10, 2019 at 10:59:43PM +0300, Nir Soffer wrote:
> > [root@uk1-ion-ovm-18  pvscan
> >   /dev/mapper/36000d310056978000000000000000014: Checksum error at offset
> > 4397954425856
> >   Couldn't read volume group metadata from
> > /dev/mapper/36000d310056978000000000000000014.
> >   Metadata location on /dev/mapper/36000d310056978000000000000000014 at
> > 4397954425856 has invalid summary for VG.
> >   Failed to read metadata summary from
> > /dev/mapper/36000d310056978000000000000000014
> >   Failed to scan VG from /dev/mapper/36000d310056978000000000000000014
>
> This looks like corrupted vg metadata.

Yes, the second metadata area, at the end of the device is corrupted; the
first metadata area is probably ok.  That version of lvm is not able to
continue by just using the one good copy. 

 

Can we copy the first metadata area into the second metadata area?

 

Last week I pushed out major changes to LVM upstream to be able to handle
and repair most of these cases.  So, one option is to build lvm from the
upstream master branch, and check if that can read and repair this
metadata.

 

This sounds pretty risky for production.

 

> David, we keep 2 metadata copies on the first PV. Can we use one of the
> copies on the PV to restore the metadata to the least good state?

pvcreate with --restorefile and --uuid, and with the right backup metadata

 

What would be the right backup metadata?

 

could probably correct things, but experiment with some temporary PVs
first.
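
The restore procedure David is describing might look like the following. Everything here is a dry-run sketch that only prints the commands: the VG name, backup file path, and PV UUID are placeholders to be taken from the vgcfgbackup file, and as David says, it should be exercised on scratch PVs before touching the real device:

```shell
# Dry-run sketch of the pvcreate/vgcfgrestore recovery described above.
# VG name, backup path, and PV UUID are placeholders -- do NOT run the
# printed commands against production without testing on throwaway PVs.
vg=myvg                                        # placeholder VG name
backup=/etc/lvm/backup/$vg                     # usual vgcfgbackup location
uuid=XXXXXX-XXXX-XXXX-XXXX-XXXX-XXXX-XXXXXX   # PV UUID from the backup file
dev=/dev/mapper/36000d310056978000000000000000014

# Recreate the PV label with its original UUID from the backed-up metadata:
echo pvcreate --restorefile "$backup" --uuid "$uuid" "$dev"
# Then restore the VG metadata itself from the same backup:
echo vgcfgrestore -f "$backup" "$vg"
```

Dropping the `echo`s turns the sketch into the destructive version; the point of the dry run is to eyeball the exact commands before committing.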

 

Aminur, can you copy and compress the metadata areas, and share them somewhere?

 

To copy the first metadata area, use:

 

dd if=/dev/mapper/360014058ccaab4857eb40f393aaf0351 of=md1 bs=128M count=1 skip=4096 iflag=skip_bytes

 

To copy the second metadata area, you need to know the size of the PV. On my setup with 100G

PV, I have 800 extents (128M each), and this works:

 

dd if=/dev/mapper/360014058ccaab4857eb40f393aaf0351 of=md2 bs=128M count=1 skip=799
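
The skip value can also be computed from the device size instead of counting extents by hand. A sketch, assuming the PV size is an exact multiple of the 128M extent size and `blockdev` is available:

```shell
# Compute the dd skip for the last 128M extent (where the second metadata
# area lives) from the device size, rather than counting extents manually.
dev=/dev/mapper/360014058ccaab4857eb40f393aaf0351
extent=$((128 * 1024 * 1024))
size=$(blockdev --getsize64 "$dev")   # PV size in bytes
skip=$(( size / extent - 1 ))         # index of the last extent
dd if="$dev" of=md2 bs=128M count=1 skip="$skip"
```

For the 100G example above, 100G / 128M = 800 extents, so the computed skip is 799, matching the hand-counted value.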

 

gzip md1 md2

 

Nir