On Tue, 10 Jun 2014, Paul Heinlein wrote:
> I'm running oVirt Engine 3.2.0-2.fc18 (which I know is out of date)
> on a dedicated physical host; we have 12 hosts split between two
> clusters and nine storage domains, all NFS.
>
> Late last week, a VM that in the scope of our clusters consumes a
> lot of resources failed in migration. Since then the storage domains
> have, from the engine's point of view, been going up and down (though
> the underlying NFS exports are fine). Key symptoms from the oVirt
> Manager:
>
> * two of the storage domains are always marked as having the type
>   "Data (Master)" when historically only one was;
>
> * the Manager reports "Storage Pool Manager runs on $host" then
>   "Sync Error on Master Domain..." then "Reconstruct Master Domain
>   ...completed" then "Data Center is being initialized" over and
>   over again.
>
> The Sync Error messages indicate "$pool is marked as Master in oVirt
> Engine database but not on the Storage side. Please consult with
> Support on how to fix this issue." Note that $pool changes between
> the various domains that get marked as Data (Master).
For the record, here's what I ended up doing to solve the issue. It
took quite a while to identify the moving parts, but the actual fix
wasn't terribly hard.
A secondary goal was to keep all VMs running during the process,
since we had ongoing benchmarking we didn't want to interrupt.
I'll say ahead of time that I didn't like mucking around in postgres,
but all my attempts to fix the issue without doing so came to naught.
1. Disable power management on all Hosts: necessary because Hosts
would be rebooted if the Engine couldn't contact vdsmd after
a relatively short period of time.
2. Shut down ovirt-engine.
3. Shut down vdsmd on all Hosts.
4. Query PostgreSQL for the current master version:
SELECT master_domain_version FROM storage_pool;
5. Determine which storage domain has the metadata file corresponding
to that master version number; it will become the new master.
find /rhev/... -type f -name metadata -ls -exec grep ^MASTER {} \;
6. Back up all dom_md directories (with metadata files) and master
directories (with *.ovf files).
7. Run database queries to demote all existing Data (Master) domains
and then promote the domain chosen in step 5:
-- make all Data (master) domains into regular Data domains
UPDATE storage_domain_static
SET storage_domain_type = 1
WHERE storage_domain_type = 0;
-- promote one Data domain to be a Data (master)
UPDATE storage_domain_static
SET
storage_domain_type = 0,
_update_date = LOCALTIMESTAMP,
last_time_used_as_master = extract(epoch from now())::bigint
WHERE
storage_name = '$pool';
8. Edit the metadata files in the non-master domains to read
"ROLE=Regular" and "MASTER_VERSION=0"; also remove all _SHA_CKSUM
entries.
9. Restart Engine.
10. Restart vdsmd on high-priority SPM Host.
11. Watch that the storage domains come up cleanly and stay up.
12. Restart vdsmd on all other Hosts.
13. Enjoy adult beverage.
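The metadata edits in step 8 can be scripted. A rough sketch, assuming
the usual /rhev/data-center/mnt NFS mount layout -- the DC_MNT path and
the MASTER_SD variable (the UUID of the domain promoted in step 7) are
assumptions to adjust for your environment:

```shell
#!/bin/sh
# Sketch for step 8: demote every non-master domain's metadata file.
# DC_MNT and MASTER_SD are assumptions: the usual oVirt NFS mount root
# and the storage-domain UUID you promoted in step 7 -- adjust both.
DC_MNT="${DC_MNT:-/rhev/data-center/mnt}"
MASTER_SD="${MASTER_SD:-}"

for md in "$DC_MNT"/*/*/dom_md/metadata; do
    [ -f "$md" ] || continue
    # Leave the new master's metadata alone.
    case "$md" in */"$MASTER_SD"/dom_md/metadata) continue ;; esac
    cp -p "$md" "$md.bak"   # extra backup on top of step 6
    sed -i \
        -e 's/^ROLE=.*/ROLE=Regular/' \
        -e 's/^MASTER_VERSION=.*/MASTER_VERSION=0/' \
        -e '/^_SHA_CKSUM/d' "$md"
done
```

Run it only after the backups in step 6; the .bak copies are
belt-and-braces on top of those.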
--
Paul Heinlein
heinlein(a)madboa.com
45°38' N, 122°6' W