Working but unstable storage domains

I'm running oVirt Engine 3.2.0-2.fc18 (which I know is out of date) on a dedicated physical host; we have 12 hosts split between two clusters and nine storage domains, all NFS.

Late last week, a VM that in the scope of our clusters consumes a lot of resources failed in migration. Since then the storage domains have, from the engine's point of view, been going up and down (though the underlying NFS exports are fine). Key symptoms from the oVirt Manager:

* two of the storage domains are always marked as having type "Data (Master)" when historically only one was;

* the Manager reports "Storage Pool Manager runs on $host" then "Sync Error on Master Domain..." then "Reconstruct Master Domain ...completed" then "Data Center is being initialized" over and over and over again.

The Sync Error messages indicate "$pool is marked as Master in oVirt Engine database but not on the Storage side. Please consult with Support on how to fix this issue." Note that $pool changes between the various domains that get marked as Data (Master).

Clues, anyone? I'm happy to provide logs (though they're all quite large).

--
Paul Heinlein
heinlein@madboa.com
45°38' N, 122°6' W
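The "marked as Master in the database but not on the Storage side" disagreement can be inspected directly on any host that mounts the exports. A minimal diagnostic sketch, assuming the standard vdsm NFS mount layout under /rhev/data-center/mnt (the path and layout are assumptions; adjust for your setup):

```shell
# List the ROLE and MASTER_VERSION each storage domain claims on the
# storage side, so they can be compared with what the engine database
# believes. The default root below is the usual vdsm NFS mount layout,
# but it may differ on other setups.
list_master_claims() {
    root="${1:-/rhev/data-center/mnt}"
    find "$root" -type f -path '*/dom_md/metadata' 2>/dev/null |
    while read -r md; do
        echo "== $md"
        grep -E '^(ROLE|MASTER_VERSION)=' "$md"
    done
}

list_master_claims
```

More than one metadata file reporting ROLE=Master, or a MASTER_VERSION that disagrees with the engine database, would match the flapping described above.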

Hi Paul,

Can you please attach the logs?

Regards,
Maor

On 06/10/2014 09:36 PM, Paul Heinlein wrote:
> I'm running oVirt Engine 3.2.0-2.fc18 (which I know is out of date) on a dedicated physical host; we have 12 hosts split between two clusters and nine storage domains, all NFS.
>
> Late last week, a VM that in the scope of our clusters consumes a lot of resources failed in migration. Since then the storage domains have, from the engine's point of view, been going up and down (though the underlying NFS exports are fine). Key symptoms from the oVirt Manager:
>
> * two of the storage domains are always marked as having type "Data (Master)" when historically only one was;
>
> * the Manager reports "Storage Pool Manager runs on $host" then "Sync Error on Master Domain..." then "Reconstruct Master Domain ...completed" then "Data Center is being initialized" over and over and over again.
>
> The Sync Error messages indicate "$pool is marked as Master in oVirt Engine database but not on the Storage side. Please consult with Support on how to fix this issue." Note that $pool changes between the various domains that get marked as Data (Master).
>
> Clues, anyone? I'm happy to provide logs (though they're all quite large).
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users

On Tue, 10 Jun 2014, Paul Heinlein wrote:
> I'm running oVirt Engine 3.2.0-2.fc18 (which I know is out of date) on a dedicated physical host; we have 12 hosts split between two clusters and nine storage domains, all NFS.
>
> Late last week, a VM that in the scope of our clusters consumes a lot of resources failed in migration. Since then the storage domains have, from the engine's point of view, been going up and down (though the underlying NFS exports are fine). Key symptoms from the oVirt Manager:
>
> * two of the storage domains are always marked as having type "Data (Master)" when historically only one was;
>
> * the Manager reports "Storage Pool Manager runs on $host" then "Sync Error on Master Domain..." then "Reconstruct Master Domain ...completed" then "Data Center is being initialized" over and over and over again.
>
> The Sync Error messages indicate "$pool is marked as Master in oVirt Engine database but not on the Storage side. Please consult with Support on how to fix this issue." Note that $pool changes between the various domains that get marked as Data (Master).
For the record, here's what I ended up doing to solve the issue. It took quite a while to identify the moving parts, but the actual fix wasn't terribly hard. The secondary goal was to keep all VMs running during the process, since we had some ongoing benchmarking we didn't want to interrupt. I'll say ahead of time that I didn't like mucking around in postgres, but all my attempts to fix the issue without doing so came to naught.

1. Disable power management on all Hosts: necessary because Hosts would
   be rebooted if the Engine couldn't contact vdsmd after a relatively
   short period of time.

2. Shut down ovirt-engine.

3. Shut down vdsmd on all Hosts.

4. Query postgresql for the current master version:

     SELECT master_domain_version FROM storage_pool;

5. Determine which storage pool has the metadata file corresponding to
   that master version number; it will become the new master:

     find /rhev/... -type f -name metadata -ls -exec grep ^MASTER {} \;

6. Back up all dom_md directories (with metadata files) and master
   directories (with *.ovf files).

7. Run database queries to set the pool with the highest master version
   number to be Data (Master):

     -- make all Data (Master) domains into regular Data domains
     UPDATE storage_domain_static
        SET storage_domain_type = 1
      WHERE storage_domain_type = 0;

     -- promote one Data domain to be Data (Master)
     UPDATE storage_domain_static
        SET storage_domain_type = 0,
            _update_date = LOCALTIMESTAMP,
            last_time_used_as_master = extract(epoch from now())::bigint
      WHERE storage_name = '$pool';

8. Edit the metadata files in the non-master pools to read
   "ROLE=Regular" and "MASTER_VERSION=0"; also remove all _SHA_CKSUM
   entries.

9. Restart the Engine.

10. Restart vdsmd on the high-priority SPM Host.

11. Watch that the storage pools come up and stay up cleanly.

12. Restart vdsmd on all other Hosts.

13. Enjoy an adult beverage.

--
Paul Heinlein
heinlein@madboa.com
45°38' N, 122°6' W
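Steps 6 and 8 of the procedure above can be sketched as a small shell function. This is a hedged sketch, not Paul's exact commands: the function name is invented, it takes the path of one domain's metadata file, and it assumes GNU sed for in-place editing:

```shell
demote_domain_metadata() {
    # Sketch of steps 6 and 8: back up a domain's metadata file, then
    # rewrite it so the domain no longer claims the master role.
    # Removing the _SHA_CKSUM line matters because an edited file whose
    # stored checksum no longer matches would be rejected as corrupt.
    md="$1"
    cp -p "$md" "$md.bak.$(date +%Y%m%d%H%M%S)"   # keep an untouched copy
    sed -i \
        -e 's/^ROLE=.*/ROLE=Regular/' \
        -e 's/^MASTER_VERSION=.*/MASTER_VERSION=0/' \
        -e '/^_SHA_CKSUM=/d' \
        "$md"
}
```

Run it once per non-master domain while vdsmd is stopped on all hosts (steps 2-3), so nothing rewrites the files underneath you.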
participants (2)
-
Maor Lipchuk
-
Paul Heinlein