Hi oVirt community,

I am hoping you will be able to help with a problem I am experiencing when trying to schedule a snapshot of my Gluster volumes using the oVirt portal.

Below is an overview of the environment:

I have an oVirt instance running which manages our Gluster storage. We are running oVirt version "4.2.2.6-1.el7.centos" and Gluster version "glusterfs-3.13.2-2.el7" on a base OS of "CentOS Linux release 7.4.1708 (Core)", kernel "3.10.0-693.21.1.el7.x86_64", with VDSM version "vdsm-4.20.23-1.el7.centos". All software is at the latest release and has been fully patched where necessary.

oVirt has been installed and configured in "Gluster" mode only, with no virtualisation. The oVirt platform runs from one of the Gluster storage nodes.

Gluster runs as 2 clusters, each located at a different physical site (UK and DE). Each storage cluster contains 3 storage nodes and a single Gluster volume. The Gluster volume is a 3-way replicated volume. It runs on top of an LVM thin volume which has been provisioned with an XFS filesystem. Geo-replication runs between the 2 geo-diverse clusters.

The host servers at the primary site each have 1 * Intel(R) Xeon(R) CPU E3-1270 v5 @ 3.60GHz (8 core with HT), 64GB RAM, an LSI MegaRAID SAS 9271 with BBU and cache, and 8 * SAS 10K 2.5" 1.8TB enterprise drives configured in a RAID 10 array to give 6.52TB of useable space. The host servers at the secondary site each have 1 * Intel(R) Xeon(R) CPU E3-1271 v3 @ 3.60GHz (8 core with HT), 32GB RAM, an LSI MegaRAID SAS 9260 with BBU and cache, and 8 * SAS 10K 2.5" 1.8TB enterprise drives configured in a RAID 10 array to give 6.52TB of useable space. The secondary site is for DR use only.

When I first started experiencing the issue and was unable to resolve it, I carried out a full rebuild from scratch across the two storage clusters. I had spent some time troubleshooting but felt it worthwhile to ensure I had a clean platform, free of any issues left over from the previous work carried out. The platform was rebuilt and the data re-ingested. It is probably worth mentioning that this environment will become our new production platform; we will be migrating data and services to it from our existing Gluster storage cluster. The migration date is getting closer, so time is limited and will not permit another full rebuild of the platform without impacting the delivery date.

After the rebuild, with both storage clusters online, available and managed within the oVirt platform, I conducted some basic commissioning checks and found no issues. The next step was to set up the geo-replication. This was brought online with no issues and data was seen to synchronise without any problems. At this point the data re-ingestion was started and the new data was synchronised by the geo-replication.

The first step in bringing the snapshot schedule online was to validate that snapshots could be taken outside of the scheduler. Taking a manual snapshot via the oVirt portal worked without issue, and several were taken on both the primary and secondary clusters. At this point a schedule was created on the primary site cluster via the oVirt portal to snapshot the storage at hourly intervals. The schedule was created successfully; however, no snapshots were ever created. Examining the logs did not show anything that I could attribute directly to the faulty schedule, but it is quite possible I missed something.

I reviewed many online articles, bug reports and application manuals relating to snapshots. There were several loosely related support articles, but none of their recommendations worked, and the manuals offered nothing that helped either. What I did find were several references to running snapshots alongside geo-replication, stating that geo-replication should be paused while a snapshot is created. So I removed all existing snapshot schedules, paused the geo-replication and recreated the snapshot schedule. The schedule was never actioned and no snapshots were created. I then removed the geo-replication entirely, removed all schedules and rebooted the entire platform. Once the system was fully back online with no pending heal operations, I re-added the schedule for the primary site only. The result was the same: no snapshots were created from the schedule.
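
For anyone who wants to repeat the manual checks described above from the command line rather than the portal, the following is a minimal sketch, not the exact procedure used here. It only builds and prints standard gluster CLI calls (heal info, geo-replication status/pause/resume, snapshot create/list) and does not run them. The volume, host and snapshot names are hypothetical placeholders, and none of this exercises the oVirt scheduler itself.

```python
import shlex


def manual_check_commands(volume, slave_host, slave_volume, snap_name):
    """Build (but do not run) gluster CLI calls for a manual snapshot check.

    All argument values are placeholders supplied by the caller; nothing
    here is taken from the environment described in this email.
    """
    # Common prefix for every geo-replication sub-command on this session.
    geo_rep = [
        "gluster", "volume", "geo-replication",
        volume, f"{slave_host}::{slave_volume}",
    ]
    return [
        # Confirm there are no pending heal operations first.
        ["gluster", "volume", "heal", volume, "info"],
        # Check the geo-replication session, then pause it.
        geo_rep + ["status"],
        geo_rep + ["pause"],
        # Take a manual snapshot and confirm that it is listed.
        ["gluster", "snapshot", "create", snap_name, volume],
        ["gluster", "snapshot", "list", volume],
        # Resume geo-replication afterwards.
        geo_rep + ["resume"],
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for argv in manual_check_commands("datavol", "de-node1", "datavol-dr", "manual_check"):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Printing the commands rather than executing them keeps the sketch safe to run on any machine; each line can then be reviewed and run by hand on a storage node.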

I have now reached the point where I feel I need assistance, hence this email.

If you require any further data, please let me know and I will do my best to get it for you.

Any help you can give would be greatly appreciated.

Many thanks,

Mark Betham