Hi Ovirt community,
I am hoping you will be able to help with a problem I am experiencing when
trying to schedule a snapshot of my Gluster volumes using the Ovirt portal.
Below is an overview of the environment:
I have an Ovirt instance running which is managing our Gluster storage. We
are running Ovirt version "4.2.2.6-1.el7.centos", Gluster version
"glusterfs-3.13.2-2.el7" on a base OS of "CentOS Linux release 7.4.1708
(Core)", Kernel "3.10.0-693.21.1.el7.x86_64", VDSM version
"vdsm-4.20.23-1.el7.centos". All of the versions of software are the
latest release and have been fully patched where necessary.
Ovirt has been installed and configured in "Gluster" mode only, no
virtualisation. The Ovirt platform runs from one of the Gluster storage
nodes.
Gluster runs as 2 clusters, each located at a different physical site (UK
and DE). Each storage cluster contains 3 storage nodes and a single Gluster
volume, which is 3-way replicated. The Gluster volume runs on top of an LVM
thin volume which has been provisioned with an XFS filesystem.
Geo-replication runs between the 2 geo-diverse clusters.
The host servers at the primary site each have 1 * Intel(R) Xeon(R) CPU
E3-1270 v5 @ 3.60GHz (8 core with HT), 64GB RAM, LSI MegaRAID SAS 9271 with
BBU and cache, and 8 * SAS 10K 2.5" 1.8TB enterprise drives configured in a
RAID 10 array to give 6.52TB of usable space. The host servers at the
secondary site each have 1 * Intel(R) Xeon(R) CPU E3-1271 v3 @ 3.60GHz (8
core with HT), 32GB RAM, LSI MegaRAID SAS 9260 with BBU and cache, and 8 *
SAS 10K 2.5" 1.8TB enterprise drives configured in a RAID 10 array to give
6.52TB of usable space. The secondary site is for DR use only.
When I first started experiencing the issue and was unable to resolve it,
I carried out a full rebuild from scratch across the two storage clusters.
I had spent some time troubleshooting the issue but felt it worthwhile to
ensure I had a clean platform, void of any potential issues which may be
there due to some of the previous work carried out. The platform was
rebuilt and data re-ingested. It is probably worth mentioning that this
environment will become our new production platform; we will be migrating
data and services to this new platform from our existing Gluster storage
cluster. The date for the migration is getting closer, so available time
has become an issue and will not permit another full rebuild of the
platform without impacting the delivery date.
After the rebuild, with both storage clusters online, available and managed
within the Ovirt platform, I conducted some basic commissioning checks and
found no issues. The next step was to set up Geo-replication. This was
brought online with no issues and data was seen to synchronise without any
problems. At this point the data re-ingestion was started and the new data
was synchronised by the Geo-replication.
The first step in bringing the snapshot schedule online was to validate
that snapshots could be taken outside of the scheduler. Taking a manual
snapshot via the Ovirt portal worked without issue. Several were taken on
both primary and secondary clusters. At this point a schedule was created
on the primary site cluster via the Ovirt portal to create a snapshot of
the storage at hourly intervals. The schedule was created successfully;
however, no snapshots were ever created. Examining the logs did not show
anything which I believed was a direct result of the faulty schedule, but
it is quite possible I missed something.
I reviewed many online articles, bug reports and application manuals in
relation to snapshotting. There were several loosely related support
articles around snapshotting but none of the recommendations seemed to
work. I did the same with the manuals and again found nothing that worked.
What I did find were several references to running snapshots alongside
geo-replication, stating that geo-replication should be paused while a
snapshot is created. So I removed all existing references to any snapshot
schedule, paused the Geo-replication and recreated the snapshot schedule.
The schedule was never actioned and no snapshots were created. I then
removed Geo-replication entirely, removed all schedules and rebooted the
entire platform. When the system was fully back online with no pending heal
operations, the schedule was re-added for the primary site only. There was
no difference in the results and no snapshots were created from the
schedule.
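For anyone wanting to reproduce the pause, snapshot, resume sequence
outside the portal, here is a minimal sketch in Python that builds the
equivalent Gluster CLI calls. It assumes the standard "gluster volume
geo-replication ... pause/resume" and "gluster snapshot create" commands
are used in place of the portal's own mechanism. The volume, host and
snapshot names are hypothetical placeholders and are not taken from this
environment. By default it only prints the commands (dry run).

```python
import subprocess


def snapshot_commands(master_vol, slave_host, slave_vol, snap_name):
    """Return the gluster CLI calls, in order, as argv lists."""
    # Identifies the geo-replication session: <master> <slave_host>::<slave_vol>
    session = ["gluster", "volume", "geo-replication",
               master_vol, f"{slave_host}::{slave_vol}"]
    return [
        session + ["pause"],
        ["gluster", "snapshot", "create", snap_name, master_vol],
        session + ["resume"],
    ]


def execute(argv, dry_run):
    """Print the command; run it only when dry_run is False."""
    print(" ".join(argv))
    if not dry_run:
        subprocess.run(argv, check=True)


def snapshot_with_paused_georep(master_vol, slave_host, slave_vol,
                                snap_name, dry_run=True):
    """Pause geo-replication, take one snapshot, then always resume."""
    pause, create, resume = snapshot_commands(
        master_vol, slave_host, slave_vol, snap_name)
    execute(pause, dry_run)
    try:
        execute(create, dry_run)
    finally:
        # Resume even if the snapshot fails, so the session is not left paused.
        execute(resume, dry_run)


if __name__ == "__main__":
    # Hypothetical names for illustration only.
    snapshot_with_paused_georep("datavol", "de-node1", "datavol", "manual-test")
```

Running it as-is prints the three commands without touching the cluster;
passing dry_run=False would execute them on a node where the gluster CLI is
available.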
I have now reached the point where I feel I require assistance and hence
this email request.
If you require any further data then please let me know and I will do my
best to get it for you.
Any help you can give would be greatly appreciated.
Many thanks,
Mark Betham