Hi Sahina,

Many thanks for your response.

I have now raised a bug against this issue.  For your reference it is bug #1578257 - https://bugzilla.redhat.com/show_bug.cgi?id=1578257

I will enable debugging today as requested and attach the logs to the bug report.
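For reference, my understanding of the linked procedure is that it amounts to switching the org.ovirt logger to DEBUG in the engine's JBoss logging config and restarting ovirt-engine; the file path and the exact fragment below are my assumptions from that page, not something I have verified here.  A runnable sketch of the edit itself, performed on a scratch copy of the fragment:

```shell
#!/bin/sh
# Sketch of the debug-log change.  The real file lives wherever the
# linked page says (commonly /usr/share/ovirt-engine/services/ovirt-engine/
# ovirt-engine.xml.in -- an assumption, so check your install); here the
# same edit is applied to a scratch copy of the relevant fragment.
conf=$(mktemp)
cat > "$conf" <<'EOF'
<logger category="org.ovirt">
  <level name="INFO"/>
</logger>
EOF

# Raise the org.ovirt logger from INFO to DEBUG:
sed -i 's/name="INFO"/name="DEBUG"/' "$conf"

level=$(sed -n 's/.*<level name="\([A-Z]*\)".*/\1/p' "$conf")
echo "org.ovirt logger level is now: $level"

# On the real engine you would then restart it:
#   systemctl restart ovirt-engine
```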

Many thanks,

Mark Betham


On 14 May 2018, at 12:34, Sahina Bose <sabose@redhat.com> wrote:



On Mon, May 14, 2018 at 4:07 PM, Mark Betham <mark.betham@googlemail.com> wrote:
Hi Sahina,

Many thanks for your response and apologies for my delay in getting back to you.


How was the schedule created - is this using the Remote Data Sync Setup under Storage domain?

Ovirt is configured in ‘Gluster’ mode, with no VM support.  When snapshotting, we take a snapshot of the full Gluster volume.

To configure the snapshot schedule I did the following:
1. Logged in to the Ovirt WebUI.
2. From the left-hand menu selected ‘Storage’ and then ‘Volumes’.
3. Selected the volume I wanted to snapshot by clicking the link in the ‘Name’ column.
4. From here selected the ‘Snapshots’ tab.
5. From the top menu opened the ‘Snapshot’ drop-down and selected ‘New’.
6. A new window appeared, titled ‘Create/Schedule Snapshot’.
7. Entered a snapshot prefix and description into the available fields and selected the ‘Schedule’ page.
8. On the schedule page selected ‘Minute’ from the ‘Recurrence’ drop-down.
9. Set ‘Interval’ to every ‘30’ minutes.
10. Changed the timezone to ‘Europe/London=(GMT+00:00) London Standard Time’.
11. Left ‘Start Schedule by’ at its default value.
12. Set the schedule to ‘No End Date’.
13. Clicked ‘OK’.
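As an aside, my understanding is that the engine hands this schedule to Quartz (the QuartzOvirtDBScheduler threads visible in engine.log), where an every-30-minutes recurrence behaves like a cron trigger such as ‘0 0/30 * * * ?’.  A small self-contained sketch (my own illustration, not oVirt code) of the fire times such a schedule should produce:

```shell
#!/bin/sh
# Illustration only (not oVirt code): given the time a schedule was
# saved, print the next four fire times an every-30-minutes recurrence
# should produce, aligned to :00/:30 past the hour.
saved="2018-05-14 09:24:11"   # example save time, taken as UTC
epoch=$(date -u -d "$saved" +%s)

# Round up to the next 1800-second (half-hour) boundary.
next=$(( (epoch / 1800 + 1) * 1800 ))

times=""
for i in 0 1 2 3; do
  times="$times$(date -u -d "@$((next + i * 1800))" '+%H:%M') "
done
echo "expected snapshots at: $times"
```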

Interestingly, I get the following messages on the ‘Create/Schedule Snapshot’ page before clicking OK:
Frequent creation of snapshots would overload the cluster
Gluster CLI based snapshot scheduling is enabled. It would be disabled once volume snapshots scheduled from UI.

What is interesting is that I have not enabled 'Gluster CLI based snapshot scheduling’.

After clicking OK I am returned to the Volume Snapshots tab.

From this point I get no snapshots created according to the schedule set.

At the time of clicking OK in the WebUI to enable the schedule, I get the following in the engine log:
2018-05-14 09:24:11,068Z WARN  [org.ovirt.engine.core.dal.job.ExecutionMessageDirector] (default task-128) [85d0b16f-2c0c-464f-bbf1-682c062a4871] The message key 'ScheduleGlusterVolumeSnapshot' is missing from 'bundles/ExecutionMessages'
2018-05-14 09:24:11,090Z INFO  [org.ovirt.engine.core.bll.gluster.ScheduleGlusterVolumeSnapshotCommand] (default task-128) [85d0b16f-2c0c-464f-bbf1-682c062a4871] Before acquiring and wait lock 'EngineLock:{exclusiveLocks='[712da1df-4c11-405a-8fb6-f99aebc185c1=GLUSTER_SNAPSHOT]', sharedLocks=''}'
2018-05-14 09:24:11,090Z INFO  [org.ovirt.engine.core.bll.gluster.ScheduleGlusterVolumeSnapshotCommand] (default task-128) [85d0b16f-2c0c-464f-bbf1-682c062a4871] Lock-wait acquired to object 'EngineLock:{exclusiveLocks='[712da1df-4c11-405a-8fb6-f99aebc185c1=GLUSTER_SNAPSHOT]', sharedLocks=''}'
2018-05-14 09:24:11,111Z INFO  [org.ovirt.engine.core.bll.gluster.ScheduleGlusterVolumeSnapshotCommand] (default task-128) [85d0b16f-2c0c-464f-bbf1-682c062a4871] Running command: ScheduleGlusterVolumeSnapshotCommand internal: false. Entities affected :  ID: 712da1df-4c11-405a-8fb6-f99aebc185c1 Type: GlusterVolumeAction group MANIPULATE_GLUSTER_VOLUME with role type ADMIN
2018-05-14 09:24:11,148Z INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-128) [85d0b16f-2c0c-464f-bbf1-682c062a4871] EVENT_ID: GLUSTER_VOLUME_SNAPSHOT_SCHEDULED(4,134), Snapshots scheduled on volume glustervol0 of cluster NOSS-LD5.
2018-05-14 09:24:11,156Z INFO  [org.ovirt.engine.core.bll.gluster.ScheduleGlusterVolumeSnapshotCommand] (default task-128) [85d0b16f-2c0c-464f-bbf1-682c062a4871] Lock freed to object 'EngineLock:{exclusiveLocks='[712da1df-4c11-405a-8fb6-f99aebc185c1=GLUSTER_SNAPSHOT]', sharedLocks=''}'
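All of those lines share the correlation ID 85d0b16f-2c0c-464f-bbf1-682c062a4871, which makes it easy to pull out everything belonging to this one scheduling request.  A self-contained sketch of the kind of filtering I mean (the scratch file stands in for /var/log/ovirt-engine/engine.log, and the sample lines are abbreviated):

```shell
#!/bin/sh
# Filter an engine.log by correlation ID.  The scratch file below stands
# in for /var/log/ovirt-engine/engine.log on a real engine host.
log=$(mktemp)
cat > "$log" <<'EOF'
2018-05-14 09:24:11,068Z WARN  [...] (default task-128) [85d0b16f-2c0c-464f-bbf1-682c062a4871] The message key 'ScheduleGlusterVolumeSnapshot' is missing
2018-05-14 09:24:35,500Z INFO  [...] (default task-129) [11111111-aaaa-bbbb-cccc-222222222222] Some unrelated request
2018-05-14 09:24:11,111Z INFO  [...] (default task-128) [85d0b16f-2c0c-464f-bbf1-682c062a4871] Running command: ScheduleGlusterVolumeSnapshotCommand
EOF

cid=85d0b16f-2c0c-464f-bbf1-682c062a4871

# Every line belonging to that one request:
grep "$cid" "$log"

# And how many there are:
count=$(grep -c "$cid" "$log")
echo "lines for $cid: $count"
rm -f "$log"
```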

Could you please provide the engine.log from the time the schedule was setup and including the time the schedule was supposed to run?

The original log file is no longer present, so earlier today I removed the old schedule and created a new one, as per the instructions above.  I have therefore attached the engine log from today.  The new schedule, which was set to run every 30 minutes, has not produced any snapshots after around 2 hours.

Please let me know if you require any further information.


I see the following messages in logs:
2018-05-14 04:30:00,018Z ERROR [org.ovirt.engine.core.utils.timer.JobWrapper] (QuartzOvirtDBScheduler9) [d0c31a9] Failed to invoke scheduled method onTimer: null

Can you log a bug - and we will dig into this further.

To speed things up, if you could enable debug logs (I think using https://www.ovirt.org/develop/developer-guide/engine/engine-development-environment/#enable-debug-log---restart-required) and attach the exception, that would help a lot.


Many thanks,

Mark Betham.


On Thu, May 3, 2018 at 4:37 PM, Mark Betham <mark.betham@googlemail.com> wrote:
Hi Ovirt community,

I am hoping you will be able to help with a problem I am experiencing when trying to schedule a snapshot of my Gluster volumes using the Ovirt portal.

Below is an overview of the environment;

I have an Ovirt instance running which is managing our Gluster storage.  We are running Ovirt version "4.2.2.6-1.el7.centos" and Gluster version "glusterfs-3.13.2-2.el7" on a base OS of "CentOS Linux release 7.4.1708 (Core)", kernel "3.10.0-693.21.1.el7.x86_64", with VDSM version "vdsm-4.20.23-1.el7.centos".  All of the software is at the latest release and has been fully patched where necessary.

Ovirt has been installed and configured in "Gluster" mode only, no virtualisation.  The Ovirt platform runs from one of the Gluster storage nodes.

Gluster runs with 2 clusters, each located at a different physical site (UK and DE).  Each storage cluster contains 3 storage nodes and a single Gluster volume.  The Gluster volume is 3-way replicated and runs on top of an LVM thin volume provisioned with an XFS filesystem.  The system runs geo-replication between the 2 geo-diverse clusters.

The host servers running at the primary site are of specification 1 * Intel(R) Xeon(R) CPU E3-1270 v5 @ 3.60GHz (8 core with HT), 64GB RAM, LSI MegaRAID SAS 9271 with BBU and cache, and 8 * SAS 10K 2.5" 1.8TB enterprise drives configured in a RAID 10 array to give 6.52TB of usable space.  The host servers running at the secondary site are of specification 1 * Intel(R) Xeon(R) CPU E3-1271 v3 @ 3.60GHz (8 core with HT), 32GB RAM, LSI MegaRAID SAS 9260 with BBU and cache, and 8 * SAS 10K 2.5" 1.8TB enterprise drives configured in a RAID 10 array to give 6.52TB of usable space.  The secondary site is for DR use only.

When I first started experiencing the issue and was unable to resolve it, I carried out a full rebuild from scratch across the two storage clusters.  I had spent some time troubleshooting the issue but felt it worthwhile to ensure I had a clean platform, void of any potential issues left over from previous work.  The platform was rebuilt and the data re-ingested.  It is probably worth mentioning that this environment will become our new production platform; we will be migrating data and services to it from our existing Gluster storage cluster.  The date for the migration activity is getting closer, so available time has become an issue and will not permit another full rebuild of the platform without impacting the delivery date.

After the rebuild with both storage clusters online, available and managed within the Ovirt platform I conducted some basic commissioning checks and I found no issues.  The next step I took at this point was to setup the Geo-replication.  This was brought online with no issues and data was seen to be synchronised without any problems.  At this point the data re-ingestion was started and the new data was synchronised by the Geo-replication.

The first step in bringing the snapshot schedule online was to validate that snapshots could be taken outside of the scheduler.  Taking a manual snapshot via the Ovirt portal worked without issue; several were taken on both the primary and secondary clusters.  At this point a schedule was created on the primary site cluster via the Ovirt portal to snapshot the storage at hourly intervals.  The schedule was created successfully, however no snapshots were ever created.  Examining the logs did not show anything which I believed was a direct result of the faulty schedule, but it is quite possible I missed something.

How was the schedule created - is this using the Remote Data Sync Setup under Storage domain?


I reviewed many online articles, bug reports and application manuals relating to snapshotting.  There were several loosely related support articles, but none of their recommendations worked, and the manuals were no different.  What I did find were several references to running snapshots alongside geo-replication, which said that geo-replication should be paused when creating a snapshot.  So I removed all existing snapshot schedules, paused the Geo-repl and recreated the snapshot schedule.  The schedule was never actioned and no snapshots were created.  I then removed Geo-repl entirely, removed all schedules and rebooted the entire platform.  When the system was fully back online, with no pending heal operations, the schedule was re-added for the primary site only.  This made no difference to the results and no snapshots were created from the schedule.
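For anyone following along, the pause/snapshot/resume ordering those references describe looks roughly like this on the CLI.  The volume and slave names are placeholders, and because this sketch has to run without a cluster, a stub gluster shell function stands in for the real binary; on a storage node you would drop the stub and use the actual commands:

```shell
#!/bin/sh
# Stub so the sketch runs without a cluster; remove this on a real node.
gluster() { echo "gluster $*"; }

MASTER=glustervol0                 # placeholder master volume
SLAVE="drhost::glustervol0-dr"     # placeholder geo-rep slave

# 1. Pause geo-replication so the snapshot is taken on a quiesced session.
pause_out=$(gluster volume geo-replication "$MASTER" "$SLAVE" pause)
echo "$pause_out"

# 2. Snapshot the whole volume.
gluster snapshot create "manual-snap" "$MASTER" no-timestamp

# 3. Resume geo-replication afterwards.
gluster volume geo-replication "$MASTER" "$SLAVE" resume
```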

I have now reached the point where I feel I require assistance and hence this email request.

If you require any further data then please let me know and I will do my best to get it for you.

Could you please provide the engine.log from the time the schedule was setup and including the time the schedule was supposed to run?



Any help you can give would be greatly appreciated.

Many thanks,

Mark Betham

_______________________________________________
Users mailing list