On Mon, May 14, 2018 at 4:07 PM, Mark Betham <mark.betham(a)googlemail.com>
wrote:
Hi Sahina,
Many thanks for your response and apologies for my delay in getting back
to you.
How was the schedule created - is this using the Remote Data Sync Setup
under Storage domain?
Ovirt is configured in ‘Gluster’ mode, no VM support. When snapshotting
we are taking a snapshot of the full Gluster volume.
To configure the snapshot schedule I did the following:
1. Logged in to the Ovirt WebUI
2. From the left-hand menu selected ‘Storage’ and ‘Volumes’
3. Selected the volume I wanted to snapshot by clicking on the link
   within the ‘Name’ column
4. From here selected the ‘Snapshots’ tab
5. From the top menu options selected the drop down ‘Snapshot’
6. From the drop down options selected ‘New’
7. A new window appeared titled ‘Create/Schedule Snapshot’
8. Entered a snapshot prefix and description into the available fields
   and selected the ‘Schedule’ page
9. On the schedule page selected ‘Minute’ from the ‘Recurrence’ drop down
10. Set ‘Interval’ to every ‘30’ minutes
11. Changed timezone to ‘Europe/London=(GMT+00:00) London Standard Time’
12. Left the value in ‘Start Schedule by’ at its default
13. Set schedule to ‘No End Date’
14. Clicked ‘OK’
Interestingly, I get the following message on the ‘Create/Schedule
Snapshot’ page before clicking on OK:
*Frequent creation of snapshots would overload the cluster*
*Gluster CLI based snapshot scheduling is enabled. It would be disabled
once volume snapshots scheduled from UI.*
What is interesting is that I have not enabled ‘Gluster CLI based snapshot
scheduling’.
After clicking OK I am returned to the Volume Snapshots tab.
From this point I get no snapshots created according to the schedule set.
At the time of clicking OK in the WebUI to enable the schedule I get the
following in the engine log:
*2018-05-14 09:24:11,068Z WARN
[org.ovirt.engine.core.dal.job.ExecutionMessageDirector] (default
task-128) [85d0b16f-2c0c-464f-bbf1-682c062a4871] The message key
'ScheduleGlusterVolumeSnapshot' is missing from
'bundles/ExecutionMessages'*
*2018-05-14 09:24:11,090Z INFO
[org.ovirt.engine.core.bll.gluster.ScheduleGlusterVolumeSnapshotCommand]
(default task-128) [85d0b16f-2c0c-464f-bbf1-682c062a4871] Before acquiring
and wait lock
'EngineLock:{exclusiveLocks='[712da1df-4c11-405a-8fb6-f99aebc185c1=GLUSTER_SNAPSHOT]',
sharedLocks=''}'*
*2018-05-14 09:24:11,090Z INFO
[org.ovirt.engine.core.bll.gluster.ScheduleGlusterVolumeSnapshotCommand]
(default task-128) [85d0b16f-2c0c-464f-bbf1-682c062a4871] Lock-wait
acquired to object
'EngineLock:{exclusiveLocks='[712da1df-4c11-405a-8fb6-f99aebc185c1=GLUSTER_SNAPSHOT]',
sharedLocks=''}'*
*2018-05-14 09:24:11,111Z INFO
[org.ovirt.engine.core.bll.gluster.ScheduleGlusterVolumeSnapshotCommand]
(default task-128) [85d0b16f-2c0c-464f-bbf1-682c062a4871] Running command:
ScheduleGlusterVolumeSnapshotCommand internal: false. Entities affected :
ID: 712da1df-4c11-405a-8fb6-f99aebc185c1 Type: GlusterVolumeAction group
MANIPULATE_GLUSTER_VOLUME with role type ADMIN*
*2018-05-14 09:24:11,148Z INFO
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(default task-128) [85d0b16f-2c0c-464f-bbf1-682c062a4871] EVENT_ID:
GLUSTER_VOLUME_SNAPSHOT_SCHEDULED(4,134), Snapshots scheduled on volume
glustervol0 of cluster NOSS-LD5.*
*2018-05-14 09:24:11,156Z INFO
[org.ovirt.engine.core.bll.gluster.ScheduleGlusterVolumeSnapshotCommand]
(default task-128) [85d0b16f-2c0c-464f-bbf1-682c062a4871] Lock freed to
object
'EngineLock:{exclusiveLocks='[712da1df-4c11-405a-8fb6-f99aebc185c1=GLUSTER_SNAPSHOT]',
sharedLocks=''}'*
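For anyone following the thread, here is a minimal sketch of how the entries above could be pulled out of engine.log by correlation ID (the bracketed value such as 85d0b16f-2c0c-464f-bbf1-682c062a4871). The field layout in the regular expression is inferred from the entries quoted above and is an assumption, as is the function name `group_by_correlation`; it expects unwrapped log lines, not the wrapped form shown in this mail.

```python
import re
from collections import defaultdict

# Assumed layout of one unwrapped engine.log entry, inferred from the
# entries quoted in this thread:
#   <timestamp>Z <LEVEL> [<logger>] (<thread>) [<correlation id>] <message>
ENTRY = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})Z\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<logger>[^\]]+)\]\s+"
    r"\((?P<thread>[^)]+)\)\s+"
    r"\[(?P<corr>[^\]]*)\]\s+"
    r"(?P<msg>.*)$"
)


def group_by_correlation(lines):
    """Group parsed log entries by correlation ID (hypothetical helper).

    Lines that do not match the assumed layout are skipped.
    """
    groups = defaultdict(list)
    for line in lines:
        match = ENTRY.match(line.rstrip("\n"))
        if match:
            groups[match.group("corr")].append(match.groupdict())
    return dict(groups)
```

To use it, pass an open file handle for the engine log (assumed here to be /var/log/ovirt-engine/engine.log) and look up the correlation ID of the scheduling request.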
Could you please provide the engine.log from the time the schedule was
set up and including the time the schedule was supposed to run?
The original log file is no longer present, so earlier today I removed the
old schedule and created a new one, as per the instructions above.
I have therefore attached the engine log from today. The new schedule,
which was set to run every 30 minutes, has not produced any snapshots after
around 2 hours.
Please let me know if you require any further information.
I see the following messages in logs:
2018-05-14 04:30:00,018Z ERROR
[org.ovirt.engine.core.utils.timer.JobWrapper] (QuartzOvirtDBScheduler9)
[d0c31a9] Failed to invoke scheduled method onTimer: null
Can you log a bug? We will dig into this further.
To speed things up, if you could enable debug logs (I think using
https://www.ovirt.org/develop/developer-guide/engine/engine-development-e...)
and attach the exception, that would help a lot.
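As a rough aid for collecting the exception to attach to the bug, the sketch below extracts each "Failed to invoke scheduled method" entry from engine.log together with any continuation lines that follow it (for example a stack trace, once debug logging is enabled). It assumes that every new log entry starts with a timestamp and that continuation lines do not; the function name `extract_failures` is hypothetical.

```python
import re

# Assumption: every new engine.log entry begins with a timestamp such as
# "2018-05-14 04:30:00,018"; continuation lines (stack traces) do not.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}")


def extract_failures(lines, marker="Failed to invoke scheduled method"):
    """Return a list of entries, each a list of lines: the matching
    log line followed by its continuation lines (hypothetical helper)."""
    failures = []
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if TIMESTAMP.match(line):
            # A new entry starts; track it only if it contains the marker.
            current = [line] if marker in line else None
            if current is not None:
                failures.append(current)
        elif current is not None:
            # Continuation of the failure entry currently being tracked.
            current.append(line)
    return failures
```

Each returned entry can be joined with newlines and pasted into the bug report.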
Many thanks,
Mark Betham.
On Thu, May 3, 2018 at 4:37 PM, Mark Betham <mark.betham(a)googlemail.com>
wrote:
Hi Ovirt community,
I am hoping you will be able to help with a problem I am experiencing when
trying to schedule a snapshot of my Gluster volumes using the Ovirt portal.
Below is an overview of the environment:
I have an Ovirt instance running which is managing our Gluster storage.
We are running Ovirt version "4.2.2.6-1.el7.centos", Gluster version
"glusterfs-3.13.2-2.el7" on a base OS of "CentOS Linux release 7.4.1708
(Core)", Kernel "3.10.0-693.21.1.el7.x86_64", VDSM version
"vdsm-4.20.23-1.el7.centos". All of the versions of software are the
latest release and have been fully patched where necessary.
Ovirt has been installed and configured in "Gluster" mode only, no
virtualisation. The Ovirt platform runs from one of the Gluster storage
nodes.
Gluster runs with 2 clusters, each located at a different physical site
(UK and DE). Each of the storage clusters contains 3 storage nodes and a
single Gluster volume. The Gluster volume is 3-way replicated. It runs on
top of an LVM thin volume which has been provisioned with an XFS
filesystem. The system is running Geo-replication between the 2
geo-diverse clusters.
The host servers running at the primary site are of specification 1 *
Intel(R) Xeon(R) CPU E3-1270 v5 @ 3.60GHz (8 core with HT), 64GB RAM, LSI
MegaRAID SAS 9271 with BBU and cache, 8 * SAS 10K 2.5" 1.8TB enterprise
drives configured in a RAID 10 array to give 6.52TB of useable space. The
host servers running at the secondary site are of specification 1 *
Intel(R) Xeon(R) CPU E3-1271 v3 @ 3.60GHz (8 core with HT), 32GB RAM, LSI
MegaRAID SAS 9260 with BBU and cache, 8 * SAS 10K 2.5" 1.8TB enterprise
drives configured in a RAID 10 array to give 6.52TB of useable space. The
secondary site is for DR use only.
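As a sanity check on the quoted capacity, here is a small arithmetic sketch. It assumes the drives are 1.8TB in decimal units (10^12 bytes) and that the 6.52TB figure is reported in binary units (2^40 bytes); both are assumptions, and the function name is hypothetical.

```python
def raid10_usable_tib(drive_count, drive_tb):
    """Usable RAID 10 capacity in binary terabytes (TiB).

    Assumes drive_tb is a decimal figure (1 TB = 10**12 bytes) and that
    RAID 10 mirrors every drive, halving the raw capacity.
    """
    raw_bytes = drive_count * drive_tb * 1e12
    usable_bytes = raw_bytes / 2
    return usable_bytes / 2**40


print(round(raid10_usable_tib(8, 1.8), 2))  # prints 6.55
```

Under those assumptions the result is close to the 6.52TB quoted; the small remainder could be down to the exact drive capacity or controller overhead, which is not something this sketch models.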
When I first started experiencing the issue and was unable to resolve it,
I carried out a full rebuild from scratch across the two storage clusters.
I had spent some time troubleshooting the issue but felt it worthwhile to
ensure I had a clean platform, void of any potential issues which may be
there due to some of the previous work carried out. The platform was
rebuilt and data re-ingested. It is probably worth mentioning that this
environment will become our new production platform; we will be migrating
data and services to this new platform from our existing Gluster storage
cluster. The date for the migration activity is getting closer, so
available time has become an issue and will not permit another full rebuild
of the platform without impacting the delivery date.
After the rebuild, with both storage clusters online, available and managed
within the Ovirt platform, I conducted some basic commissioning checks and
found no issues. The next step I took at this point was to set up the
Geo-replication. This was brought online with no issues and data was seen
to be synchronised without any problems. At this point the data
re-ingestion was started and the new data was synchronised by the
Geo-replication.
The first step in bringing the snapshot schedule online was to validate
that snapshots could be taken outside of the scheduler. Taking a manual
snapshot via the Ovirt portal worked without issue. Several were taken on
both primary and secondary clusters. At this point a schedule was created
on the primary site cluster via the Ovirt portal to create a snapshot of
the storage at hourly intervals. The schedule was created successfully;
however, no snapshots were ever created. Examining the logs did not show
anything which I believed was a direct result of the faulty schedule, but it
is quite possible I missed something.
How was the schedule created - is this using the Remote Data Sync Setup
under Storage domain?
I reviewed many online articles, bug reports and application manuals in
relation to snapshotting. There were several loosely related support
articles around snapshotting but none of the recommendations seemed to
work. I did the same with manuals and again found nothing that seemed to
work. What I did find were several references to running snapshots along
with geo-replication, and that the geo-replication should be paused when
creating snapshots. So I removed all existing references to any snapshot
schedule, paused the Geo-repl and recreated the snapshot schedule. The
schedule was never actioned and no snapshots were created. I then removed
Geo-repl entirely, removed all schedules and carried out a reboot of the
entire platform. When the system was fully back online with no pending
heal operations, the schedule was re-added for the primary site only. There
was no difference in the results and no snapshots were created from the
schedule.
I have now reached the point where I feel I require assistance and hence
this email request.
If you require any further data then please let me know and I will do my
best to get it for you.
Could you please provide the engine.log from the time the schedule was
set up and including the time the schedule was supposed to run?
Any help you can give would be greatly appreciated.
Many thanks,
Mark Betham