[ovirt-users] Ovirt backups lead to unresponsive VM

Alex K rightkicktech at gmail.com
Tue Feb 13 18:59:16 UTC 2018


Thank you, Nir, for the reply below.

I am adding some comments inline.


On Tue, Feb 13, 2018 at 7:33 PM, Nir Soffer <nsoffer at redhat.com> wrote:

> On Wed, Jan 24, 2018 at 3:19 PM Alex K <rightkicktech at gmail.com> wrote:
>
>> Hi all,
>>
>> I have a cluster with 3 nodes, using oVirt 4.1 in a self-hosted setup on
>> top of glusterfs.
>> On some VMs (especially one Windows Server 2016 64-bit with 500 GB of
>> disk) I almost always observe that during the backup the VM is rendered
>> unresponsive (the dashboard shows a question mark at the VM status and the
>> VM does not respond to ping or to anything else). Guest agents are
>> installed on the VMs.
>>
>> For scheduled backups I use:
>>
>> https://github.com/wefixit-AT/oVirtBackup
>>
>> The script does the following:
>>
>> 1. Snapshot VM (this completes OK without any failure)
>>
>
> This is a very cheap operation
>
>
>> 2. Clone snapshot (this step renders the VM unresponsive)
>>
>
> This copies 500 GB of data. In the gluster case it copies 1500 GB
> (3 replicas x 500 GB), since with glusterfs the client does the
> replication.
>
> Maybe your network or gluster server is too slow? Can you describe the
> network topology?
>
> Please also attach the volume info for the gluster volume; maybe it is not
> configured in the best way.
>

The network is 1 Gbit. The hosts (3 hosts) are decent, new hardware, each
having 32 GB RAM, 16 CPU cores and 2 TB of storage in RAID10.
The hosted VMs (7 VMs) exhibit high performance. The VMs are Windows 2016
and Windows 10.
The network topology is: two networks are defined in oVirt. "ovirtmgmt" is
the management and access network, and "storage" is a separate network where
each server is connected with two network cables to a managed switch using
mode 6 load balancing. This storage network is used for gluster traffic.
The volume configuration is attached.
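
As a rough back-of-the-envelope check (assuming the clone gets the whole
nominal 1 Gbit/s storage link to itself and ignoring protocol overhead and
the second bonded cable), the ~1500 GB a clone pushes through the gluster
client needs hours of sustained traffic:

# Back-of-the-envelope only; assumes the full nominal 1 Gbit/s is available
# to the clone and ignores protocol overhead and the bonded second cable.
data_gb = 500 * 3            # 500 GB disk written to 3 replicas by the client
link_gbit_s = 1              # nominal speed of the "storage" network
seconds = data_gb * 8 / link_gbit_s
print(seconds / 3600)        # ~3.3 hours of sustained storage traffic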

> 3. Export Clone
>>
>
> This copies 500 GB to the export domain. If the export domain is on
> glusterfs as well, you now copy another 1500 GB of data.
>
>
The export domain is a Synology NAS with an NFS share. If the cloning
succeeds, then the export completes OK.

> 4. Delete clone
>>
>> 5. Delete snapshot
>>
>
> It is not clear why you need to clone the VM before you export it; you
> could save half of the data copies.
>
Because I cannot export the VM while it is running; oVirt does not provide
such an option.
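
For context, the clone-then-export flow that the script performs maps roughly
onto the ovirt-engine-sdk4 calls below. This is only a sketch under assumed
names (engine URL, credentials, VM, cluster and export-domain names are
placeholders), not code taken from oVirtBackup itself:

# Sketch of the snapshot -> clone -> export -> cleanup flow (steps 1-5 above).
# Untested; all names and credentials are placeholders.
import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

conn = sdk.Connection(url='https://engine.example.com/ovirt-engine/api',
                      username='admin@internal', password='secret',
                      ca_file='ca.pem')
vms_service = conn.system_service().vms_service()
vm = vms_service.list(search='name=win2016')[0]
vm_service = vms_service.vm_service(vm.id)

# 1. Snapshot the running VM.
snaps_service = vm_service.snapshots_service()
snap = snaps_service.add(types.Snapshot(description='backup',
                                        persist_memorystate=False))
snap_service = snaps_service.snapshot_service(snap.id)
while snap_service.get().snapshot_status != types.SnapshotStatus.OK:
    time.sleep(10)

# 2. Clone a new VM from that snapshot (the step that copies the 500 GB,
#    i.e. 1500 GB through the gluster client).
clone = vms_service.add(types.Vm(name='win2016-backup',
                                 snapshots=[types.Snapshot(id=snap.id)],
                                 cluster=types.Cluster(name='Default')))
clone_service = vms_service.vm_service(clone.id)
while clone_service.get().status != types.VmStatus.DOWN:
    time.sleep(30)   # simplified polling: wait until the disks are copied

# 3. Export the clone to the export storage domain (the Synology NFS share).
#    The export runs asynchronously; a real script waits for it to finish
#    before cleaning up.
clone_service.export(exclusive=True, discard_snapshots=True,
                     storage_domain=types.StorageDomain(name='export'))

# 4 + 5. Remove the clone and the snapshot.
clone_service.remove()
snap_service.remove()
conn.close()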

>
> If you are on 4.2, you can back up the VM *while the VM is running* by:
> - Taking a snapshot
> - Getting the VM OVF from the engine API
> - Downloading the VM disks using ovirt-imageio and storing the snapshots
>   in your backup storage
> - Deleting the snapshot
>
> In this flow, you would copy 500 GB.
>
I am not aware of this option. Checking quickly at the site, it seems that
it is still only half implemented? Is there any script that I may use to
test this? I am interested in having these backups scheduled.
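
A minimal sketch of the download step of that 4.2 flow, assuming the snapshot
has already been taken (as in the sketch earlier in this mail) and following
the pattern of the SDK's disk-snapshot download examples; the engine URL,
credentials, the data-domain name "vms" and the snapshot id are placeholders:

# Untested sketch: download the disks of an existing VM snapshot through
# ovirt-imageio image transfers. All names, credentials and IDs are
# placeholders.
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

conn = sdk.Connection(url='https://engine.example.com/ovirt-engine/api',
                      username='admin@internal', password='secret',
                      ca_file='ca.pem')
system_service = conn.system_service()
transfers_service = system_service.image_transfers_service()

snapshot_id = '...'   # the VM snapshot taken for this backup

# Find the disk snapshots that belong to that VM snapshot on the data domain.
sds_service = system_service.storage_domains_service()
sd = sds_service.list(search='name=vms')[0]
sd_service = sds_service.storage_domain_service(sd.id)

for disk_snap in sd_service.disk_snapshots_service().list():
    if disk_snap.snapshot is None or disk_snap.snapshot.id != snapshot_id:
        continue
    transfer = transfers_service.add(types.ImageTransfer(
        snapshot=types.DiskSnapshot(id=disk_snap.id),
        direction=types.ImageTransferDirection.DOWNLOAD,
    ))
    transfer_service = transfers_service.image_transfer_service(transfer.id)
    # ... wait for the transfer to reach the transferring phase, then GET
    #     transfer.transfer_url (or proxy_url) over HTTPS and write the image
    #     to the backup storage; the HTTP download loop is omitted here ...
    transfer_service.finalize()

conn.close()
# Afterwards, save the VM OVF from the engine API and delete the snapshot.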


> Daniel, please correct me if I'm wrong regarding doing this online.
>
> Regardless, a VM should not become unresponsive while cloning. Please file
> a bug for this and attach engine, vdsm, and glusterfs logs.
>
> Nir
>
> Do you have any similar experience? Any suggestions to address this?
>>
>> I have never seen such an issue with hosted Linux VMs.
>>
>> The cluster has enough storage to accommodate the clone.
>>
>>
>> Thanx,
>>
>> Alex
>>
>>
>>
-------------- next part --------------
Volume Name: vms
Type: Replicate
Volume ID: 00fee7f3-76e6-42b2-8f66-606b91df4a97
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: gluster2:/gluster/vms/brick
Brick2: gluster0:/gluster/vms/brick
Brick3: gluster1:/gluster/vms/brick
Options Reconfigured:
features.shard-block-size: 512MB
server.allow-insecure: on
performance.strict-o-direct: on
network.ping-timeout: 30
storage.owner-gid: 36
storage.owner-uid: 36
user.cifs: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: on
performance.low-prio-threads: 32
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
performance.readdir-ahead: off
nfs.disable: on
nfs.export-volumes: on
cluster.granular-entry-heal: enable
performance.cache-size: 1GB
server.event-threads: 4
client.event-threads: 4

