Disable "balancing" and authomatic migration

Hello, in my installation I have to use rather poor storage. The oVirt installation doesn't handle this case well: it starts "balancing" and moving VMs around, taking many snapshots and stressing the already poor performance until the whole cluster gets into a mess. Why don't the VMs simply go into the "Paused" state instead of the cluster migrating things around and messing everything up? This is a reference I found, and for the record I'm disabling auto-migration on every VM, hoping this helps: https://lists.ovirt.org/archives/list/users@ovirt.org/thread/24KQZFP2PCW462U...
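
(Side note for the record: that per-VM change can also be scripted; below is a minimal sketch with the oVirt Python SDK (ovirtsdk4), where the engine URL, credentials and CA file are placeholders. Setting the placement policy to user-migratable should correspond to "Allow manual migration only" in the UI.)

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholders: point these at your engine and an admin account.
connection = sdk.Connection(
    url='https://engine.example/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    ca_file='ca.pem',
)
vms_service = connection.system_service().vms_service()

# Switch every VM to "manual migration only" so the scheduler stops moving them.
for vm in vms_service.list():
    vms_service.vm_service(vm.id).update(
        types.Vm(
            placement_policy=types.VmPlacementPolicy(
                affinity=types.VmAffinity.USER_MIGRATABLE,
            ),
        ),
    )
connection.close()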

On Tue, Mar 28, 2023 at 11:50 AM Diego Ercolani <diego.ercolani@ssis.sm> wrote:
Hello, in my installation I have to use rather poor storage. The oVirt installation doesn't handle this case well: it starts "balancing" and moving VMs around, taking many snapshots and stressing the already poor performance until the whole cluster gets into a mess. Why don't the VMs simply go into the "Paused" state instead of the cluster migrating things around and messing everything up? This is a reference I found, and for the record I'm disabling auto-migration on every VM, hoping this helps:
https://lists.ovirt.org/archives/list/users@ovirt.org/thread/24KQZFP2PCW462U...
What is the current scheduling policy for the related oVirt cluster? And what event/error do you see in engine.log on the engine and in vdsm.log on the host that was previously running the VM when this happens?

Gianluca

The scheduling policy was "Suspend Workload if needed", with parallel migration disabled. The problem is that the Engine (hosted on an external NFS domain served by a Linux box with no other VMs on it) simply disappears. I have a single 10 Gbps Intel Ethernet link that carries the storage, management and "production" networks, but I don't record any bandwidth-limit issue.
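
(For reference, the cluster scheduling policy can be checked and changed through the same Python SDK; a minimal sketch with placeholder credentials follows, where the cluster name 'Default' is only an example. Switching to the built-in 'none' policy is one way to stop load balancing entirely.)

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    ca_file='ca.pem',
)
system = connection.system_service()

# Map scheduling-policy ids to names and show what each cluster uses now.
policies = {p.id: p.name for p in system.scheduling_policies_service().list()}
clusters_service = system.clusters_service()
for cluster in clusters_service.list():
    print(cluster.name, policies.get(cluster.scheduling_policy.id))

# Example: switch the cluster named 'Default' to the 'none' policy.
target = next(c for c in clusters_service.list() if c.name == 'Default')
none_id = next(pid for pid, name in policies.items() if name == 'none')
clusters_service.cluster_service(target.id).update(
    types.Cluster(scheduling_policy=types.SchedulingPolicy(id=none_id)),
)
connection.close()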

I record entries like this in the journal of every node:

Mar 28 10:26:58 ovirt-node3.ovirt sanlock[1660]: 2023-03-28 10:26:58 1191247 [4105511]: s9 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/glusterSD/ovirt-node3.ovirt:_gv0/4745320f-bfc3-46c4-8849-b4fe8f1b2de6/dom_md/ids
Mar 28 10:26:58 ovirt-node3.ovirt sanlock[1660]: 2023-03-28 10:26:58 1191247 [4105511]: s9 renewal error -202 delta_length 10 last_success 1191216
Mar 28 10:26:58 ovirt-node3.ovirt sanlock[1660]: 2023-03-28 10:26:58 1191247 [2750073]: s11 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/ovirt-nfsha.ovirt:_dati_drbd0/2527ed0f-e91a-4748-995c-e644362e8408/dom_md/ids
Mar 28 10:26:58 ovirt-node3.ovirt sanlock[1660]: 2023-03-28 10:26:58 1191247 [2750073]: s11 renewal error -202 delta_length 10 last_success 1191217

As you see, it's complaining about a gluster volume (hosting VMs and mapped on three nodes with the terrible SATA SSD: Samsung_SSD_870_EVO_4TB).
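
(A quick way to see how often the renewals time out, and against which storage path, is to summarize the sanlock journal entries; a minimal sketch to run on a node, assuming journalctl access and adjusting the time window as needed.)

import re
import subprocess
from collections import Counter

# Pull the last hour of sanlock messages from the systemd journal.
out = subprocess.run(
    ['journalctl', '-t', 'sanlock', '--since', '-1h', '--no-pager'],
    capture_output=True, text=True, check=True,
).stdout

# Count delta_renew timeouts per lease path to see which domain is stalling.
timeouts = Counter()
for line in out.splitlines():
    m = re.search(r'delta_renew read timeout \d+ sec offset \d+ (\S+)', line)
    if m:
        timeouts[m.group(1)] += 1

for path, count in timeouts.most_common():
    print(f'{count:5d}  {path}')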

On Tue, Mar 28, 2023 at 12:34 PM Diego Ercolani <diego.ercolani@ssis.sm> wrote:
I record entries like this in the journal of every node:

Mar 28 10:26:58 ovirt-node3.ovirt sanlock[1660]: 2023-03-28 10:26:58 1191247 [4105511]: s9 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/glusterSD/ovirt-node3.ovirt:_gv0/4745320f-bfc3-46c4-8849-b4fe8f1b2de6/dom_md/ids
Mar 28 10:26:58 ovirt-node3.ovirt sanlock[1660]: 2023-03-28 10:26:58 1191247 [4105511]: s9 renewal error -202 delta_length 10 last_success 1191216
Mar 28 10:26:58 ovirt-node3.ovirt sanlock[1660]: 2023-03-28 10:26:58 1191247 [2750073]: s11 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/ovirt-nfsha.ovirt:_dati_drbd0/2527ed0f-e91a-4748-995c-e644362e8408/dom_md/ids
Mar 28 10:26:58 ovirt-node3.ovirt sanlock[1660]: 2023-03-28 10:26:58 1191247 [2750073]: s11 renewal error -202 delta_length 10 last_success 1191217

As you see, it's complaining about a gluster volume (hosting VMs and mapped on three nodes with the terrible SATA SSD: Samsung_SSD_870_EVO_4TB).
And what do you see in the engine.log file on the engine, once it becomes reachable again?

It's difficult to answer, as the engine normally "freezes" or is taken down during these events... I will try to collect the logs next time.

New event:

Mar 28 14:37:32 ovirt-node3.ovirt vdsm[4288]: WARN executor state: count=5 workers={<Worker name=periodic/4 waiting task#=4 at 0x7fce0e9d6048>, <Worker name=periodic/5 waiting task#=0 at 0x7fcda8700860>, <Worker name=periodic/3 waiting task#=188 at 0x7fcdc0010470>, <Worker name=periodic/1 waiting task#=189 at 0x7fcdc00102b0>, <Worker name=periodic/2 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7fcdc007a048> at 0x7fcdc0010898> timeout=7.5, duration=7.50 at 0x7fcdc0010208> discarded task#=189 at 0x7fcdc0010390>}
Mar 28 14:37:32 ovirt-node3.ovirt sanlock[1662]: 2023-03-28 14:37:32 829 [7438]: s4 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/glusterSD/ovirt-node3.ovirt:_gv0/4745320f-bfc3-46c4-8849-b4fe8f1b2de6/dom_md/ids
Mar 28 14:37:32 ovirt-node3.ovirt sanlock[1662]: 2023-03-28 14:37:32 829 [7438]: s4 renewal error -202 delta_length 10 last_success 798
Mar 28 14:37:33 ovirt-node3.ovirt sanlock[1662]: 2023-03-28 14:37:33 830 [7660]: s6 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/ovirt-nfsha.ovirt:_dati_drbd0/2527ed0f-e91a-4748-995c-e644362e8408/dom_md/ids
Mar 28 14:37:33 ovirt-node3.ovirt sanlock[1662]: 2023-03-28 14:37:33 830 [7660]: s6 renewal error -202 delta_length 10 last_success 799
Mar 28 14:37:36 ovirt-node3.ovirt pacemaker-controld[3145]: notice: High CPU load detected: 32.590000
Mar 28 14:37:36 ovirt-node3.ovirt kernel: drbd drbd0/0 drbd1 ovirt-node2.ovirt: We did not send a P_BARRIER for 14436ms > ko-count (7) * timeout (10 * 0.1s); drbd kernel thread blocked?
Mar 28 14:37:41 ovirt-node3.ovirt libvirtd[2735]: Domain id=1 name='SSIS-microos' uuid=e41f8148-79ab-4a88-879f-894d5750e5fb is tainted: custom-ga-command
Mar 28 14:37:49 ovirt-node3.ovirt kernel: drbd drbd0/0 drbd1 ovirt-node2.ovirt: We did not send a P_BARRIER for 7313ms > ko-count (7) * timeout (10 * 0.1s); drbd kernel thread blocked?
Mar 28 14:37:56 ovirt-node3.ovirt kernel: drbd drbd0/0 drbd1 ovirt-node2.ovirt: We did not send a P_BARRIER for 14481ms > ko-count (7) * timeout (10 * 0.1s); drbd kernel thread blocked?
Mar 28 14:38:06 ovirt-node3.ovirt pacemaker-controld[3145]: notice: High CPU load detected: 33.500000
Mar 28 14:38:09 ovirt-node3.ovirt kernel: drbd drbd0/0 drbd1 ovirt-node2.ovirt: Remote failed to finish a request within 7010ms > ko-count (7) * timeout (10 * 0.1s)

2023-03-28 14:37:32,601Z INFO [org.ovirt.engine.core.bll.scheduling.policyunits.EvenGuestDistributionBalancePolicyUnit] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-73) [] There is no host with more than 10 running guests, no balancing is needed
2023-03-28 14:37:50,662Z INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-41) [] VM 'ccb06298-33a3-4b6f-bff3-d0bcd494b18d'(TpayX2GO) moved from 'Up' --> 'NotResponding'
2023-03-28 14:37:50,666Z WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-41) [] EVENT_ID: VM_NOT_RESPONDING(126), VM TpayX2GO is not responding.
2023-03-28 14:38:01,087Z WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyImpl] (EE-ManagedThreadFactory-engine-Thread-6602) [] domain '4745320f-bfc3-46c4-8849-b4fe8f1b2de6:gv0' in problem 'PROBLEMATIC'. vds: 'ovirt-node2.ovirt'
2023-03-28 14:38:05,676Z INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-20) [] VM 'ccb06298-33a3-4b6f-bff3-d0bcd494b18d'(TpayX2GO) moved from 'NotResponding' --> 'Up'
2023-03-28 14:38:16,107Z INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyImpl] (EE-ManagedThreadFactory-engine-Thread-6609) [] Domain '2527ed0f-e91a-4748-995c-e644362e8408:drbd0' recovered from problem. vds: 'ovirt-node2.ovirt'
2023-03-28 14:38:16,107Z INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyImpl] (EE-ManagedThreadFactory-engine-Thread-6609) [] Domain '4745320f-bfc3-46c4-8849-b4fe8f1b2de6:gv0' recovered from problem. vds: 'ovirt-node2.ovirt'
2023-03-28 14:38:16,107Z INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyImpl] (EE-ManagedThreadFactory-engine-Thread-6610) [] Domain '2527ed0f-e91a-4748-995c-e644362e8408:drbd0' recovered from problem. vds: 'ovirt-node4.ovirt'
2023-03-28 14:38:16,107Z INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyImpl] (EE-ManagedThreadFactory-engine-Thread-6610) [] Domain '4745320f-bfc3-46c4-8849-b4fe8f1b2de6:gv0' recovered from problem. vds: 'ovirt-node4.ovirt'
2023-03-28 14:38:16,327Z INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyImpl] (EE-ManagedThreadFactory-engine-Thread-6612) [] Domain '2527ed0f-e91a-4748-995c-e644362e8408:drbd0' recovered from problem. vds: 'ovirt-node3.ovirt'
2023-03-28 14:38:16,327Z INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyImpl] (EE-ManagedThreadFactory-engine-Thread-6612) [] Domain '2527ed0f-e91a-4748-995c-e644362e8408:drbd0' has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
2023-03-28 14:38:16,327Z INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyImpl] (EE-ManagedThreadFactory-engine-Thread-6612) [] Domain '4745320f-bfc3-46c4-8849-b4fe8f1b2de6:gv0' recovered from problem. vds: 'ovirt-node3.ovirt'
2023-03-28 14:38:16,327Z INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyImpl] (EE-ManagedThreadFactory-engine-Thread-6612) [] Domain '4745320f-bfc3-46c4-8849-b4fe8f1b2de6:gv0' has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
2023-03-28 14:38:18,602Z INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-44) [] VM 'e41f8148-79ab-4a88-879f-894d5750e5fb'(SSIS-microos) moved from 'Up' --> 'NotResponding'
2023-03-28 14:38:18,607Z WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-44) [] EVENT_ID: VM_NOT_RESPONDING(126), VM SSIS-microos is not responding.
2023-03-28 14:38:32,603Z INFO [org.ovirt.engine.core.bll.scheduling.policyunits.EvenGuestDistributionBalancePolicyUnit] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-32) [] There is no host with more than 10 running guests, no balancing is needed
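
(Since the web UI tends to be unreachable exactly when this happens, the engine audit log can also be pulled over the API afterwards; a minimal sketch with the Python SDK and placeholder credentials, keeping the VM_NOT_RESPONDING events, code 126 as seen above.)

import ovirtsdk4 as sdk

# Placeholders: point these at your engine and an admin account.
connection = sdk.Connection(
    url='https://engine.example/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    ca_file='ca.pem',
)

# Fetch recent audit-log events and keep the "VM is not responding" ones (code 126).
for event in connection.system_service().events_service().list(max=500):
    if event.code == 126:
        print(event.time, event.description)
connection.close()
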
Participants (2):
- Diego Ercolani
- Gianluca Cecchi