
Hi guys,

After updating to 4.3.1 I had an issue where the ovirt-ha-broker was complaining that it couldn't ping the gateway. As I have seen that before, I stopped ovirt-ha-agent, ovirt-ha-broker, vdsmd, supervdsmd and sanlock on the nodes and reinitialized the lockspace. I guess I didn't do it properly, as now I receive:

ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR Failed extracting VM OVF from the OVF_STORE volume, falling back to initial vm.conf

Any hints how to fix this? Of course a redeploy is possible, but I prefer to recover from that.

Best Regards,
Strahil Nikolov

On Wed, Mar 6, 2019 at 6:13 AM Strahil <hunter86_bg@yahoo.com> wrote:
Hi guys,
After updating to 4.3.1 I had an issue where the ovirt-ha-broker was complaining that it couldn't ping the gateway.
Are you really sure that the issue was on the ping? On storage errors the broker restarts itself, and while the broker is restarting the agent cannot ask the broker to trigger the gateway monitor (the ping one), hence that error message.
As I have seen that before - I stopped ovirt-ha-agent, ovirt-ha-broker, vdsmd, supervdsmd and sanlock on the nodes and reinitialized the lockspace.
I guess I didn't do it properly, as now I receive:
ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR Failed extracting VM OVF from the OVF_STORE volume, falling back to initial vm.conf
Any hints how to fix this? Of course a redeploy is possible, but I prefer to recover from that.
Which kind of storage are you using? Can you please attach /var/log/ovirt-hosted-engine-ha/broker.log?
Best Regards, Strahil Nikolov


On Wed, Mar 6, 2019 at 3:09 PM Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
Hi Simone,
thanks for your reply.
Are you really sure that the issue was on the ping? On storage errors the broker restarts itself, and while the broker is restarting the agent cannot ask the broker to trigger the gateway monitor (the ping one), hence that error message.
It seemed so in that moment, but I'm not so sure right now :)
Which kind of storage are you using? Can you please attach /var/log/ovirt-hosted-engine-ha/broker.log?
I'm using GlusterFS v5 from oVirt 4.3.1 with a FUSE mount. Please have a look at the attached logs.
Nothing seems that strange there but that error. Can you please try with ovirt-ha-agent and ovirt-ha-broker in debug mode? You have to set level=DEBUG in the [logger_root] section of /etc/ovirt-hosted-engine-ha/agent-log.conf and /etc/ovirt-hosted-engine-ha/broker-log.conf and restart the two services.
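A minimal sketch of that change (the sed invocation and the restart order are illustrative assumptions, not taken from this thread; adjust to your own config layout):

# switch both HA daemons to DEBUG logging in the [logger_root] section
sed -i '/^\[logger_root\]/,/^\[/ s/^level=.*/level=DEBUG/' \
    /etc/ovirt-hosted-engine-ha/agent-log.conf \
    /etc/ovirt-hosted-engine-ha/broker-log.conf
# restart the two services so the new log level is picked up
systemctl restart ovirt-ha-broker ovirt-ha-agent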
Best Regards, Strahil Nikolov
On Wednesday, March 6, 2019, 09:53:20 GMT+2, Simone Tiraboschi <stirabos@redhat.com> wrote:
On Wed, Mar 6, 2019 at 6:13 AM Strahil <hunter86_bg@yahoo.com> wrote:
Hi guys,
After updating to 4.3.1 I had an issue where the ovirt-ha-broker was complaining that it couldn't ping the gateway.
Are you really sure that the issue was on the ping? On storage errors the broker restarts itself, and while the broker is restarting the agent cannot ask the broker to trigger the gateway monitor (the ping one), hence that error message.
As I have seen that before - I stopped ovirt-ha-agent, ovirt-ha-broker, vdsmd, supervdsmd and sanlock on the nodes and reinitialized the lockspace.
I guess I didn't do it properly, as now I receive:
ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR Failed extracting VM OVF from the OVF_STORE volume, falling back to initial vm.conf
Any hints how to fix this? Of course a redeploy is possible, but I prefer to recover from that.
Which kind of storage are you using? Can you please attach /var/log/ovirt-hosted-engine-ha/broker.log?
Best Regards, Strahil Nikolov


On Thu, Mar 7, 2019 at 9:19 AM Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
Hi Simone,
I think I found the problem - ovirt-ha cannot extract the file containing the needed data. In my case it is completely empty:
[root@ovirt1 ~]# ll /rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/94ade632-6ecc-4901-8cec-8e39f3d69cb0
total 66561
-rw-rw----. 1 vdsm kvm       0 Mar  4 05:21 9460fc4b-54f3-48e3-b7b6-da962321ecf4
-rw-rw----. 1 vdsm kvm 1048576 Jan 31 13:24 9460fc4b-54f3-48e3-b7b6-da962321ecf4.lease
-rw-r--r--. 1 vdsm kvm     435 Mar  4 05:22 9460fc4b-54f3-48e3-b7b6-da962321ecf4.meta
Any hint how to recreate that? Maybe wipe and restart the ovirt-ha-broker and agent?
The OVF_STORE volume is going to get periodically recreated by the engine, so at least you need a running engine. In order to avoid this kind of issue we have two OVF_STORE disks, in your case:

MainThread::INFO::2019-03-06 06:50:02,391::ovf_store::120::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(scan) Found OVF_STORE: imgUUID:441abdc8-6cb1-49a4-903f-a1ec0ed88429, volUUID:c3309fc0-8707-4de1-903d-8d4bbb024f81
MainThread::INFO::2019-03-06 06:50:02,748::ovf_store::120::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(scan) Found OVF_STORE: imgUUID:94ade632-6ecc-4901-8cec-8e39f3d69cb0, volUUID:9460fc4b-54f3-48e3-b7b6-da962321ecf4

Can you please check if you have at least the second copy?

And even in case you lost both, we are storing the initial vm.conf on the shared storage:

MainThread::ERROR::2019-03-06 06:50:02,971::config_ovf::70::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm::(_get_vm_conf_content_from_ovf_store) Failed extracting VM OVF from the OVF_STORE volume, falling back to initial vm.conf

Can you please check what you have in /var/run/ovirt-hosted-engine-ha/vm.conf ?
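For reference, a healthy OVF_STORE volume is essentially a tar archive holding the VMs' OVF files, and on file-based storage (like this Gluster domain) the volume is a plain file, so it can be inspected in place; a sketch using the image path from the listing above, purely as an illustration:

# list the OVF files stored inside the (second) OVF_STORE volume
tar -tvf /rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/94ade632-6ecc-4901-8cec-8e39f3d69cb0/9460fc4b-54f3-48e3-b7b6-da962321ecf4
# a 0-byte volume, as in the listing above, obviously has nothing to list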
Also, I think this happened when I was upgrading ovirt1 (the last one in the gluster cluster) from 4.3.0 to 4.3.1. The engine got restarted, because I forgot to enable global maintenance.
Sorry, I don't understand. Can you please explain what happened?
Best Regards, Strahil Nikolov
On Wednesday, March 6, 2019, 16:57:30 GMT+2, Simone Tiraboschi <stirabos@redhat.com> wrote:
On Wed, Mar 6, 2019 at 3:09 PM Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
Hi Simone,
thanks for your reply.
Are you really sure that the issue was on the ping? On storage errors the broker restarts itself, and while the broker is restarting the agent cannot ask the broker to trigger the gateway monitor (the ping one), hence that error message.
It seemed so in that moment, but I'm not so sure right now :)
Which kind of storage are you using? Can you please attach /var/log/ovirt-hosted-engine-ha/broker.log?
I'm using GlusterFS v5 from oVirt 4.3.1 with a FUSE mount. Please have a look at the attached logs.
Nothing seems that strange there but that error. Can you please try with ovirt-ha-agent and ovirt-ha-broker in debug mode? You have to set level=DEBUG in the [logger_root] section of /etc/ovirt-hosted-engine-ha/agent-log.conf and /etc/ovirt-hosted-engine-ha/broker-log.conf and restart the two services.
Best Regards, Strahil Nikolov
On Wednesday, March 6, 2019, 09:53:20 GMT+2, Simone Tiraboschi <stirabos@redhat.com> wrote:
On Wed, Mar 6, 2019 at 6:13 AM Strahil <hunter86_bg@yahoo.com> wrote:
Hi guys,
After updating to 4.3.1 I had an issue where the ovirt-ha-broker was complaining that it couldn't ping the gateway.
Are you really sure that the issue was on the ping? On storage errors the broker restarts itself, and while the broker is restarting the agent cannot ask the broker to trigger the gateway monitor (the ping one), hence that error message.
As I have seen that before - I stopped ovirt-ha-agent, ovirt-ha-broker, vdsmd, supervdsmd and sanlock on the nodes and reinitialized the lockspace.
I guess I didn't do it properly, as now I receive:
ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR Failed extracting VM OVF from the OVF_STORE volume, falling back to initial vm.conf
Any hints how to fix this? Of course a redeploy is possible, but I prefer to recover from that.
Which kind of storage are you using? Can you please attach /var/log/ovirt-hosted-engine-ha/broker.log?
Best Regards, Strahil Nikolov


On Thu, Mar 7, 2019 at 2:54 PM Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
The OVF_STORE volume is going to get periodically recreated by the engine so at least you need a running engine.
In order to avoid this kind of issue we have two OVF_STORE disks, in your case:
MainThread::INFO::2019-03-06 06:50:02,391::ovf_store::120::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(scan) Found OVF_STORE: imgUUID:441abdc8-6cb1-49a4-903f-a1ec0ed88429, volUUID:c3309fc0-8707-4de1-903d-8d4bbb024f81
MainThread::INFO::2019-03-06 06:50:02,748::ovf_store::120::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(scan) Found OVF_STORE: imgUUID:94ade632-6ecc-4901-8cec-8e39f3d69cb0, volUUID:9460fc4b-54f3-48e3-b7b6-da962321ecf4
Can you please check if you have at least the second copy?
Second copy is empty too:
[root@ovirt1 ~]# ll /rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/441abdc8-6cb1-49a4-903f-a1ec0ed88429
total 66561
-rw-rw----. 1 vdsm kvm       0 Mar  4 05:23 c3309fc0-8707-4de1-903d-8d4bbb024f81
-rw-rw----. 1 vdsm kvm 1048576 Jan 31 13:24 c3309fc0-8707-4de1-903d-8d4bbb024f81.lease
-rw-r--r--. 1 vdsm kvm     435 Mar  4 05:24 c3309fc0-8707-4de1-903d-8d4bbb024f81.meta
And even in the case you lost both, we are storing on the shared storage the initial vm.conf:
MainThread::ERROR::2019-03-06 06:50:02,971::config_ovf::70::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm::(_get_vm_conf_content_from_ovf_store) Failed extracting VM OVF from the OVF_STORE volume, falling back to initial vm.conf
Can you please check what do you have in /var/run/ovirt-hosted-engine-ha/vm.conf ?
It exists and has the following:
[root@ovirt1 ~]# cat /var/run/ovirt-hosted-engine-ha/vm.conf
# Editing the hosted engine VM is only possible via the manager UI\API
# This file was generated at Thu Mar 7 15:37:26 2019
vmId=8474ae07-f172-4a20-b516-375c73903df7
memSize=4096
display=vnc
devices={index:2,iface:ide,address:{ controller:0, target:0,unit:0, bus:1, type:drive},specParams:{},readonly:true,deviceId:,path:,device:cdrom,shared:false,type:disk}
devices={index:0,iface:virtio,format:raw,poolID:00000000-0000-0000-0000-000000000000,volumeID:a9ab832f-c4f2-4b9b-9d99-6393fd995979,imageID:8ec7a465-151e-4ac3-92a7-965ecf854501,specParams:{},readonly:false,domainID:808423f9-8a5c-40cd-bc9f-2568c85b8c74,optional:false,deviceId:a9ab832f-c4f2-4b9b-9d99-6393fd995979,address:{bus:0x00, slot:0x06, domain:0x0000, type:pci, function:0x0},device:disk,shared:exclusive,propagateErrors:off,type:disk,bootOrder:1}
devices={device:scsi,model:virtio-scsi,type:controller}
devices={nicModel:pv,macAddr:00:16:3e:62:72:c8,linkActive:true,network:ovirtmgmt,specParams:{},deviceId:,address:{bus:0x00, slot:0x03, domain:0x0000, type:pci, function:0x0},device:bridge,type:interface}
devices={device:console,type:console}
devices={device:vga,alias:video0,type:video}
devices={device:vnc,type:graphics}
vmName=HostedEngine
spiceSecureChannels=smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir
smp=1
maxVCpus=8
cpuType=Opteron_G5
emulatedMachine=emulated_machine_list.json['values']['system_option_value'][0]['value'].replace('[','').replace(']','').split(', ')|first
devices={device:virtio,specParams:{source:urandom},model:virtio,type:rng}
You should be able to copy it to e.g. /root/myvm.conf and start the engine VM with hosted-engine --vm-start --vm-conf=/root/myvm.conf
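Concretely, that would be something along these lines (the file name is just an example):

cp /var/run/ovirt-hosted-engine-ha/vm.conf /root/myvm.conf
hosted-engine --vm-start --vm-conf=/root/myvm.conf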
Also, I think this happened when I was upgrading ovirt1 (last in the gluster cluster) from 4.3.0 to 4.3.1 . The engine got restarted , because I forgot to enable the global maintenance.
Sorry, I don't understand. Can you please explain what happened?
I have updated the engine first -> all OK; next was the arbiter -> again no issues with it. Next was the empty host -> ovirt2, and everything went OK. After that I migrated the engine to ovirt2 and tried to update ovirt1. The web UI showed that the installation failed, but using "yum update" was working. During the update of ovirt1 via yum, the engine app crashed and restarted on ovirt2. After the reboot of ovirt1 I noticed the error about pinging the gateway, thus I stopped the engine and stopped the following services on both hosts (global maintenance): ovirt-ha-agent ovirt-ha-broker vdsmd supervdsmd sanlock
Next was a reinitialization of the sanlock space via 'sanlock direct -s'. In the end I managed to power on the hosted-engine and it was running for a while.
As the errors did not stop, I decided to shut everything down, power it up again, heal gluster and check what would happen.
Currently I'm not able to power up the engine:
[root@ovirt1 ovirt-hosted-engine-ha]# hosted-engine --vm-status
!! Cluster is in GLOBAL MAINTENANCE mode !!
Please notice that in global maintenance mode nothing will try to start the engine VM for you. I assume you tried to exit global maintenance mode or at least you tried to manually start it with hosted-engine --vm-start, right?
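For reference, global maintenance is toggled from any HA host with the set-maintenance command:

# leave global maintenance so the HA agents are allowed to start the engine VM again
hosted-engine --set-maintenance --mode=none
# (use --mode=global to re-enter it before maintenance work)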
--== Host ovirt1.localdomain (id: 1) status ==--
conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : ovirt1.localdomain
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 45e6772b
local_conf_timestamp               : 288
Host timestamp                     : 287
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=287 (Thu Mar 7 15:34:06 2019)
        host-id=1
        score=3400
        vm_conf_refresh_time=288 (Thu Mar 7 15:34:07 2019)
        conf_on_shared_storage=True
        maintenance=False
        state=GlobalMaintenance
        stopped=False
--== Host ovirt2.localdomain (id: 2) status ==--
conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : ovirt2.localdomain
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 2e9a0444
local_conf_timestamp               : 3886
Host timestamp                     : 3885
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=3885 (Thu Mar 7 15:34:05 2019)
        host-id=2
        score=3400
        vm_conf_refresh_time=3886 (Thu Mar 7 15:34:06 2019)
        conf_on_shared_storage=True
        maintenance=False
        state=GlobalMaintenance
        stopped=False
!! Cluster is in GLOBAL MAINTENANCE mode !!
[root@ovirt1 ovirt-hosted-engine-ha]# hosted-engine --vm-start
Command VM.getStats with args {'vmID': '8474ae07-f172-4a20-b516-375c73903df7'} failed:
(code=1, message=Virtual machine does not exist: {'vmId': u'8474ae07-f172-4a20-b516-375c73903df7'})
[root@ovirt1 ovirt-hosted-engine-ha]# hosted-engine --vm-start
VM exists and is down, cleaning up and restarting
[root@ovirt1 ovirt-hosted-engine-ha]# hosted-engine --vm-status
!! Cluster is in GLOBAL MAINTENANCE mode !!
--== Host ovirt1.localdomain (id: 1) status ==--
conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : ovirt1.localdomain
Host ID                            : 1
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "Down"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 6b086b7c
local_conf_timestamp               : 328
Host timestamp                     : 327
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=327 (Thu Mar 7 15:34:46 2019)
        host-id=1
        score=3400
        vm_conf_refresh_time=328 (Thu Mar 7 15:34:47 2019)
        conf_on_shared_storage=True
        maintenance=False
        state=GlobalMaintenance
        stopped=False
--== Host ovirt2.localdomain (id: 2) status ==--
conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : ovirt2.localdomain
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : c5890e9c
local_conf_timestamp               : 3926
Host timestamp                     : 3925
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=3925 (Thu Mar 7 15:34:45 2019)
        host-id=2
        score=3400
        vm_conf_refresh_time=3926 (Thu Mar 7 15:34:45 2019)
        conf_on_shared_storage=True
        maintenance=False
        state=GlobalMaintenance
        stopped=False
!! Cluster is in GLOBAL MAINTENANCE mode !!
[root@ovirt1 ovirt-hosted-engine-ha]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     HostedEngine                   shut off
I am really puzzled why both volumes are wiped out.
This is really scary: can you please double check the gluster logs for warnings and errors?
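A quick way to pull only the warnings and errors out of the gluster logs as a starting point (the mount log file name depends on the mount point, so treat the second path as an example):

grep -E '\] (E|W) \[' /var/log/glusterfs/glusterd.log
grep -E '\] (E|W) \[' '/var/log/glusterfs/rhev-data-center-mnt-glusterSD-ovirt1.localdomain:_engine.log'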
Best Regards, Strahil Nikolov


On Fri, Mar 8, 2019 at 12:49 PM Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
Hi Simone,
sadly it seems that starting the engine from an alternative config is not working. Virsh reports that the VM is defined but shut down, and the dumpxml doesn't show any disks - maybe this is normal for oVirt (I have never checked a running VM).
No, it's not:

devices={index:0,iface:virtio,format:raw,poolID:00000000-0000-0000-0000-000000000000,volumeID:a9ab832f-c4f2-4b9b-9d99-6393fd995979,imageID:8ec7a465-151e-4ac3-92a7-965ecf854501,specParams:{},readonly:false,domainID:808423f9-8a5c-40cd-bc9f-2568c85b8c74,optional:false,deviceId:a9ab832f-c4f2-4b9b-9d99-6393fd995979,address:{bus:0x00, slot:0x06, domain:0x0000, type:pci, function:0x0},device:disk,shared:exclusive,propagateErrors:off,type:disk,bootOrder:1}

The disk was definitely there in vm.conf, but then the VM ignores it. I'd suggest double-checking vdsm.log for errors or something like that.
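A quick scan along these lines is usually enough as a first pass (just an illustrative grep, not a specific procedure from this thread):

# show the most recent errors/tracebacks from vdsm
grep -nE 'ERROR|Traceback' /var/log/vdsm/vdsm.log | tail -n 40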
Strangely, both OVFs have been wiped out at almost the same time.
I'm attaching some console output and gluster logs. In the gluster stuff I can see, in glusterd.log:
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 47 times between [2019-03-04 05: 17:15.810686] and [2019-03-04 05:19:08.576724] [2019-03-04 05:19:16.147795] I [MSGID: 106488] [glusterd-handler.c:1558:__glusterd_handle_cli_get_volume] 0-management: Received get vol req [2019-03-04 05:19:16.149524] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler [2019-03-04 05:19:42.728693] E [MSGID: 106419] [glusterd-utils.c:6943:glusterd_add_inode_size_to_dict] 0-management: could not find (null) to getinode size f or systemd-1 (autofs): (null) package missing? [2019-03-04 05:20:54.236659] I [MSGID: 106499] [glusterd-handler.c:4389:__glusterd_handle_status_volume] 0-management: Received status volume req for volume data [2019-03-04 05:20:54.245844] I [MSGID: 106499] [glusterd-handler.c:4389:__glusterd_handle_status_volume] 0-management: Received status volume req for volume engine
and the log of the mountpoint:
[2019-03-04 05:19:35.381378] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
[2019-03-04 05:19:37.294931] I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 0-engine-replicate-0: selecting local read_child engine-client-0
The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 0-engine-replicate-0: selecting local read_child engine-client-0" repeated 7 times between [2019-03-04 05:19:37.294931] and [2019-03-04 05:21:26.171701]
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 865 times between [2019-03-04 05:19:35.381378] and [2019-03-04 05:21:26.233004]
[2019-03-04 05:21:35.699082] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler
[2019-03-04 05:21:38.671811] I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 0-engine-replicate-0: selecting local read_child engine-client-0
The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 0-engine-replicate-0: selecting local read_child engine-client-0" repeated 7 times between [2019-03-04 05:21:38.671811] and [2019-03-04 05:23:31.654205]
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 889 times between [2019-03-04 05:21:35.699082] and [2019-03-04 05:23:32.613797]
Adding also Sahina here.
Best Regards, Strahil Nikolov

Hi Simone, and thanks for your help.

So far I found out that there is some problem with the local copy of the HostedEngine config (see the attached part of vdsm.log). I have found an older xml configuration (in an old vdsm.log) and defining the VM works, but powering it on reports:

[root@ovirt1 ~]# virsh define hosted-engine.xml
Domain HostedEngine defined from hosted-engine.xml

[root@ovirt1 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     HostedEngine                   shut off

[root@ovirt1 ~]# virsh start HostedEngine
error: Failed to start domain HostedEngine
error: Network not found: no network with matching name 'vdsm-ovirtmgmt'

[root@ovirt1 ~]# virsh net-list --all
 Name                 State      Autostart     Persistent
----------------------------------------------------------
 ;vdsmdummy;          active     no            no
 default              inactive   no            yes

[root@ovirt1 ~]# brctl show
bridge name     bridge id               STP enabled     interfaces
;vdsmdummy;     8000.000000000000       no
ovirtmgmt       8000.bc5ff467f5b3       no              enp2s0

[root@ovirt1 ~]# ip a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq master ovirtmgmt state UP group default qlen 1000
    link/ether bc:5f:f4:67:f5:b3 brd ff:ff:ff:ff:ff:ff
3: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether f6:78:c7:2d:32:f9 brd ff:ff:ff:ff:ff:ff
4: br-int: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 66:36:dd:63:dc:48 brd ff:ff:ff:ff:ff:ff
20: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether bc:5f:f4:67:f5:b3 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.90/24 brd 192.168.1.255 scope global ovirtmgmt
       valid_lft forever preferred_lft forever
    inet 192.168.1.243/24 brd 192.168.1.255 scope global secondary ovirtmgmt
       valid_lft forever preferred_lft forever
    inet6 fe80::be5f:f4ff:fe67:f5b3/64 scope link
       valid_lft forever preferred_lft forever
21: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether ce:36:8d:b7:64:bd brd ff:ff:ff:ff:ff:ff

192.168.1.243/24 is one of the IPs in ctdb.

So, now comes the question - is there an xml in the logs that defines the network? My hope is to power up the HostedEngine properly and hope that it will push all the configurations to the right places... maybe this is way too optimistic. At least I have learned a lot about oVirt.

Best Regards,
Strahil Nikolov
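For reference, such a missing libvirt network can be recreated from a small XML definition and loaded with virsh; a minimal sketch, assuming the management bridge is really named ovirtmgmt as in the brctl output above:

cat > vdsm-ovirtmgmt.xml <<'EOF'
<network>
  <name>vdsm-ovirtmgmt</name>
  <forward mode='bridge'/>
  <bridge name='ovirtmgmt'/>
</network>
EOF
virsh net-define vdsm-ovirtmgmt.xml
virsh net-start vdsm-ovirtmgmt
virsh net-autostart vdsm-ovirtmgmt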



On Tue, Mar 12, 2019 at 9:48 AM Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
Latest update - the system is back and running normally. After a day (or maybe a little more), the OVF is OK:
Normally it should try again every 60 minutes. Can you please execute engine-config -g OvfUpdateIntervalInMinutes on your engine VM and check the result? It should be 60 minutes by default.
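For the record, the same engine-config tool can also change the interval if needed (a sketch; the engine service has to be restarted for a new value to take effect):

engine-config -g OvfUpdateIntervalInMinutes
engine-config -s OvfUpdateIntervalInMinutes=60
systemctl restart ovirt-engine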
[root@ovirt1 ~]# ls -l /rhev/data-center/mnt/glusterSD/ovirt1.localdomain\:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/{441abdc8-6cb1-49a4-903f-a1ec0ed88429,94ade632-6ecc-4901-8cec-8e39f3d69cb0}
/rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/441abdc8-6cb1-49a4-903f-a1ec0ed88429:
total 66591
-rw-rw----. 1 vdsm kvm   30720 Mar 12 08:06 c3309fc0-8707-4de1-903d-8d4bbb024f81
-rw-rw----. 1 vdsm kvm 1048576 Jan 31 13:24 c3309fc0-8707-4de1-903d-8d4bbb024f81.lease
-rw-r--r--. 1 vdsm kvm     435 Mar 12 08:06 c3309fc0-8707-4de1-903d-8d4bbb024f81.meta
/rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/94ade632-6ecc-4901-8cec-8e39f3d69cb0:
total 66591
-rw-rw----. 1 vdsm kvm   30720 Mar 12 08:06 9460fc4b-54f3-48e3-b7b6-da962321ecf4
-rw-rw----. 1 vdsm kvm 1048576 Jan 31 13:24 9460fc4b-54f3-48e3-b7b6-da962321ecf4.lease
-rw-r--r--. 1 vdsm kvm     435 Mar 12 08:06 9460fc4b-54f3-48e3-b7b6-da962321ecf4.meta
Once it got fixed, I managed to start the hosted-engine properly (I have rebooted the whole cluster just to be on the safe side):
[root@ovirt1 ~]# hosted-engine --vm-status
--== Host ovirt1.localdomain (id: 1) status ==--
conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : ovirt1.localdomain
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "Up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 8ec26591
local_conf_timestamp               : 49704
Host timestamp                     : 49704
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=49704 (Tue Mar 12 10:47:43 2019)
        host-id=1
        score=3400
        vm_conf_refresh_time=49704 (Tue Mar 12 10:47:43 2019)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUp
        stopped=False
--== Host ovirt2.localdomain (id: 2) status ==--
conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : ovirt2.localdomain
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : f9f39dcd
local_conf_timestamp               : 14458
Host timestamp                     : 14458
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=14458 (Tue Mar 12 10:47:41 2019)
        host-id=2
        score=3400
        vm_conf_refresh_time=14458 (Tue Mar 12 10:47:41 2019)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineDown
        stopped=False
Best Regards, Strahil Nikolov
On Sunday, March 10, 2019, 5:05:33 AM GMT+2, Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
Hello again,
Latest update: the engine is up and running (or at least the login portal).
[root@ovirt1 ~]# hosted-engine --check-liveliness
Hosted Engine is up!
I have found online the xml for the network:
[root@ovirt1 ~]# cat ovirtmgmt_net.xml
<network>
  <name>vdsm-ovirtmgmt</name>
  <forward mode='bridge'/>
  <bridge name='ovirtmgmt'/>
</network>
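If the vdsm-ovirtmgmt definition is missing on a host, it can be re-created from that xml; a minimal sketch (it assumes a working libvirt connection, e.g. the authenticated virsh alias shown later in this thread):

virsh net-define ovirtmgmt_net.xml
virsh net-start vdsm-ovirtmgmt    # only if the network is defined but not yet active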
Sadly, I had to create a symbolic link to the main disk in /var/run/vdsm/storage, as it was missing.
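For anyone hitting the same thing, a sketch of re-creating such a link; the DOMAIN-UUID and IMAGE-UUID below are placeholders, the real ones are visible under the storage domain mount (see the /var/run/vdsm/storage listing later in this thread):

mkdir -p /var/run/vdsm/storage/<DOMAIN-UUID>
ln -s /rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/<DOMAIN-UUID>/images/<IMAGE-UUID> \
      /var/run/vdsm/storage/<DOMAIN-UUID>/<IMAGE-UUID>
chown -h vdsm:kvm /var/run/vdsm/storage/<DOMAIN-UUID>/<IMAGE-UUID>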
So, what's next.
Issues up to now:
- 2 OVF - 0 bytes
- Problem with local copy of the HostedEngine config - used xml from an old vdsm log
- Missing vdsm-ovirtmgmt definition
- No link for the main raw disk in /var/run/vdsm/storage
Can you give me a hint on how to recover the 2 OVF tars now?
Best Regards, Strahil Nikolov




On Fri, Mar 15, 2019 at 8:12 AM Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
Ok,
I have managed to recover again and no issues are detected this time. I guess this case is quite rare and nobody has experienced that.
Hi, can you please explain how you fixed it?
Best Regards, Strahil Nikolov
On Wednesday, March 13, 2019, 1:03:38 PM GMT+2, Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
Dear Simone,
it seems that there is some kind of problem, as the OVF got updated with the wrong configuration:
[root@ovirt2 ~]# ls -l /rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/{441abdc8-6cb1-49a4-903f-a1ec0ed88429,94ade632-6ecc-4901-8cec-8e39f3d69cb0}
/rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/441abdc8-6cb1-49a4-903f-a1ec0ed88429:
total 66591
-rw-rw----. 1 vdsm kvm   30720 Mar 12 08:06 c3309fc0-8707-4de1-903d-8d4bbb024f81
-rw-rw----. 1 vdsm kvm 1048576 Jan 31 13:24 c3309fc0-8707-4de1-903d-8d4bbb024f81.lease
-rw-r--r--. 1 vdsm kvm     435 Mar 12 08:06 c3309fc0-8707-4de1-903d-8d4bbb024f81.meta
/rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/94ade632-6ecc-4901-8cec-8e39f3d69cb0:
total 66591
-rw-rw----. 1 vdsm kvm   30720 Mar 13 11:07 9460fc4b-54f3-48e3-b7b6-da962321ecf4
-rw-rw----. 1 vdsm kvm 1048576 Jan 31 13:24 9460fc4b-54f3-48e3-b7b6-da962321ecf4.lease
-rw-r--r--. 1 vdsm kvm     435 Mar 13 11:07 9460fc4b-54f3-48e3-b7b6-da962321ecf4.meta
Starting the hosted-engine fails with:
2019-03-13 12:48:21,237+0200 ERROR (vm/8474ae07) [virt.vm] (vmId='8474ae07-f172-4a20-b516-375c73903df7') The vm start process failed (vm:937)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 866, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2852, in _run
    dom = self._connection.defineXML(self._domain.xml)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3743, in defineXML
    if ret is None:raise libvirtError('virDomainDefineXML() failed', conn=self)
libvirtError: XML error: No PCI buses available
Best Regards, Strahil Nikolov
On Tuesday, March 12, 2019, 2:14:26 PM GMT+2, Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
Dear Simone,
it should be 60 min, but I checked several hours after that and it hadn't updated.
[root@engine ~]# engine-config -g OvfUpdateIntervalInMinutes
OvfUpdateIntervalInMinutes: 60 version: general
How can I make a backup of the VM config, given that (as you have noticed) the local copy in /var/run/ovirt-hosted-engine-ha/vm.conf won't work?
I will keep the HostedEngine's xml, so I can redefine it if needed.
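For the record, one way to keep such a copy while the VM is running (a sketch; it uses the virsh connection and auth file that appear in the debugging checklist later in this thread):

virsh -c qemu:///system?authfile=/etc/ovirt-hosted-engine/virsh_auth.conf dumpxml HostedEngine > /root/HostedEngine.xml
virsh -c qemu:///system?authfile=/etc/ovirt-hosted-engine/virsh_auth.conf net-dumpxml vdsm-ovirtmgmt > /root/vdsm-ovirtmgmt.xml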
Best Regards, Strahil Nikolov

On Fri, Mar 15, 2019 at 8:12 AM Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
Ok, I have managed to recover again and no issues are detected this time. I guess this case is quite rare and nobody has experienced that.
> Hi,
> can you please explain how you fixed it?

I set global maintenance again, defined the HostedEngine from the old xml (taken from an old vdsm log), defined the network, and powered it off. I set the OVF update period to 5 min, but it took several hours until the OVF_STORE volumes were updated. Once this happened, I restarted ovirt-ha-agent and ovirt-ha-broker on both nodes. Then I powered off the HostedEngine and undefined it from ovirt1.
Then I set the maintenance to 'none' and the VM started on ovirt1. In order to test a failure, I removed the global maintenance and powered off the HostedEngine from inside itself (via ssh). It was brought back up on the other node. In order to test a failure of ovirt2, I set ovirt1 in local maintenance and then removed it (mode 'none'), and again shut down the VM via ssh; it started again on ovirt1. It seems to be working, as I have since shut down the Engine several times and it managed to start without issues. I'm not sure this is related, but I had detected that ovirt2 was out of sync on the vdsm-ovirtmgmt network, but that got fixed easily via the UI.

Best Regards,
Strahil Nikolov
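Condensed into commands, the sequence described above would look roughly like this; this is only a sketch, not a verified recipe, and it assumes the HostedEngine and network xml were saved as /root/HostedEngine.xml and /root/vdsm-ovirtmgmt.xml and that virsh is the authenticated alias from the checklist below:

hosted-engine --set-maintenance --mode=global      # on any node
virsh define /root/HostedEngine.xml                 # on the node chosen to run the engine
virsh net-define /root/vdsm-ovirtmgmt.xml
virsh start HostedEngine
# on the engine VM, shorten the OVF_STORE refresh interval (default is 60):
engine-config -s OvfUpdateIntervalInMinutes=5
systemctl restart ovirt-engine
# once the OVF_STORE volumes have been rewritten:
systemctl restart ovirt-ha-agent ovirt-ha-broker    # on both nodes
virsh shutdown HostedEngine
virsh undefine HostedEngine
hosted-engine --set-maintenance --mode=none         # the HA agent brings the VM back up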

Hi Alexei,

In order to debug it, check the following:

1. Check gluster:
1.1 All bricks up?
1.2 All bricks healed (gluster volume heal data info summary) and no split-brain

2. Go to the problematic host and check that the mount point is there
2.1. Check permissions (should be vdsm:kvm) and fix with chown -R if needed
2.2. Check from the logs that the OVF_STORE exists
2.3. Check that vdsm can extract the file:
sudo -u vdsm tar -tvf /rhev/data-center/mnt/glusterSD/msk-gluster-facility.xxxx:_data/DOMAIN-UUID/Volume-UUID/Image-ID

3. Configure a virsh alias, as it's quite helpful:
alias virsh='virsh -c qemu:///system?authfile=/etc/ovirt-hosted-engine/virsh_auth.conf'

4. If the VM is running - go to the host and get the xml:
virsh dumpxml HostedEngine > /root/HostedEngine.xml
4.1. Get the Network:
virsh net-dumpxml vdsm-ovirtmgmt > /root/vdsm-ovirtmgmt.xml
4.2 If not, here is mine:
[root@ovirt1 ~]# virsh net-dumpxml vdsm-ovirtmgmt
<network>
  <name>vdsm-ovirtmgmt</name>
  <uuid>7ae538ce-d419-4dae-93b8-3a4d27700227</uuid>
  <forward mode='bridge'/>
  <bridge name='ovirtmgmt'/>
</network>
The UUID is not important, as my first recovery was with a different one.

5. If your Hosted Engine is down:
5.1 Remove the VM (if it exists anywhere) on all nodes:
virsh undefine HostedEngine
5.2 Verify that the nodes are in global maintenance:
hosted-engine --vm-status
5.3 Define the Engine on only 1 machine:
virsh define HostedEngine.xml
virsh net-define vdsm-ovirtmgmt.xml
virsh start HostedEngine

Note: if it complains about the storage - there is no link in /var/run/vdsm/storage/DOMAIN-UUID/Volume-UUID to your Volume-UUID. Here is how mine looks:
[root@ovirt1 808423f9-8a5c-40cd-bc9f-2568c85b8c74]# ll /var/run/vdsm/storage/808423f9-8a5c-40cd-bc9f-2568c85b8c74
total 24
lrwxrwxrwx. 1 vdsm kvm 139 Mar 17 07:42 2c74697a-8bd9-4472-8a98-bf624f3462d5 -> /rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/2c74697a-8bd9-4472-8a98-bf624f3462d5
lrwxrwxrwx. 1 vdsm kvm 139 Mar 17 07:45 3ec27d6d-921c-4348-b799-f50543b6f919 -> /rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/3ec27d6d-921c-4348-b799-f50543b6f919
lrwxrwxrwx. 1 vdsm kvm 139 Mar 17 08:28 441abdc8-6cb1-49a4-903f-a1ec0ed88429 -> /rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/441abdc8-6cb1-49a4-903f-a1ec0ed88429
lrwxrwxrwx. 1 vdsm kvm 139 Mar 17 21:15 8ec7a465-151e-4ac3-92a7-965ecf854501 -> /rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/8ec7a465-151e-4ac3-92a7-965ecf854501
lrwxrwxrwx. 1 vdsm kvm 139 Mar 17 08:28 94ade632-6ecc-4901-8cec-8e39f3d69cb0 -> /rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/94ade632-6ecc-4901-8cec-8e39f3d69cb0
lrwxrwxrwx. 1 vdsm kvm 139 Mar 17 07:42 fe62a281-51e9-4b23-87b3-2deb52357304 -> /rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/fe62a281-51e9-4b23-87b3-2deb52357304
Once you create your link, start it again.

6. Wait till the OVF is fixed (takes more than the setting in the engine :) )

Good Luck!

Best Regards,
Strahil Nikolov

On Monday, March 18, 2019, 12:57:30 PM GMT+2, Николаев Алексей <alexeynikolaev.post@yandex.ru> wrote:

Hi all!

I have a very similar problem after updating one of the two nodes to version 4.3.1. This node77-02 lost connection to the gluster volume named DATA, but not to the volume with the hosted engine.
node77-02 /var/log/messages:

Mar 18 13:40:00 node77-02 journal: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR Failed scanning for OVF_STORE due to Command Volume.getInfo with args {'storagepoolID': '00000000-0000-0000-0000-000000000000', 'storagedomainID': '2ee71105-1810-46eb-9388-cc6caccf9fac', 'volumeID': u'224e4b80-2744-4d7f-bd9f-43eb8fe6cf11', 'imageID': u'43b75b50-cad4-411f-8f51-2e99e52f4c77'} failed:#012(code=201, message=Volume does not exist: (u'224e4b80-2744-4d7f-bd9f-43eb8fe6cf11',))
Mar 18 13:40:00 node77-02 journal: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR Unable to identify the OVF_STORE volume, falling back to initial vm.conf. Please ensure you already added your first data domain for regular VMs

The HostedEngine VM works fine on all nodes. But node77-02 fails with an error in the webUI:

ConnectStoragePoolVDS failed: Cannot find master domain: u'spUUID=5a5cca91-01f8-01af-0297-00000000025f, msdUUID=7d5de684-58ff-4fbc-905d-3048fc55b2b1'

node77-02 vdsm.log:

2019-03-18 13:51:46,287+0300 WARN  (jsonrpc/7) [storage.StorageServer.MountConnection] gluster server u'msk-gluster-facility.xxxx' is not in bricks ['node-msk-gluster203', 'node-msk-gluster205', 'node-msk-gluster201'], possibly mounting duplicate servers (storageServer:317)
2019-03-18 13:51:46,287+0300 INFO  (jsonrpc/7) [storage.Mount] mounting msk-gluster-facility.ipt.fsin.uis:/data at /rhev/data-center/mnt/glusterSD/msk-gluster-facility.xxxx:_data (mount:204)
2019-03-18 13:51:46,474+0300 ERROR (jsonrpc/7) [storage.HSM] Could not connect to storageServer (hsm:2415)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2412, in connectStorageServer
    conObj.connect()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 179, in connect
    six.reraise(t, v, tb)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 171, in connect
    self._mount.mount(self.options, self._vfsType, cgroup=self.CGROUP)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 207, in mount
    cgroup=cgroup)
  File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__
    return callMethod()
  File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 53, in <lambda>
    **kwargs)
  File "<string>", line 2, in mount
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
    raise convert_to_error(kind, result)
MountError: (1, ';Running scope as unit run-10121.scope.\nMount failed. Please check the log file for more details.\n')

------------------------------

2019-03-18 13:51:46,830+0300 ERROR (jsonrpc/4) [storage.TaskManager.Task] (Task='fe81642e-2421-4169-a08b-51467e8f01fe') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in connectStoragePool
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 1035, in connectStoragePool
    spUUID, hostID, msdUUID, masterVersion, domainsMap)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 1097, in _connectStoragePool
    res = pool.connect(hostID, msdUUID, masterVersion)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 700, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1274, in __rebuild
    self.setMasterDomain(msdUUID, masterVersion)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1495, in setMasterDomain
    raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
StoragePoolMasterNotFound: Cannot find master domain: u'spUUID=5a5cca91-01f8-01af-0297-00000000025f, msdUUID=7d5de684-58ff-4fbc-905d-3048fc55b2b1'

What is the best practice to recover from this problem?
participants (4)
- Simone Tiraboschi
- Strahil
- Strahil Nikolov
- Николаев Алексей