Hi Simone,

and thanks for your help.

So far I have found that there is a problem with the local copy of the HostedEngine configuration (see the attached part of vdsm.log).

I found an older XML configuration in an old vdsm.log; defining the VM works, but powering it on fails:

[root@ovirt1 ~]# virsh define hosted-engine.xml
Domain HostedEngine defined from hosted-engine.xml

[root@ovirt1 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     HostedEngine                   shut off

[root@ovirt1 ~]# virsh start HostedEngine
error: Failed to start domain HostedEngine
error: Network not found: no network with matching name 'vdsm-ovirtmgmt'

[root@ovirt1 ~]# virsh net-list --all
 Name                 State      Autostart     Persistent
----------------------------------------------------------
 ;vdsmdummy;          active     no            no
 default              inactive   no            yes

[root@ovirt1 ~]# brctl show
bridge name     bridge id               STP enabled     interfaces
;vdsmdummy;             8000.000000000000       no
ovirtmgmt               8000.bc5ff467f5b3       no              enp2s0

[root@ovirt1 ~]# ip a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq master ovirtmgmt state UP group default qlen 1000
    link/ether bc:5f:f4:67:f5:b3 brd ff:ff:ff:ff:ff:ff
3: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether f6:78:c7:2d:32:f9 brd ff:ff:ff:ff:ff:ff
4: br-int: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 66:36:dd:63:dc:48 brd ff:ff:ff:ff:ff:ff
20: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether bc:5f:f4:67:f5:b3 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.90/24 brd 192.168.1.255 scope global ovirtmgmt
       valid_lft forever preferred_lft forever
    inet 192.168.1.243/24 brd 192.168.1.255 scope global secondary ovirtmgmt
       valid_lft forever preferred_lft forever
    inet6 fe80::be5f:f4ff:fe67:f5b3/64 scope link 
       valid_lft forever preferred_lft forever
21: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether ce:36:8d:b7:64:bd brd ff:ff:ff:ff:ff:ff


192.168.1.243/24 is one of the IPs managed by CTDB.


So now comes the question: is there an XML definition of the network somewhere in the logs?
My hope is to power up the HostedEngine properly so that it pushes all the configuration back to the right places... maybe that is way too optimistic.
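
In case no such XML exists, I guess I could try to recreate the missing libvirt network by hand, since the ovirtmgmt bridge itself is already there. A rough sketch of what I have in mind (the network name and bridge mapping are only my assumption of what vdsm normally creates, so please correct me if this is the wrong approach):

[root@ovirt1 ~]# cat > /root/vdsm-ovirtmgmt.xml <<'EOF'
<network>
  <!-- assumption: expose the existing ovirtmgmt bridge to libvirt
       under the network name the HostedEngine XML refers to -->
  <name>vdsm-ovirtmgmt</name>
  <forward mode='bridge'/>
  <bridge name='ovirtmgmt'/>
</network>
EOF
[root@ovirt1 ~]# virsh net-define /root/vdsm-ovirtmgmt.xml
[root@ovirt1 ~]# virsh net-start vdsm-ovirtmgmt
[root@ovirt1 ~]# virsh net-autostart vdsm-ovirtmgmt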

At least I have learned a lot about oVirt.

Best Regards,
Strahil Nikolov



On Thursday, March 7, 2019, 17:55:12 GMT+2, Simone Tiraboschi <stirabos@redhat.com> wrote:

On Thu, Mar 7, 2019 at 2:54 PM Strahil Nikolov <hunter86_bg@yahoo.com> wrote:



>The OVF_STORE volume is going to get periodically recreated by the engine so at least you need a running engine.

>In order to avoid this kind of issue we have two OVF_STORE disks, in your case:

>MainThread::INFO::2019-03-06 06:50:02,391::ovf_store::120::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(scan) Found >OVF_STORE: imgUUID:441abdc8-6cb1-49a4-903f-a1ec0ed88429, volUUID:c3309fc0-8707-4de1-903d-8d4bbb024f81
>MainThread::INFO::2019-03-06 06:50:02,748::ovf_store::120::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(scan) Found >OVF_STORE: imgUUID:94ade632-6ecc-4901-8cec-8e39f3d69cb0, volUUID:9460fc4b-54f3-48e3-b7b6-da962321ecf4

>Can you please check if you have at least the second copy?

The second copy is empty too:
[root@ovirt1 ~]# ll /rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/441abdc8-6cb1-49a4-903f-a1ec0ed88429
total 66561
-rw-rw----. 1 vdsm kvm       0 Mar  4 05:23 c3309fc0-8707-4de1-903d-8d4bbb024f81
-rw-rw----. 1 vdsm kvm 1048576 Jan 31 13:24 c3309fc0-8707-4de1-903d-8d4bbb024f81.lease
-rw-r--r--. 1 vdsm kvm     435 Mar  4 05:24 c3309fc0-8707-4de1-903d-8d4bbb024f81.meta
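
As far as I understand, a healthy OVF_STORE volume is simply a tar archive containing the OVF files of the VMs, so (assuming the second copy's image/volume UUIDs from the log lines quoted above) a non-empty copy could be inspected like this:

[root@ovirt1 ~]# cd /rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/94ade632-6ecc-4901-8cec-8e39f3d69cb0
[root@ovirt1 94ade632-6ecc-4901-8cec-8e39f3d69cb0]# tar -tvf 9460fc4b-54f3-48e3-b7b6-da962321ecf4            # list the stored OVF files
[root@ovirt1 94ade632-6ecc-4901-8cec-8e39f3d69cb0]# tar -xOf 9460fc4b-54f3-48e3-b7b6-da962321ecf4 <vmId>.ovf  # dump one VM's OVF to stdout

In my case, however, the data file is 0 bytes, so there is nothing to extract.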



>And even in the case you lost both, we are storing on the shared storage the initial vm.conf:
>MainThread::ERROR::2019-03-06 >06:50:02,971::config_ovf::70::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm::>(_get_vm_conf_content_from_ovf_store) Failed extracting VM OVF from the OVF_STORE volume, falling back to initial vm.conf

>Can you please check what do you have in /var/run/ovirt-hosted-engine-ha/vm.conf ?
 
It exists and has the following:

[root@ovirt1 ~]# cat /var/run/ovirt-hosted-engine-ha/vm.conf
# Editing the hosted engine VM is only possible via the manager UI\API
# This file was generated at Thu Mar  7 15:37:26 2019

vmId=8474ae07-f172-4a20-b516-375c73903df7
memSize=4096
display=vnc
devices={index:2,iface:ide,address:{ controller:0, target:0,unit:0, bus:1, type:drive},specParams:{},readonly:true,deviceId:,path:,device:cdrom,shared:false,type:disk}
devices={index:0,iface:virtio,format:raw,poolID:00000000-0000-0000-0000-000000000000,volumeID:a9ab832f-c4f2-4b9b-9d99-6393fd995979,imageID:8ec7a465-151e-4ac3-92a7-965ecf854501,specParams:{},readonly:false,domainID:808423f9-8a5c-40cd-bc9f-2568c85b8c74,optional:false,deviceId:a9ab832f-c4f2-4b9b-9d99-6393fd995979,address:{bus:0x00, slot:0x06, domain:0x0000, type:pci, function:0x0},device:disk,shared:exclusive,propagateErrors:off,type:disk,bootOrder:1}
devices={device:scsi,model:virtio-scsi,type:controller}
devices={nicModel:pv,macAddr:00:16:3e:62:72:c8,linkActive:true,network:ovirtmgmt,specParams:{},deviceId:,address:{bus:0x00, slot:0x03, domain:0x0000, type:pci, function:0x0},device:bridge,type:interface}
devices={device:console,type:console}
devices={device:vga,alias:video0,type:video}
devices={device:vnc,type:graphics}
vmName=HostedEngine
spiceSecureChannels=smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir
smp=1
maxVCpus=8
cpuType=Opteron_G5
emulatedMachine=emulated_machine_list.json['values']['system_option_value'][0]['value'].replace('[','').replace(']','').split(', ')|first
devices={device:virtio,specParams:{source:urandom},model:virtio,type:rng}

You should be able to copy it to /root/myvm.conf and start the engine VM with:
hosted-engine --vm-start --vm-conf=/root/myvm.conf
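
For example (just a sketch using the path above; the cluster can stay in global maintenance while you test this, since you are starting the VM by hand anyway):

[root@ovirt1 ~]# cp /var/run/ovirt-hosted-engine-ha/vm.conf /root/myvm.conf
[root@ovirt1 ~]# hosted-engine --vm-start --vm-conf=/root/myvm.conf
[root@ovirt1 ~]# hosted-engine --vm-status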
 
Also, I think this happened when I was upgrading ovirt1 (the last host in the gluster cluster) from 4.3.0 to 4.3.1. The engine got restarted because I forgot to enable global maintenance.

>Sorry, I don't understand
>Can you please explain what happened?

I updated the engine first -> all OK; next was the arbiter -> again no issues with it.
Next was the empty host (ovirt2) and everything went OK.
After that I migrated the engine to ovirt2 and tried to update ovirt1.
The web UI showed that the installation failed, but "yum update" worked.
During the yum update of ovirt1, the engine application crashed and restarted on ovirt2.
After the reboot of ovirt1 I noticed the error about pinging the gateway, so I stopped the engine and stopped the following services on both hosts (global maintenance):
ovirt-ha-agent ovirt-ha-broker vdsmd supervdsmd sanlock

Next I reinitialized the sanlock lockspace via 'sanlock direct -s'.
In the end I managed to power on the hosted engine and it ran for a while.

As the errors did not stop, I decided to shut everything down, power it back up, heal gluster and see what would happen.

Currently I'm not able to power up the engine:


[root@ovirt1 ovirt-hosted-engine-ha]# hosted-engine --vm-status


!! Cluster is in GLOBAL MAINTENANCE mode !!

Please note that in global maintenance mode nothing will try to start the engine VM for you.
I assume you tried to exit global maintenance mode, or at least tried to start it manually with hosted-engine --vm-start, right?
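
If you want the HA agents to start it for you instead, you first have to leave global maintenance, e.g.:

[root@ovirt1 ~]# hosted-engine --set-maintenance --mode=none
[root@ovirt1 ~]# hosted-engine --vm-status

Otherwise, while staying in global maintenance, only a manual hosted-engine --vm-start will bring the VM up.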
 



--== Host ovirt1.localdomain (id: 1) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : ovirt1.localdomain
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 45e6772b
local_conf_timestamp               : 288
Host timestamp                     : 287
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=287 (Thu Mar  7 15:34:06 2019)
        host-id=1
        score=3400
        vm_conf_refresh_time=288 (Thu Mar  7 15:34:07 2019)
        conf_on_shared_storage=True
        maintenance=False
        state=GlobalMaintenance
        stopped=False


--== Host ovirt2.localdomain (id: 2) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : ovirt2.localdomain
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 2e9a0444
local_conf_timestamp               : 3886
Host timestamp                     : 3885
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=3885 (Thu Mar  7 15:34:05 2019)
        host-id=2
        score=3400
        vm_conf_refresh_time=3886 (Thu Mar  7 15:34:06 2019)
        conf_on_shared_storage=True
        maintenance=False
        state=GlobalMaintenance
        stopped=False


!! Cluster is in GLOBAL MAINTENANCE mode !!

[root@ovirt1 ovirt-hosted-engine-ha]# hosted-engine --vm-start
Command VM.getStats with args {'vmID': '8474ae07-f172-4a20-b516-375c73903df7'} failed:
(code=1, message=Virtual machine does not exist: {'vmId': u'8474ae07-f172-4a20-b516-375c73903df7'})
[root@ovirt1 ovirt-hosted-engine-ha]# hosted-engine --vm-start
VM exists and is down, cleaning up and restarting

[root@ovirt1 ovirt-hosted-engine-ha]# hosted-engine --vm-status


!! Cluster is in GLOBAL MAINTENANCE mode !!



--== Host ovirt1.localdomain (id: 1) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : ovirt1.localdomain
Host ID                            : 1
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "Down"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 6b086b7c
local_conf_timestamp               : 328
Host timestamp                     : 327
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=327 (Thu Mar  7 15:34:46 2019)
        host-id=1
        score=3400
        vm_conf_refresh_time=328 (Thu Mar  7 15:34:47 2019)
        conf_on_shared_storage=True
        maintenance=False
        state=GlobalMaintenance
        stopped=False


--== Host ovirt2.localdomain (id: 2) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : ovirt2.localdomain
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : c5890e9c
local_conf_timestamp               : 3926
Host timestamp                     : 3925
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=3925 (Thu Mar  7 15:34:45 2019)
        host-id=2
        score=3400
        vm_conf_refresh_time=3926 (Thu Mar  7 15:34:45 2019)
        conf_on_shared_storage=True
        maintenance=False
        state=GlobalMaintenance
        stopped=False


!! Cluster is in GLOBAL MAINTENANCE mode !!

[root@ovirt1 ovirt-hosted-engine-ha]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     HostedEngine                   shut off

I am really puzzled as to why both volumes were wiped out.

This is really scary: can you please double-check the gluster logs for warnings and errors?
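
Something like this should be enough for a first check (assuming the volume is simply called 'engine', as the mount path suggests):

[root@ovirt1 ~]# gluster volume heal engine info
[root@ovirt1 ~]# gluster volume heal engine info split-brain
[root@ovirt1 ~]# grep -iE 'error|warn' /var/log/glusterfs/*.log /var/log/glusterfs/bricks/*.log | tail -n 50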

 


Best Regards,
Strahil Nikolov