hosted-engine vm-status stale data and cluster seems "broken"

See linked images please. As you can see, all three nodes are reporting stale data. The results of this are:

1. Not all VMs migrate seamlessly in the cluster. Sometimes I have to shut them down to get them to be able to migrate again.

2. The hosted engine refuses to move due to constraints (image). This part doesn't make sense to me, because I can forcefully shut it down and then go directly on a hosted engine node and bring it back up. Also, the Web UI shows all nodes under the cluster, but then it thinks the engine is not part of the cluster.

3. Time is in sync (image).

4. Storage is 100% fine. The Gluster back end reports mirroring and status 'started'. No split brain has occurred, and the oVirt nodes have never lost connectivity to storage.

5. I reinstalled all three nodes. For some reason only node 3 still shows as having updates available (image). For clarity, I did not click "upgrade"; I simply did 'reinstall' from the Web UI. Having looked at the output and yum.log from /var/log, it almost looks like it did do an update. All package versions across all three nodes are the same (with respect to ovirt/vdsm) (image; a quick way to reproduce that comparison is sketched below). For some reason, though, ovirt-engine-appliance-3.6-20160126.1.el7.centos.noarch exists on node 1 but not on node 2 or 3. Could this be relevant? I don't recall installing that specifically on node 1, but I may have.

I've been slamming my head on this, so I am hoping you can provide some assistance.

http://imgur.com/a/6xkaS

Thanks!
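For the package comparison in point 5, something like the following on each node reproduces the check (a rough sketch; the grep pattern is my own and may need widening):

# list ovirt/vdsm related packages, sorted so the output can be diffed across nodes
rpm -qa | grep -Ei 'ovirt|vdsm' | sort

Diffing the sorted output from each node is what makes a stray package, like the appliance RPM on node 1, stand out.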

Actually, to add to this: it previously said that I couldn't migrate the hosted engine because the two nodes it wasn't running on were not part of the cluster, and that it couldn't migrate it to the node it was on because it was already on that node. This image is from after forcefully shutting it down and bringing it up manually on node 2 via hosted-engine --vm-start.

I had similar errors with a single host and a hosted-engine VM. My case may be totally different, but one thing you could try first is to check that the VM is really up. In my case, the VM was shown by the hosted-engine command as up, but was actually down. With the vdsClient command you can check its status in more detail. What is the result of the following command for you?

vdsClient -s 0 list
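If you want more detail on a single VM, something along these lines should also work (a sketch; I'm going from memory on the vdsClient verbs for 3.6, so treat them as assumptions):

# compact one-line-per-VM listing
vdsClient -s 0 list table

# detailed runtime stats for one VM, using the ID from the listing
vdsClient -s 0 getVmStats <vmId>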

It is up. I can do "ps -Aef | grep -i qemu-kvm | grep -i hosted" and see it running. I also forcefully shut it down with hosted-engine --vm-stop when it was on node 1 and then did --vm-start on node 2, and it came up. The Web UI is also reachable, so that's how I know the hosted engine VM is running.

Anyone have any other possible information on this? I've noticed this issue before, and usually it just takes a bit of time for the cluster to 'settle' after some node reboots, but it's been a few days and it's still marked as stale.

--== Host 1 status ==--

Status up-to-date : False
Hostname : njsevcnp01
Host ID : 1
Engine status : unknown stale-data
Score : 0
stopped : True
Local maintenance : False
crc32 : 260dbf06
Host timestamp : 327

--== Host 2 status ==--

Status up-to-date : False
Hostname : njsevcnp02
Host ID : 2
Engine status : unknown stale-data
Score : 0
stopped : True
Local maintenance : False
crc32 : 25da07df
Host timestamp : 3030

--== Host 3 status ==--

Status up-to-date : False
Hostname : njsevcnp03
Host ID : 3
Engine status : unknown stale-data
Score : 0
stopped : True
Local maintenance : False
crc32 : c67818cb
Host timestamp : 10877406

&& vdsClient on node 2 showing the hosted engine is up on node 2:

48207078-8cb0-413c-8984-40aa772f4d94
    Status = Up
    nicModel = rtl8139,pv
    statusTime = 4540044460
    emulatedMachine = pc
    pid = 30571
    vmName = HostedEngine
    devices = [{'device': 'memballoon', 'specParams': {'model': 'none'}, 'type': 'balloon', 'alias': 'balloon0'}, {'alias': 'scsi0', 'deviceId': '17f10db1-2e9e-4422-9ea5-61a628072e29', 'address': {'slot': '0x04', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'scsi', 'model': 'virtio-scsi', 'type': 'controller'}, {'device': 'usb', 'alias': 'usb', 'type': 'controller', 'deviceId': '9be34ac0-7d00-4a95-bdfe-5b328fc1355b', 'address': {'slot': '0x01', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x2'}}, {'device': 'ide', 'alias': 'ide', 'type': 'controller', 'deviceId': '222629a8-0dd6-4e8e-9b42-43aac314c0c2', 'address': {'slot': '0x01', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x1'}}, {'device': 'virtio-serial', 'alias': 'virtio-serial0', 'type': 'controller', 'deviceId': '7cbccd04-853a-408f-94c2-5b10b641b7af', 'address': {'slot': '0x05', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}}, {'device': 'vnc', 'specParams': {'spiceSecureChannels': 'smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir', 'displayIp': '0'}, 'type': 'graphics', 'port': '5900'}, {'nicModel': 'pv', 'macAddr': '00:16:3e:16:83:91', 'linkActive': True, 'network': 'ovirtmgmt', 'alias': 'net0', 'deviceId': '3f679659-142c-41f3-a69d-4264d7234fbc', 'address': {'slot': '0x03', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'bridge', 'type': 'interface', 'name': 'vnet0'}, {'address': {'slot': '0x06', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'volumeInfo': {'domainID': 'c6323975-2966-409d-b9e0-48370a513a98', 'volType': 'path', 'leaseOffset': 0, 'volumeID': 'aa66d378-5a5f-490c-b0ab-993b79838d95', 'leasePath': '/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt36__engine/c6323975-2966-409d-b9e0-48370a513a98/images/8518ef4a-7b17-4291-856c-81875ba4e264/aa66d378-5a5f-490c-b0ab-993b79838d95.lease', 'imageID': '8518ef4a-7b17-4291-856c-81875ba4e264', 'path': '/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt36__engine/c6323975-2966-409d-b9e0-48370a513a98/images/8518ef4a-7b17-4291-856c-81875ba4e264/aa66d378-5a5f-490c-b0ab-993b79838d95'}, 'index': '0', 'iface': 'virtio', 'apparentsize': '10737418240', 'imageID': '8518ef4a-7b17-4291-856c-81875ba4e264', 'readonly': 'False', 'shared': 'exclusive', 'truesize': '6899802112', 'type': 'disk', 'domainID': 'c6323975-2966-409d-b9e0-48370a513a98', 'reqsize': '0', 'format': 'raw', 'deviceId': '8518ef4a-7b17-4291-856c-81875ba4e264', 'poolID': '00000000-0000-0000-0000-000000000000', 'device': 'disk', 'path': '/var/run/vdsm/storage/c6323975-2966-409d-b9e0-48370a513a98/8518ef4a-7b17-4291-856c-81875ba4e264/aa66d378-5a5f-490c-b0ab-993b79838d95', 'propagateErrors': 'off', 'name': 'vda', 'bootOrder': '1', 'volumeID': 'aa66d378-5a5f-490c-b0ab-993b79838d95', 'alias': 'virtio-disk0', 'volumeChain': [{'domainID': 'c6323975-2966-409d-b9e0-48370a513a98', 'volType': 'path', 'leaseOffset': 0, 'volumeID': 'aa66d378-5a5f-490c-b0ab-993b79838d95', 'leasePath': '/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt36__engine/c6323975-2966-409d-b9e0-48370a513a98/images/8518ef4a-7b17-4291-856c-81875ba4e264/aa66d378-5a5f-490c-b0ab-993b79838d95.lease', 'imageID': '8518ef4a-7b17-4291-856c-81875ba4e264', 'path': '/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt36__engine/c6323975-2966-409d-b9e0-48370a513a98/images/8518ef4a-7b17-4291-856c-81875ba4e264/aa66d378-5a5f-490c-b0ab-993b79838d95'}]}, {'index': '2', 'iface': 'ide', 'name': 'hdc', 'alias': 'ide0-1-0', 'readonly': 'True', 'deviceId': '8c3179ac-b322-4f5c-9449-c52e3665e0ae', 'address': {'bus': '1', 'controller': '0', 'type': 'drive', 'target': '0', 'unit': '0'}, 'device': 'cdrom', 'shared': 'false', 'path': '', 'type': 'disk'}, {'device': 'unix', 'alias': 'channel0', 'type': 'channel', 'address': {'bus': '0', 'controller': '0', 'type': 'virtio-serial', 'port': '1'}}, {'device': 'unix', 'alias': 'channel1', 'type': 'channel', 'address': {'bus': '0', 'controller': '0', 'type': 'virtio-serial', 'port': '2'}}, {'device': 'unix', 'alias': 'channel2', 'type': 'channel', 'address': {'bus': '0', 'controller': '0', 'type': 'virtio-serial', 'port': '3'}}, {'device': '', 'alias': 'video0', 'type': 'video', 'address': {'slot': '0x02', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}}]
    guestDiskMapping = {'8518ef4a-7b17-4291-8': {'name': '/dev/vda'}, 'QEMU_DVD-ROM_QM00003': {'name': '/dev/sr0'}}
    vmType = kvm
    displaySecurePort = -1
    memSize = 4096
    displayPort = 5900
    clientIp =
    spiceSecureChannels = smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir
    smp = 4
    displayIp = 0
    display = vnc
    pauseCode = NOERR

Are the ovirt-ha-agent and ovirt-ha-broker services running on all the nodes? If they are, check the agent.log and broker.log for errors.
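Concretely, something like this on each node (log paths per the standard hosted-engine-ha packaging; the grep is just a starting point):

# service state of both HA daemons
systemctl status ovirt-ha-agent ovirt-ha-broker

# most recent errors from each daemon's log
grep -i error /var/log/ovirt-hosted-engine-ha/agent.log | tail -n 20
grep -i error /var/log/ovirt-hosted-engine-ha/broker.log | tail -n 20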

Hi,

is there anything interesting in the hosted engine log files? /var/log/ovirt-hosted-engine-ha/agent.log

There should be something appearing there every 10 seconds or faster.

What is the status of the hosted engine services? systemctl status ovirt-ha-agent ovirt-ha-broker

Best regards

--
Martin Sivak
SLA / oVirt

Martin -

One thing I noticed on all of the nodes is this:

Jun 14 08:11:11 njsevcnp01 ovirt-ha-agent[15713]: ovirt-ha-agent ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection closed: Connection timed out
Jun 14 08:11:11 njsevcnp01.fixflyer.com ovirt-ha-agent[15713]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent

Then the agent is restarted:

[root@njsevcnp01 ~]# ps -Aef | grep -i ovirt-ha-agent | grep -iv grep
vdsm 15713 1 0 08:09 ? 00:00:01 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon

I don't know why the connection would time out; as you can see, that log is from node01, and I can't figure out what it is timing out on.

The other interesting thing is this log from node01. The odd thing is that it seems there is some split brain somewhere in oVirt, because this log is from node01 but when it asks about the other hosts it gets back 'vm not running on this host' rather than 'stale data'. But I don't know engine internals.

MainThread::INFO::2016-06-14 08:13:05,163::state_machine::171::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host njsevcnp02 (id 2): {hostname: njsevcnp02, host-id: 2, engine-status: {reason: vm not running on this host, health: bad, vm: down, detail: unknown}, score: 0, stopped: True, maintenance: False, crc32: 25da07df, host-ts: 3030}
MainThread::INFO::2016-06-14 08:13:05,163::state_machine::171::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host njsevcnp03 (id 3): {hostname: njsevcnp03, host-id: 3, engine-status: {reason: vm not running on this host, health: bad, vm: down, detail: unknown}, score: 0, stopped: True, maintenance: False, crc32: c67818cb, host-ts: 10877406}

And that same log on node02, where the engine is running:

MainThread::INFO::2016-06-14 08:15:44,451::state_machine::171::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host njsevcnp01 (id 1): {hostname: njsevcnp01, host-id: 1, engine-status: {reason: vm not running on this host, health: bad, vm: down, detail: unknown}, score: 0, stopped: True, maintenance: False, crc32: 260dbf06, host-ts: 327}
MainThread::INFO::2016-06-14 08:15:44,451::state_machine::171::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host njsevcnp03 (id 3): {hostname: njsevcnp03, host-id: 3, engine-status: {reason: vm not running on this host, health: bad, vm: down, detail: unknown}, score: 0, stopped: True, maintenance: False, crc32: c67818cb, host-ts: 10877406}
MainThread::INFO::2016-06-14 08:15:44,451::state_machine::174::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 2): {engine-health: {health: good, vm: up, detail: up}, bridge: True, mem-free: 20702.0, maintenance: False, cpu-load: None, gateway: True}
MainThread::INFO::2016-06-14 08:15:44,452::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1465906544.45 type=state_transition detail=StartState-ReinitializeFSM hostname=njsevcnp02
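One more thing worth checking is whether anything is actually reachable on the broker's listener. A sketch, assuming the default UNIX socket under /var/run/ovirt-hosted-engine-ha/ used by the 3.6 packaging (adjust if yours differs):

# is the broker listening on its UNIX socket?
ss -xl | grep -i hosted-engine

# and which processes currently hold connections to it?
lsof -U 2>/dev/null | grep -i hosted-engine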

Jun 14 08:11:11 njsevcnp01 ovirt-ha-agent[15713]: ovirt-ha-agent ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection closed: Connection timed out
Jun 14 08:11:11 njsevcnp01.fixflyer.com ovirt-ha-agent[15713]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
Broker is broken or down. Check the status of the ovirt-ha-broker service.
The other interesting thing is this log from node01. The odd thing is that it seems there is some split brain somewhere in oVirt, because this log is from node01 but when it asks about the other hosts it gets back 'vm not running on this host' rather than 'stale data'. But I don't know engine internals.
This is another piece that points to broker or storage issues. The agent collects local data and then publishes it to the other nodes through the broker. So it is possible for the agent to know the status of the VM locally, but not be able to publish it. The hosted-engine command line tool then reads the same synchronization whiteboard, does not see anything that was never published, and ends up reporting stale data.
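If you want to look at the whiteboard itself, it is just a file on the hosted-engine storage domain. A sketch from memory of the 3.6 on-disk layout, so treat the path as an assumption:

# the published metadata ("whiteboard") lives under ha_agent/ on the engine storage domain
ls -l /rhev/data-center/mnt/*/*/ha_agent/
strings /rhev/data-center/mnt/*/*/ha_agent/hosted-engine.metadata | head -n 40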
What is the status of the hosted engine services? systemctl status ovirt-ha-agent ovirt-ha-broker
Please check the services.

Best regards

Martin
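If they look wedged, a restart in dependency order is the usual next step (the order is my assumption: the broker first, since the agent connects to it):

systemctl restart ovirt-ha-broker
systemctl restart ovirt-ha-agent

# then watch both daemons settle
journalctl -u ovirt-ha-broker -u ovirt-ha-agent --since "15 min ago"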

Actually, broker is the only thing acting "right" between broker and agent. Broker is up when I bring the system up, but the agent is restarting all the time. Have a look; the 11th is when I restarted this node after doing 'reinstall' in the web UI:

● ovirt-ha-broker.service - oVirt Hosted Engine High Availability Communications Broker
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service; enabled; vendor preset: disabled)
   Active: active (running) since Sat 2016-06-11 13:09:51 EDT; 3 days ago
 Main PID: 1285 (ovirt-ha-broker)
   CGroup: /system.slice/ovirt-ha-broker.service
           └─1285 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker --no-daemon

Jun 15 09:23:56 njsevcnp01 ovirt-ha-broker[1285]: INFO:mgmt_bridge.MgmtBridge:Found bridge ovirtmgmt with ports
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:mem_free.MemFree:memFree: 26408

Uptime of the process:

# ps -Aef | grep -i broker
vdsm 1285 1 2 Jun11 ? 02:27:50 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker --no-daemon

But the agent is restarting all the time:

# ps -Aef | grep -i ovirt-ha-agent
vdsm 76116 1 0 09:19 ? 00:00:01 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon

9:19 AM ET is the last restart. Even the logs say it:

[root@njsevcnp01 ovirt-hosted-engine-ha]# grep -i 'restarting agent' agent.log | wc -l
232719

And it restarts roughly every 35 seconds:

[root@njsevcnp01 ovirt-hosted-engine-ha]# tail -n 300 agent.log | grep -i 'restarting agent'
MainThread::WARNING::2016-06-15 09:23:53,029::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '6'
MainThread::WARNING::2016-06-15 09:24:28,953::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '7'
MainThread::WARNING::2016-06-15 09:25:04,879::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '8'
MainThread::WARNING::2016-06-15 09:25:40,790::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '9'
MainThread::WARNING::2016-06-15 09:26:17,136::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '0'
MainThread::WARNING::2016-06-15 09:26:53,063::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '1'

The full log of a restart is like this, saying "connection timed out", but it does not say *what* is timing out, so I have nothing else to go on here:

[root@njsevcnp01 ovirt-hosted-engine-ha]# tail -n 300 agent.log | grep -i restart
MainThread::ERROR::2016-06-15 09:24:23,948::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15 09:24:28,953::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '7'
MainThread::ERROR::2016-06-15 09:24:59,874::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15 09:25:04,879::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '8'
MainThread::ERROR::2016-06-15 09:25:35,785::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15 09:25:40,790::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '9'
MainThread::ERROR::2016-06-15 09:26:12,131::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15 09:26:17,136::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '0'
MainThread::ERROR::2016-06-15 09:26:48,058::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15 09:26:53,063::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '1'
MainThread::ERROR::2016-06-15 09:27:23,969::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15 09:27:28,973::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '2'

Storage is also completely fine. There are no logs of anything "going away" or having issues. The engine has a dedicated NFS NAS device, while VM storage is a completely separate storage cluster. Storage has a 100% dedicated backend network with no changes being made.
Jun 14 08:11:11 njsevcnp01 ovirt-ha-agent[15713]: ovirt-ha-agent ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection closed: Connection timed out
Jun 14 08:11:11 njsevcnp01.fixflyer.com ovirt-ha-agent[15713]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
Broker is broken or down. Check the status of ovirt-ha-broker service.
The other interesting thing is this log from node01. The odd thing is that it seems there is some split brain somewhere in oVirt, because this log is from node02 but it is asking the engine and it's getting back 'vm not running on this host' rather than 'stale data'. But I don't know engine internals.
This is another piece that points to broker or storage issues. The agent collects local data and then publishes it to the other nodes through the broker. So it is possible for the agent to know the status of the VM locally, but not be able to publish it.

The hosted-engine command line tool then reads the same synchronization whiteboard, but it does not see anything that was not published and ends up reporting stale data.
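A rough way to watch the whiteboard from any node (a sketch; the grepped field names are assumptions based on the 3.6 --vm-status output format):

# run this twice, ~30 seconds apart; a host whose timestamp does not
# advance between runs is the one failing to publish through its broker
hosted-engine --vm-status | grep -Ei 'hostname|timestamp'

The wildly different host-ts values in the refresh lines quoted below (327, 3030, 10877406) fit that picture.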
What is the status of the hosted engine services?

systemctl status ovirt-ha-agent ovirt-ha-broker
Please check the services.
Best regards
Martin
On Tue, Jun 14, 2016 at 2:16 PM, Charles Kozler <charles@fixflyer.com> wrote:

Martin -
One thing I noticed on all of the nodes is this:
Jun 14 08:11:11 njsevcnp01 ovirt-ha-agent[15713]: ovirt-ha-agent ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection closed: Connection timed out
Jun 14 08:11:11 njsevcnp01.fixflyer.com ovirt-ha-agent[15713]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
Then the agent is restarted
[root@njsevcnp01 ~]# ps -Aef | grep -i ovirt-ha-agent | grep -iv grep
vdsm     15713     1  0 08:09 ?        00:00:01 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon
I don't know why the connection would time out because, as you can see, that log is from node01, and I can't figure out why it's timing out on the connection.
The other interesting thing is this log from node01. The odd thing is that it seems there is some split brain somewhere in oVirt, because this log is from node02 but it is asking the engine and it's getting back 'vm not running on this host' rather than 'stale data'. But I don't know engine internals.
MainThread::INFO::2016-06-14 08:13:05,163::state_machine::171::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host njsevcnp02 (id 2): {hostname: njsevcnp02, host-id: 2, engine-status: {reason: vm not running on this host, health: bad, vm: down, detail: unknown}, score: 0, stopped: True, maintenance: False, crc32: 25da07df, host-ts: 3030}
MainThread::INFO::2016-06-14 08:13:05,163::state_machine::171::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host njsevcnp03 (id 3): {hostname: njsevcnp03, host-id: 3, engine-status: {reason: vm not running on this host, health: bad, vm: down, detail: unknown}, score: 0, stopped: True, maintenance: False, crc32: c67818cb, host-ts: 10877406}
And that same log on node02 where the engine is running
MainThread::INFO::2016-06-14 08:15:44,451::state_machine::171::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host njsevcnp01 (id 1): {hostname: njsevcnp01, host-id: 1, engine-status: {reason: vm not running on this host, health: bad, vm: down, detail: unknown}, score: 0, stopped: True, maintenance: False, crc32: 260dbf06, host-ts: 327}
MainThread::INFO::2016-06-14 08:15:44,451::state_machine::171::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host njsevcnp03 (id 3): {hostname: njsevcnp03, host-id: 3, engine-status: {reason: vm not running on this host, health: bad, vm: down, detail: unknown}, score: 0, stopped: True, maintenance: False, crc32: c67818cb, host-ts: 10877406}
MainThread::INFO::2016-06-14 08:15:44,451::state_machine::174::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 2): {engine-health: {health: good, vm: up, detail: up}, bridge: True, mem-free: 20702.0, maintenance: False, cpu-load: None, gateway: True}
MainThread::INFO::2016-06-14 08:15:44,452::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1465906544.45 type=state_transition detail=StartState-ReinitializeFSM hostname=njsevcnp02
On Tue, Jun 14, 2016 at 7:59 AM, Martin Sivak <msivak@redhat.com> wrote:
Hi,
is there anything interesting in the hosted engine log files?

/var/log/ovirt-hosted-engine-ha/agent.log
There should be something appearing there every 10 seconds or faster.
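For example, following it live should make any restart loop obvious immediately:

tail -f /var/log/ovirt-hosted-engine-ha/agent.log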
What is the status of the hosted engine services?

systemctl status ovirt-ha-agent ovirt-ha-broker
Best regards
--
Martin Sivak
SLA / oVirt

Charles, check the broker log too please. It is possible that the broker process is running, but is not accepting connections, for example.

Martin
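For example, something like this should surface any broker-side failures quickly (broker.log sits alongside agent.log in /var/log/ovirt-hosted-engine-ha/):

grep -iE 'error|traceback' /var/log/ovirt-hosted-engine-ha/broker.log | tail -n 50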

Martin -

Anything I should be looking for specifically? The only errors I see are smtp errors when it tries to send a notification, but nothing indicating what the notification is / might be. I see this repeated about every minute:

Thread-482115::INFO::2016-06-14 12:58:54,431::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-482109::INFO::2016-06-14 12:58:54,491::storage_backends::120::ovirt_hosted_engine_ha.lib.storage_backends::(_check_symlinks) Cleaning up stale LV link '/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt36__engine/c6323975-2966-409d-b9e0-48370a513a98/ha_agent/hosted-engine.lockspace'
Thread-482109::INFO::2016-06-14 12:58:54,515::storage_backends::120::ovirt_hosted_engine_ha.lib.storage_backends::(_check_symlinks) Cleaning up stale LV link '/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt36__engine/c6323975-2966-409d-b9e0-48370a513a98/ha_agent/hosted-engine.metadata'

nas01 is the primary storage for the engine (as previously noted):

Thread-482175::INFO::2016-06-14 12:59:30,398::storage_backends::120::ovirt_hosted_engine_ha.lib.storage_backends::(_check_symlinks) Cleaning up stale LV link '/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt36__engine/c6323975-2966-409d-b9e0-48370a513a98/ha_agent/hosted-engine.lockspace'
Thread-482175::INFO::2016-06-14 12:59:30,429::storage_backends::120::ovirt_hosted_engine_ha.lib.storage_backends::(_check_symlinks) Cleaning up stale LV link '/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt36__engine/c6323975-2966-409d-b9e0-48370a513a98/ha_agent/hosted-engine.metadata'

But otherwise the broker looks like it's accepting and handling connections:

Thread-481980::INFO::2016-06-14 12:59:33,105::mem_free::53::mem_free.MemFree::(action) memFree: 26491
Thread-482193::INFO::2016-06-14 12:59:33,977::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-482193::INFO::2016-06-14 12:59:34,033::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-482194::INFO::2016-06-14 12:59:34,034::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-482194::INFO::2016-06-14 12:59:34,035::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-482195::INFO::2016-06-14 12:59:34,035::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-482195::INFO::2016-06-14 12:59:34,036::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-482196::INFO::2016-06-14 12:59:34,037::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-482196::INFO::2016-06-14 12:59:34,037::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-482197::INFO::2016-06-14 12:59:38,544::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-482197::INFO::2016-06-14 12:59:38,598::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-482198::INFO::2016-06-14 12:59:38,598::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-482198::INFO::2016-06-14 12:59:38,599::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-482199::INFO::2016-06-14 12:59:38,600::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-482199::INFO::2016-06-14 12:59:38,600::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-482200::INFO::2016-06-14 12:59:38,601::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-482200::INFO::2016-06-14 12:59:38,602::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-482179::INFO::2016-06-14 12:59:40,339::cpu_load_no_engine::121::cpu_load_no_engine.EngineHealth::(calculate_load) System load total=0.0078, engine=0.0000, non-engine=0.0078
Thread-482178::INFO::2016-06-14 12:59:49,745::mem_free::53::mem_free.MemFree::(action) memFree: 26500
Thread-481977::ERROR::2016-06-14 12:59:50,263::notifications::35::ovirt_hosted_engine_ha.broker.notifications.Notifications::(send_email) [Errno 110] Connection timed out
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/notifications.py", line 24, in send_email
    server = smtplib.SMTP(cfg["smtp-server"], port=cfg["smtp-port"])
  File "/usr/lib64/python2.7/smtplib.py", line 255, in __init__
    (code, msg) = self.connect(host, port)
  File "/usr/lib64/python2.7/smtplib.py", line 315, in connect
    self.sock = self._get_socket(host, port, self.timeout)
  File "/usr/lib64/python2.7/smtplib.py", line 290, in _get_socket
    return socket.create_connection((host, port), timeout)
  File "/usr/lib64/python2.7/socket.py", line 571, in create_connection
    raise err
error: [Errno 110] Connection timed out
Thread-481977::INFO::2016-06-14 12:59:50,264::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-481977::INFO::2016-06-14 12:59:50,264::monitor::90::ovirt_hosted_engine_ha.broker.monitor.Monitor::(stop_submonitor) Stopping submonitor ping, id 140681792007632
Thread-481977::INFO::2016-06-14 12:59:50,264::monitor::99::ovirt_hosted_engine_ha.broker.monitor.Monitor::(stop_submonitor) Stopped submonitor ping, id 140681792007632
Thread-481977::INFO::2016-06-14 12:59:50,264::monitor::90::ovirt_hosted_engine_ha.broker.monitor.Monitor::(stop_submonitor) Stopping submonitor mgmt-bridge, id 140681925896272
Thread-481977::INFO::2016-06-14 12:59:50,264::monitor::99::ovirt_hosted_engine_ha.broker.monitor.Monitor::(stop_submonitor) Stopped submonitor mgmt-bridge, id 140681925896272
Thread-481977::INFO::2016-06-14 12:59:50,264::monitor::90::ovirt_hosted_engine_ha.broker.monitor.Monitor::(stop_submonitor) Stopping submonitor mem-free, id 140681926005456
Thread-481977::INFO::2016-06-14 12:59:50,264::monitor::99::ovirt_hosted_engine_ha.broker.monitor.Monitor::(stop_submonitor) Stopped submonitor mem-free, id 140681926005456
Thread-481977::INFO::2016-06-14 12:59:50,264::monitor::90::ovirt_hosted_engine_ha.broker.monitor.Monitor::(stop_submonitor) Stopping submonitor cpu-load-no-engine, id 140681926012880
Thread-481977::INFO::2016-06-14 12:59:50,264::monitor::99::ovirt_hosted_engine_ha.broker.monitor.Monitor::(stop_submonitor) Stopped submonitor cpu-load-no-engine, id 140681926012880
Thread-481977::INFO::2016-06-14 12:59:50,264::monitor::90::ovirt_hosted_engine_ha.broker.monitor.Monitor::(stop_submonitor) Stopping submonitor engine-health, id 140681926011984
Thread-481977::INFO::2016-06-14 12:59:50,264::monitor::99::ovirt_hosted_engine_ha.broker.monitor.Monitor::(stop_submonitor) Stopped submonitor engine-health, id 140681926011984

"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/notifications.py", line 24, in send_email server = smtplib.SMTP(cfg["smtp-server"], port=cfg["smtp-port"]) File "/usr/lib64/python2.7/smtplib.py", line 255, in __init__ (code, msg) = self.connect(host, port) File "/usr/lib64/python2.7/smtplib.py", line 315, in connect self.sock = self._get_socket(host, port, self.timeout) File "/usr/lib64/python2.7/smtplib.py", line 290, in _get_socket return socket.create_connection((host, port), timeout) File "/usr/lib64/python2.7/socket.py", line 571, in create_connection raise err error: [Errno 110] Connection timed out
So you have a connection timeout here (it is trying to reach the localhost smtp server),
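The values the broker is using there can be checked directly (a sketch; the broker.conf location is an assumption based on the 3.6 defaults, but the smtp-server/smtp-port keys match the cfg[] lookups in the traceback):

grep -i smtp /etc/ovirt-hosted-engine-ha/broker.conf
# then test the configured server/port the same way smtplib does
python -c "import smtplib; smtplib.SMTP('localhost', 25); print('smtp reachable')"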
MainThread::ERROR::2016-06-15 09:24:59,874::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
and a connection timeout between the agent and the broker.
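The agent reaches the broker over a local UNIX socket, so that leg can be probed on its own (a sketch; the socket path is an assumption based on the 3.6 defaults):

python -c "import socket; s = socket.socket(socket.AF_UNIX); s.connect('/var/run/ovirt-hosted-engine-ha/broker.socket'); print('broker socket reachable')"

If that call hangs while the broker process is alive, it reproduces exactly what the agent keeps logging.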
Thread-482175::INFO::2016-06-14 12:59:30,429::storage_backends::120::ovirt_hosted_engine_ha.lib.storage_backends::(_check_symlinks) Cleaning up stale LV link '/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt36__engine/c6323975-2966-409d-b9e0-48370a513a98/ha_agent/hosted-engine.metadata'
This is also not normal; it means the storage disappeared. This seems to indicate there is some kind of issue with your network. Are you sure that your firewall allows connections over the lo interface and to the storage server?

Martin
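Both halves of that question are quick to check (a sketch; nas01 and the mount path are taken from the broker log above):

# a rule matching the loopback interface would explain purely local timeouts
iptables -S | grep -w lo
# confirm the engine storage export is still visible from the host
showmount -e nas01
ls /rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt36__engine/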
Marin -
Anything I should be looking for specifically? The only errors I see are smtp errors when it tries to send a notification but nothing indicating what the notification is / might be. I see this repeated about every minute
Thread-482115::INFO::2016-06-14 12:58:54,431::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established Thread-482109::INFO::2016-06-14 12:58:54,491::storage_backends::120::ovirt_hosted_engine_ha.lib.storage_backends::(_check_symlinks) Cleaning up stale LV link '/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt 36__engine/c6323975-2966-409d-b9e0-48370a513a98/ha_agent/hosted-engine.lockspace' Thread-482109::INFO::2016-06-14 12:58:54,515::storage_backends::120::ovirt_hosted_engine_ha.lib.storage_backends::(_check_symlinks) Cleaning up stale LV link '/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt 36__engine/c6323975-2966-409d-b9e0-48370a513a98/ha_agent/hosted-engine.metadata'
nas01 is the primary storage for the engine (as previously noted)
Thread-482175::INFO::2016-06-14 12:59:30,398::storage_backends::120::ovirt_hosted_engine_ha.lib.storage_backends::(_check_symlinks) Cleaning up stale LV link '/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt 36__engine/c6323975-2966-409d-b9e0-48370a513a98/ha_agent/hosted-engine.lockspace' Thread-482175::INFO::2016-06-14 12:59:30,429::storage_backends::120::ovirt_hosted_engine_ha.lib.storage_backends::(_check_symlinks) Cleaning up stale LV link '/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt 36__engine/c6323975-2966-409d-b9e0-48370a513a98/ha_agent/hosted-engine.metadata'
But otherwise the broker looks like its accepting and handling connections
Thread-481980::INFO::2016-06-14 12:59:33,105::mem_free::53::mem_free.MemFree::(action) memFree: 26491 Thread-482193::INFO::2016-06-14 12:59:33,977::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established Thread-482193::INFO::2016-06-14 12:59:34,033::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed Thread-482194::INFO::2016-06-14 12:59:34,034::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established Thread-482194::INFO::2016-06-14 12:59:34,035::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed Thread-482195::INFO::2016-06-14 12:59:34,035::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established Thread-482195::INFO::2016-06-14 12:59:34,036::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed Thread-482196::INFO::2016-06-14 12:59:34,037::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established Thread-482196::INFO::2016-06-14 12:59:34,037::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed Thread-482197::INFO::2016-06-14 12:59:38,544::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established Thread-482197::INFO::2016-06-14 12:59:38,598::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed Thread-482198::INFO::2016-06-14 12:59:38,598::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established Thread-482198::INFO::2016-06-14 12:59:38,599::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed Thread-482199::INFO::2016-06-14 12:59:38,600::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established Thread-482199::INFO::2016-06-14 12:59:38,600::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed Thread-482200::INFO::2016-06-14 12:59:38,601::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established Thread-482200::INFO::2016-06-14 12:59:38,602::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed Thread-482179::INFO::2016-06-14 12:59:40,339::cpu_load_no_engine::121::cpu_load_no_engine.EngineHealth::(calculate_load) System load total=0.0078, engine=0.0000, non-engine=0.0078
Thread-482178::INFO::2016-06-14 12:59:49,745::mem_free::53::mem_free.MemFree::(action) memFree: 26500 Thread-481977::ERROR::2016-06-14 12:59:50,263::notifications::35::ovirt_hosted_engine_ha.broker.notifications.Notifications::(send_email) [Errno 110] Connection timed out Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/notifications.py", line 24, in send_email server = smtplib.SMTP(cfg["smtp-server"], port=cfg["smtp-port"]) File "/usr/lib64/python2.7/smtplib.py", line 255, in __init__ (code, msg) = self.connect(host, port) File "/usr/lib64/python2.7/smtplib.py", line 315, in connect self.sock = self._get_socket(host, port, self.timeout) File "/usr/lib64/python2.7/smtplib.py", line 290, in _get_socket return socket.create_connection((host, port), timeout) File "/usr/lib64/python2.7/socket.py", line 571, in create_connection raise err error: [Errno 110] Connection timed out Thread-481977::INFO::2016-06-14 12:59:50,264::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed Thread-481977::INFO::2016-06-14 12:59:50,264::monitor::90::ovirt_hosted_engine_ha.broker.monitor.Monitor::(stop_submonitor) Stopping submonitor ping, id 140681792007632 Thread-481977::INFO::2016-06-14 12:59:50,264::monitor::99::ovirt_hosted_engine_ha.broker.monitor.Monitor::(stop_submonitor) Stopped submonitor ping, id 140681792007632 Thread-481977::INFO::2016-06-14 12:59:50,264::monitor::90::ovirt_hosted_engine_ha.broker.monitor.Monitor::(stop_submonitor) Stopping submonitor mgmt-bridge, id 140681925896272 Thread-481977::INFO::2016-06-14 12:59:50,264::monitor::99::ovirt_hosted_engine_ha.broker.monitor.Monitor::(stop_submonitor) Stopped submonitor mgmt-bridge, id 140681925896272 Thread-481977::INFO::2016-06-14 12:59:50,264::monitor::90::ovirt_hosted_engine_ha.broker.monitor.Monitor::(stop_submonitor) Stopping submonitor mem-free, id 140681926005456 Thread-481977::INFO::2016-06-14 12:59:50,264::monitor::99::ovirt_hosted_engine_ha.broker.monitor.Monitor::(stop_submonitor) Stopped submonitor mem-free, id 140681926005456 Thread-481977::INFO::2016-06-14 12:59:50,264::monitor::90::ovirt_hosted_engine_ha.broker.monitor.Monitor::(stop_submonitor) Stopping submonitor cpu-load-no-engine, id 140681926012880 Thread-481977::INFO::2016-06-14 12:59:50,264::monitor::99::ovirt_hosted_engine_ha.broker.monitor.Monitor::(stop_submonitor) Stopped submonitor cpu-load-no-engine, id 140681926012880 Thread-481977::INFO::2016-06-14 12:59:50,264::monitor::90::ovirt_hosted_engine_ha.broker.monitor.Monitor::(stop_submonitor) Stopping submonitor engine-health, id 140681926011984 Thread-481977::INFO::2016-06-14 12:59:50,264::monitor::99::ovirt_hosted_engine_ha.broker.monitor.Monitor::(stop_submonitor) Stopped submonitor engine-health, id 140681926011984
On Wed, Jun 15, 2016 at 10:04 AM, Martin Sivak <msivak@redhat.com> wrote:
Charles, check the broker log too please. It is possible that the broker process is running, but is not accepting connections for example.
Martin
On Wed, Jun 15, 2016 at 3:32 PM, Charles Kozler <charles@fixflyer.com> wrote:
Actually, broker is the only thing acting "right" between broker and agent. Broker is up when I bring the system up but agent is restarting all the time. Have a look
The 11th is when I restarted this node after doing 'reinstall' in the web UI
● ovirt-ha-broker.service - oVirt Hosted Engine High Availability Communications Broker Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service; enabled; vendor preset: disabled) Active: active (running) since Sat 2016-06-11 13:09:51 EDT; 3 days ago Main PID: 1285 (ovirt-ha-broker) CGroup: /system.slice/ovirt-ha-broker.service └─1285 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker --no-daemon
Jun 15 09:23:56 njsevcnp01 ovirt-ha-broker[1285]: INFO:mgmt_bridge.MgmtBridge:Found bridge ovirtmgmt with ports Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:mem_free.MemFree:memFree: 26408
Uptime of the process:
# ps -Aef | grep -i broker
vdsm      1285     1  2 Jun11 ?        02:27:50 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker --no-daemon
But the agent... is restarting all the time
# ps -Aef | grep -i ovirt-ha-agent
vdsm     76116     1  0 09:19 ?        00:00:01 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon
9:19 AM ET is the last restart. Even the logs say it:
[root@njsevcnp01 ovirt-hosted-engine-ha]# grep -i 'restarting agent' agent.log | wc -l
232719
And it restarts roughly every 35 seconds:

[root@njsevcnp01 ovirt-hosted-engine-ha]# tail -n 300 agent.log | grep -i 'restarting agent'
MainThread::WARNING::2016-06-15 09:23:53,029::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '6'
MainThread::WARNING::2016-06-15 09:24:28,953::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '7'
MainThread::WARNING::2016-06-15 09:25:04,879::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '8'
MainThread::WARNING::2016-06-15 09:25:40,790::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '9'
MainThread::WARNING::2016-06-15 09:26:17,136::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '0'
MainThread::WARNING::2016-06-15 09:26:53,063::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '1'
The full log of a restart looks like this, saying "connection timed out", but it's not saying *what* is timing out, so I have nothing else to really go on here:
[root@njsevcnp01 ovirt-hosted-engine-ha]# tail -n 300 agent.log | grep -i restart
MainThread::ERROR::2016-06-15 09:24:23,948::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15 09:24:28,953::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '7'
MainThread::ERROR::2016-06-15 09:24:59,874::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15 09:25:04,879::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '8'
MainThread::ERROR::2016-06-15 09:25:35,785::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15 09:25:40,790::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '9'
MainThread::ERROR::2016-06-15 09:26:12,131::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15 09:26:17,136::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '0'
MainThread::ERROR::2016-06-15 09:26:48,058::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15 09:26:53,063::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '1'
MainThread::ERROR::2016-06-15 09:27:23,969::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15 09:27:28,973::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '2'
Storage is also completely fine. No logs indicate anything "going away" or having issues. The engine has a dedicated NFS NAS device, while VM storage is on a completely separate storage cluster. Storage has a 100% dedicated backend network, with no changes being made.
On Wed, Jun 15, 2016 at 7:42 AM, Martin Sivak <msivak@redhat.com> wrote:
Jun 14 08:11:11 njsevcnp01 ovirt-ha-agent[15713]: ovirt-ha-agent ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection closed: Connection timed out
Jun 14 08:11:11 njsevcnp01.fixflyer.com ovirt-ha-agent[15713]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
Broker is broken or down. Check the status of ovirt-ha-broker service.
The other interesting thing is this log from node01. The odd thing is that it seems there is some split brain somewhere in oVirt, because this log is from node02, but it is asking the engine and it's getting back 'vm not running on this host' rather than 'stale data'. But I don't know engine internals
This is another piece that points to broker or storage issues. The agent collects local data and then publishes it to the other nodes through the broker. So it is possible for the agent to know the status of the VM locally, but not be able to publish it.
The hosted-engine command line tool then reads the same synchronization whiteboard, but it does not see anything that was not published, and so it ends up reporting stale data.
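If you want to see what was actually published, you can dump the shared metadata file directly. A rough sketch - the path below is the ha_agent one from your logs, and since the exact per-host slot layout varies between versions, this just extracts the printable text runs instead of parsing the binary layout:

# whiteboard_dump.py - print whatever host state strings were published
# to the shared hosted-engine metadata ("whiteboard") file.
import re

md = ("/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt36__engine/"
      "c6323975-2966-409d-b9e0-48370a513a98/ha_agent/hosted-engine.metadata")

with open(md, "rb") as f:
    raw = f.read()

# Published entries contain readable key=value text (host-id, engine-status, ...)
for chunk in re.findall(b"[ -~]{20,}", raw):
    print(chunk.decode("ascii"))

Each host that managed to publish should show up as readable text; a host that never published simply will not appear there.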
What is the status of the hosted engine services? systemctl status ovirt-ha-agent ovirt-ha-broker
Please check the services.
Best regards
Martin
On Tue, Jun 14, 2016 at 2:16 PM, Charles Kozler <charles@fixflyer.com> wrote:
Martin -
One thing I noticed on all of the nodes is this:
Jun 14 08:11:11 njsevcnp01 ovirt-ha-agent[15713]: ovirt-ha-agent ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection closed: Connection timed out
Jun 14 08:11:11 njsevcnp01.fixflyer.com ovirt-ha-agent[15713]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
Then the agent is restarted
[root@njsevcnp01 ~]# ps -Aef | grep -i ovirt-ha-agent | grep -iv grep
vdsm     15713     1  0 08:09 ?        00:00:01 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon
I don't know why the connection would time out; as you can see, that log is from node01, and I can't figure out what it is timing out on.
The other interesting thing is this log from node01. The odd thing is that it seems there is some split brain somewhere in oVirt, because this log is from node02, but it is asking the engine and it's getting back 'vm not running on this host' rather than 'stale data'. But I don't know engine internals
MainThread::INFO::2016-06-14 08:13:05,163::state_machine::171::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host njsevcnp02 (id 2): {hostname: njsevcnp02, host-id: 2, engine-status: {reason: vm not running on this host, health: bad, vm: down, detail: unknown}, score: 0, stopped: True, maintenance: False, crc32: 25da07df, host-ts: 3030}
MainThread::INFO::2016-06-14 08:13:05,163::state_machine::171::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host njsevcnp03 (id 3): {hostname: njsevcnp03, host-id: 3, engine-status: {reason: vm not running on this host, health: bad, vm: down, detail: unknown}, score: 0, stopped: True, maintenance: False, crc32: c67818cb, host-ts: 10877406}
And that same log on node02 where the engine is running
MainThread::INFO::2016-06-14 08:15:44,451::state_machine::171::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host njsevcnp01 (id 1): {hostname: njsevcnp01, host-id: 1, engine-status: {reason: vm not running on this host, health: bad, vm: down, detail: unknown}, score: 0, stopped: True, maintenance: False, crc32: 260dbf06, host-ts: 327}
MainThread::INFO::2016-06-14 08:15:44,451::state_machine::171::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host njsevcnp03 (id 3): {hostname: njsevcnp03, host-id: 3, engine-status: {reason: vm not running on this host, health: bad, vm: down, detail: unknown}, score: 0, stopped: True, maintenance: False, crc32: c67818cb, host-ts: 10877406}
MainThread::INFO::2016-06-14 08:15:44,451::state_machine::174::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 2): {engine-health: {health: good, vm: up, detail: up}, bridge: True, mem-free: 20702.0, maintenance: False, cpu-load: None, gateway: True}
MainThread::INFO::2016-06-14 08:15:44,452::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1465906544.45 type=state_transition detail=StartState-ReinitializeFSM hostname=njsevcnp02
On Tue, Jun 14, 2016 at 7:59 AM, Martin Sivak <msivak@redhat.com> wrote:
Hi,
is there anything interesting in the hosted engine log files? /var/log/ovirt-hosted-engine-ha/agent.log
There should be something appearing there every 10 seconds or faster.
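A crude way to verify that, using the default log location:

# agent_log_growth.py - a healthy agent appends to agent.log every few seconds
import os
import time

log = "/var/log/ovirt-hosted-engine-ha/agent.log"
before = os.stat(log).st_size
time.sleep(15)
after = os.stat(log).st_size
if after > before:
    print("agent.log grew by %d bytes in 15s" % (after - before))
else:
    print("agent.log did not grow in 15s - the agent may be stuck")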
What is the status of the hosted engine services? systemctl status ovirt-ha-agent ovirt-ha-broker
Best regards
-- Martin Sivak SLA / oVirt

Thread-482175::INFO::2016-06-14 12:59:30,429::storage_backends::120::ovirt_hosted_engine_ha.lib.storage_backends::(_check_symlinks) Cleaning up stale LV link '/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt36__engine/c6323975-2966-409d-b9e0-48370a513a98/ha_agent/hosted-engine.metadata'
This is also not normal, it means the storage disappeared.
This seems to indicate there is some kind of issue with your network... are you sure that your firewall allows connections over the lo interface and to the storage server?
Yes, very much so. The network is 10.0.16.0/24 - this is the ovirtmgmt + storage network and it is 100% isolated and dedicated, with no firewall between the oVirt nodes and the storage. There is no firewall on the local server either. Basically I have:

ovirtmgmt - bond0 in mode 2 (the default when not using LACP in oVirt, it appears) - connects to the dedicated storage switches. Nodes 1-3 are 10.0.16.5, .6, and .7 respectively.

VM NIC - bond1 - a trunk port for VLAN tagging in an active/passive bond. This is the VM network path, and it connects to two different switches.

Storage is located at 10.0.16.100 (cluster IP; storage-vip is the hostname), 10.0.16.101 (storage node 1), 10.0.16.102 (storage node 2), and 10.0.16.103 (nas01, dedicated storage for the oVirt engine, outside the clustered storage used for the other VMs).

The cluster IP 10.0.16.100 is where VM storage goes; the NAS IP 10.0.16.103 is where the oVirt engine storage is.

All paths between the oVirt nodes and the storage are 100% clear, with no failures or firewalls in between:

[root@njsevcnp01 ~]# for i in $( seq 100 103 ); do ping -c 1 10.0.16.$i | grep -i "\(rece\|time=\)"; echo "--"; done
64 bytes from 10.0.16.100: icmp_seq=1 ttl=64 time=0.071 ms
1 packets transmitted, 1 received, 0% packet loss, time 0ms
--
64 bytes from 10.0.16.101: icmp_seq=1 ttl=64 time=0.065 ms
1 packets transmitted, 1 received, 0% packet loss, time 0ms
--
64 bytes from 10.0.16.102: icmp_seq=1 ttl=64 time=0.099 ms
1 packets transmitted, 1 received, 0% packet loss, time 0ms
--
64 bytes from 10.0.16.103: icmp_seq=1 ttl=64 time=0.219 ms
1 packets transmitted, 1 received, 0% packet loss, time 0ms
--

This is the dedicated storage for the oVirt environment:

[root@njsevcnp01 ~]# df -h | grep -i rhev
nas01:/volume1/vm_os/ovirt36_engine  2.2T  295G  1.9T  14%  /rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt36__engine
storage-vip:/fast_ha-gv0             792G  125G  668G  16%  /rhev/data-center/mnt/glusterSD/storage-vip:_fast__ha-gv0
storage-vip:/slow_nonha-gv0          1.8T  212G  1.6T  12%  /rhev/data-center/mnt/glusterSD/storage-vip:_slow__nonha-gv0
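ICMP alone does not prove the storage services themselves answer, so here is a quick sketch of a TCP probe of the actual service ports as well (2049 for NFS and 24007 for glusterd are assumed defaults - adjust if yours differ):

# storage_port_probe.py - check the storage service ports, not just ICMP
import socket

targets = [("10.0.16.103", 2049),    # nas01 - engine NFS storage (assumed port)
           ("10.0.16.100", 24007)]   # storage-vip - glusterd (assumed port)

for host, port in targets:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5)
    try:
        s.connect((host, port))
        print("%s:%d reachable" % (host, port))
    except socket.error as e:
        print("%s:%d FAILED: %s" % (host, port, e))
    finally:
        s.close()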
MainThread::ERROR::2016-06-15 09:24:59,874::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
and connection timeout between agent and broker.
Everything I am providing right now is from njsevcnp01, so why would it time out between the agent and the broker on the same box? Because the broker is not accepting connections? But the broker logs show it is accepting and handling connections.

Acknowledged on the SMTP errors. At this time I am just trying to get clustering working again, because as of now I cannot live migrate the hosted engine since it appears to be a split-brain type of issue.

What do I need to do to resolve this stale-data issue and get the cluster working again / the agents and brokers talking to each other again? Should I shut down the platform, delete the lock files, and then bring it back up again?

Thanks for your help Martin!

On Wed, Jun 15, 2016 at 10:38 AM, Martin Sivak <msivak@redhat.com> wrote:
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/notifications.py",
line 24, in send_email server = smtplib.SMTP(cfg["smtp-server"], port=cfg["smtp-port"]) File "/usr/lib64/python2.7/smtplib.py", line 255, in __init__ (code, msg) = self.connect(host, port) File "/usr/lib64/python2.7/smtplib.py", line 315, in connect self.sock = self._get_socket(host, port, self.timeout) File "/usr/lib64/python2.7/smtplib.py", line 290, in _get_socket return socket.create_connection((host, port), timeout) File "/usr/lib64/python2.7/socket.py", line 571, in create_connection raise err error: [Errno 110] Connection timed out
So you have a connection timeout here (it is trying to reach the localhost SMTP server)
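You can reproduce that connection attempt outside the broker using its own configuration. This assumes the usual /etc/ovirt-hosted-engine-ha/broker.conf with an [email] section holding smtp-server and smtp-port, so adjust if your layout differs:

# smtp_probe.py - attempt the same SMTP connection the broker makes
import smtplib
try:
    from configparser import ConfigParser                       # Python 3
except ImportError:
    from ConfigParser import SafeConfigParser as ConfigParser   # Python 2

cfg = ConfigParser()
cfg.read("/etc/ovirt-hosted-engine-ha/broker.conf")   # assumed config path
server = cfg.get("email", "smtp-server")              # assumed section/keys
port = cfg.getint("email", "smtp-port")

print("Connecting to %s:%s ..." % (server, port))
smtp = smtplib.SMTP(server, port, timeout=10)         # raises on timeout/refusal
print(smtp.noop())                                    # (250, ...) means it works
smtp.quit()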
MainThread::ERROR::2016-06-15 09:24:59,874::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
and connection timeout between agent and broker.
Thread-482175::INFO::2016-06-14 12:59:30,429::storage_backends::120::ovirt_hosted_engine_ha.lib.storage_backends::(_check_symlinks) Cleaning up stale LV link '/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt36__engine/c6323975-2966-409d-b9e0-48370a513a98/ha_agent/hosted-engine.metadata'
This is also not normal, it means the storage disappeared.
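A low-impact way to check whether those links are healthy right now (the base path below is the ha_agent directory from your logs):

# ha_links_check.py - verify the hosted-engine symlinks resolve on the NFS mount
import os

base = ("/rhev/data-center/mnt/nas01:_volume1_vm__os_ovirt36__engine/"
        "c6323975-2966-409d-b9e0-48370a513a98/ha_agent")

for name in ("hosted-engine.metadata", "hosted-engine.lockspace"):
    path = os.path.join(base, name)
    if not os.path.islink(path):
        print("%s: not a symlink (unexpected)" % path)
    elif os.path.exists(path):                 # os.path.exists follows the link
        print("%s -> %s (ok)" % (path, os.path.realpath(path)))
    else:
        print("%s -> %s (BROKEN)" % (path, os.readlink(path)))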
This seems to indicate there is some kind of issue with your network... are you sure that your firewall allows connections over the lo interface and to the storage server?
Martin
participants (4)
- Alexis HAUSER
- Charles Kozler
- Martin Sivak
- Sahina Bose