[ovirt-users] VDSM Command failed: Heartbeat Exceeded

Neil nwilson123 at gmail.com
Tue Aug 1 04:49:18 UTC 2017


Hi guys,

Sorry to repost but I'm rather desperate here.

Thanks

Regards.

Neil Wilson.

On 31 Jul 2017 16:51, "Neil" <nwilson123 at gmail.com> wrote:

> Hi guys,
>
> Please could someone assist me? My DC seems to be trying to re-negotiate
> SPM, and apparently it's failing. I tried to delete an old autogenerated
> snapshot, and shortly after that the issue seemed to start; however, after
> about an hour the snapshot reported as successfully deleted, and SPM
> negotiated again, albeit for a short period before it started trying to
> re-negotiate again.
>
> Last week I upgraded from oVirt 3.5 to 3.6. I also upgraded one of my 4
> hosts to the latest packages available from the 3.6 repo and ran a yum
> update on it too.
>
> I have 4 nodes, and my oVirt engine is a KVM guest on another physical
> machine on the network. I'm using an FC SAN with ATTO HBAs, and recently
> we've started seeing some degraded IO. The SAN appears to be alright and
> the disks all seem to check out, but we are having rather slow IOPS at the
> moment, which we're trying to track down.
>
> ovirt engine CentOS release 6.9 (Final)
> ebay-cors-filter-1.0.1-0.1.ovirt.el6.noarch
> ovirt-engine-3.6.7.5-1.el6.noarch
> ovirt-engine-backend-3.6.7.5-1.el6.noarch
> ovirt-engine-cli-3.6.2.0-1.el6.noarch
> ovirt-engine-dbscripts-3.6.7.5-1.el6.noarch
> ovirt-engine-extension-aaa-jdbc-1.0.7-1.el6.noarch
> ovirt-engine-extensions-api-impl-3.6.7.5-1.el6.noarch
> ovirt-engine-jboss-as-7.1.1-1.el6.x86_64
> ovirt-engine-lib-3.6.7.5-1.el6.noarch
> ovirt-engine-restapi-3.6.7.5-1.el6.noarch
> ovirt-engine-sdk-python-3.6.7.0-1.el6.noarch
> ovirt-engine-setup-3.6.7.5-1.el6.noarch
> ovirt-engine-setup-base-3.6.7.5-1.el6.noarch
> ovirt-engine-setup-plugin-ovirt-engine-3.6.7.5-1.el6.noarch
> ovirt-engine-setup-plugin-ovirt-engine-common-3.6.7.5-1.el6.noarch
> ovirt-engine-setup-plugin-vmconsole-proxy-helper-3.6.7.5-1.el6.noarch
> ovirt-engine-setup-plugin-websocket-proxy-3.6.7.5-1.el6.noarch
> ovirt-engine-tools-3.6.7.5-1.el6.noarch
> ovirt-engine-tools-backup-3.6.7.5-1.el6.noarch
> ovirt-engine-userportal-3.6.7.5-1.el6.noarch
> ovirt-engine-vmconsole-proxy-helper-3.6.7.5-1.el6.noarch
> ovirt-engine-webadmin-portal-3.6.7.5-1.el6.noarch
> ovirt-engine-websocket-proxy-3.6.7.5-1.el6.noarch
> ovirt-engine-wildfly-8.2.1-1.el6.x86_64
> ovirt-engine-wildfly-overlay-8.0.5-1.el6.noarch
> ovirt-host-deploy-1.4.1-1.el6.noarch
> ovirt-host-deploy-java-1.4.1-1.el6.noarch
> ovirt-image-uploader-3.6.0-1.el6.noarch
> ovirt-iso-uploader-3.6.0-1.el6.noarch
> ovirt-release34-1.0.3-1.noarch
> ovirt-release35-006-1.noarch
> ovirt-release36-3.6.7-1.noarch
> ovirt-setup-lib-1.0.1-1.el6.noarch
> ovirt-vmconsole-1.0.2-1.el6.noarch
> ovirt-vmconsole-proxy-1.0.2-1.el6.noarch
>
> node01 (CentOS 6.9)
> vdsm-4.16.30-0.el6.x86_64
> vdsm-cli-4.16.30-0.el6.noarch
> vdsm-jsonrpc-4.16.30-0.el6.noarch
> vdsm-python-4.16.30-0.el6.noarch
> vdsm-python-zombiereaper-4.16.30-0.el6.noarch
> vdsm-xmlrpc-4.16.30-0.el6.noarch
> vdsm-yajsonrpc-4.16.30-0.el6.noarch
> gpxe-roms-qemu-0.9.7-6.16.el6.noarch
> qemu-img-rhev-0.12.1.2-2.479.el6_7.2.x86_64
> qemu-kvm-rhev-0.12.1.2-2.479.el6_7.2.x86_64
> qemu-kvm-rhev-tools-0.12.1.2-2.479.el6_7.2.x86_64
> libvirt-0.10.2-62.el6.x86_64
> libvirt-client-0.10.2-62.el6.x86_64
> libvirt-lock-sanlock-0.10.2-62.el6.x86_64
> libvirt-python-0.10.2-62.el6.x86_64
> node01 was upgraded out of desperation after I tried changing my DC and
> cluster version to 3.6, but then found that none of my hosts could be
> activated out of maintenance due to an incompatibility with 3.6 (I'm still
> not sure why, as searching seemed to indicate CentOS 6.x was compatible). I
> then had to remove all 4 hosts, change the cluster version back to 3.5,
> and re-add them. When I tried changing the cluster version to 3.6 I
> did get a complaint about using the "legacy protocol", so on each host,
> under Advanced, I changed them to use the JSON protocol, and this seemed to
> resolve it; however, once I changed the DC/cluster back to 3.5, the option
> to change the protocol back to Legacy was no longer shown.
>
> node02 (CentOS 6.7)
> vdsm-4.16.30-0.el6.x86_64
> vdsm-cli-4.16.30-0.el6.noarch
> vdsm-jsonrpc-4.16.30-0.el6.noarch
> vdsm-python-4.16.30-0.el6.noarch
> vdsm-python-zombiereaper-4.16.30-0.el6.noarch
> vdsm-xmlrpc-4.16.30-0.el6.noarch
> vdsm-yajsonrpc-4.16.30-0.el6.noarch
> gpxe-roms-qemu-0.9.7-6.14.el6.noarch
> qemu-img-rhev-0.12.1.2-2.479.el6_7.2.x86_64
> qemu-kvm-rhev-0.12.1.2-2.479.el6_7.2.x86_64
> qemu-kvm-rhev-tools-0.12.1.2-2.479.el6_7.2.x86_64
> libvirt-0.10.2-54.el6_7.6.x86_64
> libvirt-client-0.10.2-54.el6_7.6.x86_64
> libvirt-lock-sanlock-0.10.2-54.el6_7.6.x86_64
> libvirt-python-0.10.2-54.el6_7.6.x86_64
>
> node03 (CentOS 6.7)
> vdsm-4.16.30-0.el6.x86_64
> vdsm-cli-4.16.30-0.el6.noarch
> vdsm-jsonrpc-4.16.30-0.el6.noarch
> vdsm-python-4.16.30-0.el6.noarch
> vdsm-python-zombiereaper-4.16.30-0.el6.noarch
> vdsm-xmlrpc-4.16.30-0.el6.noarch
> vdsm-yajsonrpc-4.16.30-0.el6.noarch
> gpxe-roms-qemu-0.9.7-6.14.el6.noarch
> qemu-img-rhev-0.12.1.2-2.479.el6_7.2.x86_64
> qemu-kvm-rhev-0.12.1.2-2.479.el6_7.2.x86_64
> qemu-kvm-rhev-tools-0.12.1.2-2.479.el6_7.2.x86_64
> libvirt-0.10.2-54.el6_7.6.x86_64
> libvirt-client-0.10.2-54.el6_7.6.x86_64
> libvirt-lock-sanlock-0.10.2-54.el6_7.6.x86_64
> libvirt-python-0.10.2-54.el6_7.6.x86_64
>
> node04 (CentOS 6.7)
> vdsm-4.16.20-1.git3a90f62.el6.x86_64
> vdsm-cli-4.16.20-1.git3a90f62.el6.noarch
> vdsm-jsonrpc-4.16.20-1.git3a90f62.el6.noarch
> vdsm-python-4.16.20-1.git3a90f62.el6.noarch
> vdsm-python-zombiereaper-4.16.20-1.git3a90f62.el6.noarch
> vdsm-xmlrpc-4.16.20-1.git3a90f62.el6.noarch
> vdsm-yajsonrpc-4.16.20-1.git3a90f62.el6.noarch
> gpxe-roms-qemu-0.9.7-6.15.el6.noarch
> qemu-img-0.12.1.2-2.491.el6_8.1.x86_64
> qemu-kvm-0.12.1.2-2.491.el6_8.1.x86_64
> qemu-kvm-tools-0.12.1.2-2.503.el6_9.3.x86_64
> libvirt-0.10.2-60.el6.x86_64
> libvirt-client-0.10.2-60.el6.x86_64
> libvirt-lock-sanlock-0.10.2-60.el6.x86_64
> libvirt-python-0.10.2-60.el6.x86_64
>
> I'm seeing a rather confusing error in /var/log/messages on all 4
> hosts, as follows:
>
> Jul 31 16:41:36 node01 multipathd: 36001b4d80001c80d0000000000000000: sdb
> - directio checker reports path is down
> Jul 31 16:41:41 node01 kernel: sd 7:0:0:0: [sdb]  Result:
> hostbyte=DID_ERROR driverbyte=DRIVER_OK
> Jul 31 16:41:41 node01 kernel: sd 7:0:0:0: [sdb] CDB: Read(10): 28 00 00
> 00 00 00 00 00 01 00
> Jul 31 16:41:41 node01 kernel: end_request: I/O error, dev sdb, sector 0
>
> I say confusing because I don't have a 3000GB LUN:
>
> [root at node01 ~]# fdisk -l | grep 3000
> Disk /dev/sdb: 3000.0 GB, 2999999528960 bytes
>
> I did have one last Friday, but I trashed it and changed it to a
> 1500GB LUN instead, so I'm not sure whether this error is from something
> still trying to connect to the old LUN?
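> In case it's useful, this is roughly how I'd check whether multipath and the
> kernel are still holding on to the old 3000GB map (the WWID is the one from
> the multipathd log line above, and the flush/delete steps are only safe once
> nothing is using the device):

```shell
# Show the current multipath topology; a stale 3000GB map should show
# failed/faulty paths if the LUN no longer exists on the array.
multipath -ll

# Flush the stale map by its WWID (taken from the multipathd log line).
multipath -f 36001b4d80001c80d0000000000000000

# Remove the stale SCSI block device (sdb in the log) so the kernel
# stops retrying reads against a LUN that was deleted on the array.
echo 1 > /sys/block/sdb/device/delete

# Rescan the HBA to pick up the current LUN layout
# (host7 matches "sd 7:0:0:0" in the log; yours may differ).
echo "- - -" > /sys/class/scsi_host/host7/scan
```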
>
> My LUNS are as follows...
>
> Disk /dev/sdb: 3000.0 GB, 2999999528960 bytes (this one doesn't actually
> exist anymore)
> Disk /dev/sdc: 1000.0 GB, 999999668224 bytes
> Disk /dev/sdd: 1000.0 GB, 999999668224 bytes
> Disk /dev/sde: 1000.0 GB, 999999668224 bytes
> Disk /dev/sdf: 1000.0 GB, 999999668224 bytes
> Disk /dev/sdg: 1000.0 GB, 999999668224 bytes
> Disk /dev/sdh: 1000.0 GB, 999999668224 bytes
> Disk /dev/sdi: 1000.0 GB, 999999668224 bytes
> Disk /dev/sdj: 1000.0 GB, 999999668224 bytes
> Disk /dev/sdk: 1000.0 GB, 999999668224 bytes
> Disk /dev/sdm: 1000.0 GB, 999999668224 bytes
> Disk /dev/sdl: 1000.0 GB, 999999668224 bytes
> Disk /dev/sdn: 1000.0 GB, 999999668224 bytes
> Disk /dev/sdo: 1000.0 GB, 999999668224 bytes
> Disk /dev/sdp: 1000.0 GB, 999999668224 bytes
> Disk /dev/sdq: 1000.0 GB, 999999668224 bytes
> Disk /dev/sdr: 1000.0 GB, 999988133888 bytes
> Disk /dev/sds: 1500.0 GB, 1499999764480 bytes
> Disk /dev/sdt: 1500.0 GB, 1499999502336 bytes
>
> I'm quite low on SAN disk space currently, so I'm a little hesitant to
> migrate VMs around for fear of the migrations creating too many snapshots
> and filling up my SAN. We are in the process of expanding the SAN array
> too, but we're trying to get to the bottom of the bad IOPS before adding
> any additional overhead.
>
> Ping tests between hosts and engine all look alright, so I don't suspect
> network issues.
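> For what it's worth, this is how I've been watching the SPM state from a
> host, assuming the vdsClient shipped with vdsm 4.16 (the pool UUID below is
> a placeholder for your actual storage pool UUID):

```shell
# Ask VDSM which host currently holds SPM for the pool.
vdsClient -s 0 getSpmStatus <pool-uuid>

# List any outstanding storage tasks that might block SPM election.
vdsClient -s 0 getAllTasksStatuses
```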
>
> I know this is very vague, everything is currently operational, however as
> you can see in the attached logs, I'm getting lots of ERROR messages.
>
> Any help or guidance is greatly appreciated.
>
> Thanks.
>
> Regards.
>
> Neil Wilson.
>
>
>

