[ovirt-users] High latency on storage domains and sanlock renewal error

Stefano Bovina bovy89 at gmail.com
Sun May 7 10:27:17 UTC 2017


Hi,

thanks for the advice. The upgrade is already scheduled, but I would like
to fix this issue before proceeding with a big upgrade (unless an upgrade
fixes the problem).

The problem is on all hypervisors.

We have 2 clusters (both connected to the same storage system):
 - the old one with FC
 - the new one with FCoE


Using dmesg -T and looking at /var/log/messages, we found several problems
like these:

1)
[Wed May  3 10:40:11 2017] sd 12:0:0:3: Parameters changed
[Wed May  3 10:40:11 2017] sd 12:0:1:3: Parameters changed
[Wed May  3 10:40:11 2017] sd 12:0:1:1: Parameters changed
[Wed May  3 10:40:12 2017] sd 13:0:0:1: Parameters changed
[Wed May  3 10:40:12 2017] sd 13:0:0:3: Parameters changed
[Wed May  3 10:40:12 2017] sd 13:0:1:3: Parameters changed
[Wed May  3 12:39:32 2017] device-mapper: multipath: Failing path 65:144.
[Wed May  3 12:39:37 2017] sd 13:0:1:2: alua: port group 01 state A
preferred supports tolUsNA

2)
[Wed May  3 17:08:17 2017] perf interrupt took too long (2590 > 2500),
lowering kernel.perf_event_max_sample_rate to 50000

3)
[Wed May  3 19:16:21 2017] bnx2fc: els 0x5: tgt not ready
[Wed May  3 19:16:21 2017] bnx2fc: Relogin to the tgt

4)
sd 13:0:1:0: [sdx] FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
sd 13:0:1:0: [sdx] CDB: Read(16) 88 00 00 00 00 00 00 58 08 00 00 00 04 00
00 00
blk_update_request: I/O error, dev sdx, sector 5769216
device-mapper: multipath: Failing path 65:112.
sd 13:0:1:0: alua: port group 01 state A preferred supports tolUsNA

5)
multipathd: 360060160a6213400cce46e40949de411: sdaa - emc_clariion_checker:
Read error for WWN 60060160a6213400cce46e40949de411.  Sense data are
0x0/0x0/0x0.
multipathd: checker failed path 65:160 in map
360060160a6213400cce46e40949de411
multipathd: 360060160a6213400cce46e40949de411: remaining active paths: 3
kernel: device-mapper: multipath: Failing path 65:160.
multipathd: 360060160a6213400cce46e40949de411: sdaa - emc_clariion_checker:
Active path is healthy.
multipathd: 65:160: reinstated
multipathd: 360060160a6213400cce46e40949de411: remaining active paths: 4

6)
[Sat May  6 11:37:07 2017] megaraid_sas 0000:02:00.0: Firmware crash dump
is not available

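As a cross-check, we can reproduce by hand the read that sanlock times out
on. This is a rough sketch using the ids volume path from the sanlock log
quoted below; sanlock reads 1 MiB with direct I/O and its io timeout (ioto)
there is 10 seconds, so a read taking anywhere near that long matches the
renewal error -202:

# time the same 1 MiB direct read sanlock performs on the lease volume
time dd if=/dev/6a386652-629d-4045-835b-21d2f5c104aa/ids \
        of=/dev/null bs=1M count=1 iflag=direct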

The multipath configuration is the following (recommended by EMC):

# RHEV REVISION 1.1
# RHEV PRIVATE

devices {
    device {
        vendor                  "DGC"
        product                 ".*"
        product_blacklist       "LUNZ"
        path_grouping_policy    group_by_prio
        path_selector           "round-robin 0"
        path_checker            emc_clariion
        features                "1 queue_if_no_path"
        hardware_handler        "1 alua"
        prio                    alua
        failback                immediate
        rr_weight               uniform
        no_path_retry           60
        rr_min_io               1
    }
}
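
For reference, we can also check which configuration multipathd actually
loaded and the current path states with the standard multipath-tools
commands (the WWID is the one from the multipathd messages in point 5):

# show the merged configuration multipathd is really using
multipathd show config
# show path groups and path states for the affected LUN
multipath -ll 360060160a6213400cce46e40949de411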

Regards,

Stefano


2017-05-07 8:36 GMT+02:00 Yaniv Kaul <ykaul at redhat.com>:

>
>
> On Tue, May 2, 2017 at 11:09 PM, Stefano Bovina <bovy89 at gmail.com> wrote:
>
>> Hi, the engine logs show high latency on storage domains: "Storage domain
>> <xxxx> experienced a high latency of 19.2814 seconds from ...... This may
>> cause performance and functional issues."
>>
>> Looking at host logs, I found also these locking errors:
>>
>> 2017-05-02 20:52:13+0200 33883 [10098]: s1 renewal error -202
>> delta_length 10 last_success 33853
>> 2017-05-02 20:52:19+0200 33889 [10098]: 6a386652 aio collect 0
>> 0x7f1fb80008c0:0x7f1fb80008d0:0x7f1fbe9fb000 result 1048576:0 other free
>> 2017-05-02 21:08:51+0200 34880 [10098]: 6a386652 aio timeout 0
>> 0x7f1fb80008c0:0x7f1fb80008d0:0x7f1fbe4f2000 ioto 10 to_count 24
>> 2017-05-02 21:08:51+0200 34880 [10098]: s1 delta_renew read rv -202
>> offset 0 /dev/6a386652-629d-4045-835b-21d2f5c104aa/ids
>> 2017-05-02 21:08:51+0200 34880 [10098]: s1 renewal error -202
>> delta_length 10 last_success 34850
>> 2017-05-02 21:08:53+0200 34883 [10098]: 6a386652 aio collect 0
>> 0x7f1fb80008c0:0x7f1fb80008d0:0x7f1fbe4f2000 result 1048576:0 other free
>> 2017-05-02 21:30:40+0200 36189 [10098]: 6a386652 aio timeout 0
>> 0x7f1fb80008c0:0x7f1fb80008d0:0x7f1fbe9fb000 ioto 10 to_count 25
>> 2017-05-02 21:30:40+0200 36189 [10098]: s1 delta_renew read rv -202
>> offset 0 /dev/6a386652-629d-4045-835b-21d2f5c104aa/ids
>> 2017-05-02 21:30:40+0200 36189 [10098]: s1 renewal error -202
>> delta_length 10 last_success 36159
>> 2017-05-02 21:30:45+0200 36195 [10098]: 6a386652 aio collect 0
>> 0x7f1fb80008c0:0x7f1fb80008d0:0x7f1fbe9fb000 result 1048576:0 other free
>>
>> and these vdsm errors too:
>>
>> Thread-22::ERROR::2017-05-02 21:53:48,147::sdc::137::Storag
>> e.StorageDomainCache::(_findDomain) looking for unfetched domain
>> f8f21d6c-2425-45c4-aded-4cb9b53ebd96
>> Thread-22::ERROR::2017-05-02 21:53:48,148::sdc::154::Storag
>> e.StorageDomainCache::(_findUnfetchedDomain) looking for domain
>> f8f21d6c-2425-45c4-aded-4cb9b53ebd96
>>
>> The engine instead is showing these errors:
>>
>> 2017-05-02 21:40:38,089 ERROR [org.ovirt.engine.core.vdsbrok
>> er.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-96)
>> Command SpmStatusVDSCommand(HostName = <myhost.example.com>, HostId =
>> dcc0275a-b011-4e33-bb95-366ffb0697b3, storagePoolId =
>> 715d1ba2-eabe-48db-9aea-c28c30359808) execution failed. Exception:
>> VDSErrorException: VDSGenericException: VDSErrorException: Failed to
>> SpmStatusVDS, error = (-202, 'Sanlock resource read failure', 'Sanlock
>> exception'), code = 100
>> 2017-05-02 21:41:08,431 ERROR [org.ovirt.engine.core.vdsbrok
>> er.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-53)
>> [6e0d5ebf] Failed in SpmStatusVDS method
>> 2017-05-02 21:41:08,443 ERROR [org.ovirt.engine.core.vdsbrok
>> er.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-53)
>> [6e0d5ebf] Command SpmStatusVDSCommand(HostName = <myhost.example.com>,
>> HostId = 7991933e-5f30-48cd-88bf-b0b525613384, storagePoolId =
>> 4bd73239-22d0-4c44-ab8c-17adcd580309) execution failed. Exception:
>> VDSErrorException: VDSGenericException: VDSErrorException: Failed to
>> SpmStatusVDS, error = (-202, 'Sanlock resource read failure', 'Sanlock
>> exception'), code = 100
>> 2017-05-02 21:41:31,975 ERROR [org.ovirt.engine.core.vdsbrok
>> er.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-61)
>> [2a54a1b2] Failed in SpmStatusVDS method
>> 2017-05-02 21:41:31,987 ERROR [org.ovirt.engine.core.vdsbrok
>> er.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-61)
>> [2a54a1b2] Command SpmStatusVDSCommand(HostName = <myhost.example.com>,
>> HostId = dcc0275a-b011-4e33-bb95-366ffb0697b3, storagePoolId =
>> 715d1ba2-eabe-48db-9aea-c28c30359808) execution failed. Exception:
>> VDSErrorException: VDSGenericException: VDSErrorException: Failed to
>> SpmStatusVDS, error = (-202, 'Sanlock resource read failure', 'Sanlock
>> exception'), code = 100
>>
>>
>> I'm using Fibre Channel or FCoE connectivity; the storage array technical
>> support has analyzed it (along with the switch and OS configurations), but
>> nothing has been found.
>>
>
> Is this on a specific host, or on multiple hosts?
> Is that FC or FCoE? Anything on the host's /var/log/messages?
>
>
>>
>> Any advice?
>>
>> Thanks
>>
>>
>> Installation info:
>>
>> ovirt-release35-006-1.noarch
>>
>
> This is a very old release; regardless of this issue, I suggest upgrading.
> Y.
>
>
>> libgovirt-0.3.3-1.el7_2.1.x86_64
>> vdsm-4.16.30-0.el7.centos.x86_64
>> vdsm-xmlrpc-4.16.30-0.el7.centos.noarch
>> vdsm-yajsonrpc-4.16.30-0.el7.centos.noarch
>> vdsm-jsonrpc-4.16.30-0.el7.centos.noarch
>> vdsm-python-zombiereaper-4.16.30-0.el7.centos.noarch
>> vdsm-python-4.16.30-0.el7.centos.noarch
>> vdsm-cli-4.16.30-0.el7.centos.noarch
>> qemu-kvm-ev-2.3.0-29.1.el7.x86_64
>> qemu-kvm-common-ev-2.3.0-29.1.el7.x86_64
>> qemu-kvm-tools-ev-2.3.0-29.1.el7.x86_64
>> libvirt-client-1.2.17-13.el7_2.3.x86_64
>> libvirt-daemon-driver-storage-1.2.17-13.el7_2.3.x86_64
>> libvirt-python-1.2.17-2.el7.x86_64
>> libvirt-daemon-driver-nwfilter-1.2.17-13.el7_2.3.x86_64
>> libvirt-daemon-driver-nodedev-1.2.17-13.el7_2.3.x86_64
>> libvirt-lock-sanlock-1.2.17-13.el7_2.3.x86_64
>> libvirt-glib-0.1.9-1.el7.x86_64
>> libvirt-daemon-driver-network-1.2.17-13.el7_2.3.x86_64
>> libvirt-daemon-driver-lxc-1.2.17-13.el7_2.3.x86_64
>> libvirt-daemon-driver-interface-1.2.17-13.el7_2.3.x86_64
>> libvirt-1.2.17-13.el7_2.3.x86_64
>> libvirt-daemon-1.2.17-13.el7_2.3.x86_64
>> libvirt-daemon-config-network-1.2.17-13.el7_2.3.x86_64
>> libvirt-daemon-driver-secret-1.2.17-13.el7_2.3.x86_64
>> libvirt-daemon-config-nwfilter-1.2.17-13.el7_2.3.x86_64
>> libvirt-daemon-kvm-1.2.17-13.el7_2.3.x86_64
>> libvirt-daemon-driver-qemu-1.2.17-13.el7_2.3.x86_64
>>
>>
>>
>> ------- vdsm.log (high latency)
>>
>> Thread-21::DEBUG::2017-05-02 18:02:23,646::fileSD::261::Sto
>> rage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1
>> records out\n331 bytes (331 B) copied, 0.000285529 s, 1.2 MB/s\n'; <rc> = 0
>> Thread-18::DEBUG::2017-05-02 18:02:24,335::blockSD::596::St
>> orage.Misc.excCmd::(getReadDelay) /bin/dd if=/dev/2c501858-bf8d-49a5-a42b-bca341b47827/metadata
>> iflag=direct of=/dev/null bs=4096 count=1 (cwd None)
>> Thread-18::DEBUG::2017-05-02 18:02:24,343::blockSD::596::St
>> orage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '1+0 records in\n1+0
>> records out\n4096 bytes (4.1 kB) copied, 0.000489782 s, 8.4 MB/s\n'; <rc> =
>> 0
>> Thread-19::DEBUG::2017-05-02 18:02:24,635::blockSD::596::St
>> orage.Misc.excCmd::(getReadDelay) /bin/dd if=/dev/c5dc09c4-cd79-42d2-a5af-89953076f79e/metadata
>> iflag=direct of=/dev/null bs=4096 count=1 (cwd None)
>> Thread-19::DEBUG::2017-05-02 18:02:24,643::blockSD::596::St
>> orage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '1+0 records in\n1+0
>> records out\n4096 bytes (4.1 kB) copied, 0.000558622 s, 7.3 MB/s\n'; <rc> =
>> 0
>> Thread-22::DEBUG::2017-05-02 18:02:24,722::blockSD::596::St
>> orage.Misc.excCmd::(getReadDelay) /bin/dd if=/dev/f8f21d6c-2425-45c4-aded-4cb9b53ebd96/metadata
>> iflag=direct of=/dev/null bs=4096 count=1 (cwd None)
>> Thread-22::DEBUG::2017-05-02 18:02:24,729::blockSD::596::St
>> orage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '1+0 records in\n1+0
>> records out\n4096 bytes (4.1 kB) copied, 0.000632576 s, 6.5 MB/s\n'; <rc> =
>> 0
>> JsonRpc (StompReactor)::DEBUG::2017-05-02 18:02:25,056::stompReactor::98
>> ::Broker.StompAdapter::(handle_frame) Handling message <StompFrame
>> command='SEND'>
>> JsonRpcServer::DEBUG::2017-05-02 18:02:25,057::__init__::506::j
>> sonrpc.JsonRpcServer::(serve_requests) Waiting for request
>> Thread-9372::DEBUG::2017-05-02 18:02:25,060::stompReactor::16
>> 3::yajsonrpc.StompServer::(send) Sending response
>> Thread-20::DEBUG::2017-05-02 18:02:25,895::blockSD::596::St
>> orage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '1+0 records in\n1+0
>> records out\n4096 bytes (4.1 kB) copied, 19.2814 s, 0.2 kB/s\n'; <rc> = 0
>> JsonRpc (StompReactor)::DEBUG::2017-05-02 18:02:28,075::stompReactor::98
>> ::Broker.StompAdapter::(handle_frame) Handling message <StompFrame
>> command='SEND'>
>> JsonRpcServer::DEBUG::2017-05-02 18:02:28,076::__init__::506::j
>> sonrpc.JsonRpcServer::(serve_requests) Waiting for request
>> Thread-9373::DEBUG::2017-05-02 18:02:28,078::stompReactor::16
>> 3::yajsonrpc.StompServer::(send) Sending response
>> JsonRpc (StompReactor)::DEBUG::2017-05-02 18:02:31,094::stompReactor::98
>> ::Broker.StompAdapter::(handle_frame) Handling message <StompFrame
>> command='SEND'>
>> JsonRpcServer::DEBUG::2017-05-02 18:02:31,095::__init__::506::j
>> sonrpc.JsonRpcServer::(serve_requests) Waiting for request
>> Thread-9374::DEBUG::2017-05-02 18:02:31,097::stompReactor::16
>> 3::yajsonrpc.StompServer::(send) Sending response
>> Thread-21::DEBUG::2017-05-02 18:02:33,652::fileSD::261::Sto
>> rage.Misc.excCmd::(getReadDelay) /bin/dd if=/rhev/data-center/mnt/<myho
>> st.example.com>:_ISO/f105bdc6-efdc-445c-b49e-aa38c91c2569/dom_md/metadata
>> iflag=direct of=/dev/null bs=4096 count=1 (cwd None)
>> Thread-21::DEBUG::2017-05-02 18:02:33,660::fileSD::261::Sto
>> rage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1
>> records out\n331 bytes (331 B) copied, 0.000246594 s, 1.3 MB/s\n'; <rc> = 0
>> JsonRpc (StompReactor)::DEBUG::2017-05-02 18:02:34,112::stompReactor::98
>> ::Broker.StompAdapter::(handle_frame) Handling message <StompFrame
>> command='SEND'>
>> JsonRpcServer::DEBUG::2017-05-02 18:02:34,114::__init__::506::j
>> sonrpc.JsonRpcServer::(serve_requests) Waiting for request
>> Thread-9375::DEBUG::2017-05-02 18:02:34,117::task::595::Stora
>> ge.TaskManager.Task::(_updateState) Task=`ea63814e-d25d-4436-8bb5-cfc67af6498d`::moving
>> from state init -> state preparing
>> Thread-9375::INFO::2017-05-02 18:02:34,117::logUtils::44::dispatcher::(wrapper)
>> Run and protect: repoStats(options=None)
>> Thread-9375::INFO::2017-05-02 18:02:34,118::logUtils::47::dispatcher::(wrapper)
>> Run and protect: repoStats, Return response: {'2c501858-bf8d-49a5-a42b-bca341b47827':
>> {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000489782',
>> 'lastCheck': '9.8', 'valid': True}, 'f8f21d6c-2425-45c4-aded-4cb9b53ebd96':
>> {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000632576',
>> 'lastCheck': '9.4', 'valid': True}, '6a386652-629d-4045-835b-21d2f5c104aa':
>> {'code': 0, 'version': 3, 'acquired': True, 'delay': '19.2814',
>> 'lastCheck': '8.2', 'valid': True}, 'f105bdc6-efdc-445c-b49e-aa38c91c2569':
>> {'code': 0, 'version': 0, 'acquired': True, 'delay': '0.000246594',
>> 'lastCheck': '0.5', 'valid': True}, 'c5dc09c4-cd79-42d2-a5af-89953076f79e':
>> {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000558622',
>> 'lastCheck': '9.5', 'valid': True}}
>> Thread-9375::DEBUG::2017-05-02 18:02:34,118::task::1191::Stor
>> age.TaskManager.Task::(prepare) Task=`ea63814e-d25d-4436-8bb5-cfc67af6498d`::finished:
>> {'2c501858-bf8d-49a5-a42b-bca341b47827': {'code': 0, 'version': 3,
>> 'acquired': True, 'delay': '0.000489782', 'lastCheck': '9.8', 'valid':
>> True}, 'f8f21d6c-2425-45c4-aded-4cb9b53ebd96': {'code': 0, 'version': 3,
>> 'acquired': True, 'delay': '0.000632576', 'lastCheck': '9.4', 'valid':
>> True}, '6a386652-629d-4045-835b-21d2f5c104aa': {'code': 0, 'version': 3,
>> 'acquired': True, 'delay': '19.2814', 'lastCheck': '8.2', 'valid': True},
>> 'f105bdc6-efdc-445c-b49e-aa38c91c2569': {'code': 0, 'version': 0,
>> 'acquired': True, 'delay': '0.000246594', 'lastCheck': '0.5', 'valid':
>> True}, 'c5dc09c4-cd79-42d2-a5af-89953076f79e': {'code': 0, 'version': 3,
>> 'acquired': True, 'delay': '0.000558622', 'lastCheck': '9.5', 'valid':
>> True}}
>> Thread-9375::DEBUG::2017-05-02 18:02:34,119::task::595::Stora
>> ge.TaskManager.Task::(_updateState) Task=`ea63814e-d25d-4436-8bb5-cfc67af6498d`::moving
>> from state preparing -> state finished
>> Thread-9375::DEBUG::2017-05-02 18:02:34,119::resourceManager:
>> :940::Storage.ResourceManager.Owner::(releaseAll) Owner.releaseAll
>> requests {} resources {}
>> Thread-9375::DEBUG::2017-05-02 18:02:34,119::resourceManager:
>> :977::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll
>> requests {}
>> Thread-9375::DEBUG::2017-05-02 18:02:34,119::task::993::Storage.TaskManager.Task::(_decref)
>> Task=`ea63814e-d25d-4436-8bb5-cfc67af6498d`::ref 0 aborting False
>> Thread-9375::DEBUG::2017-05-02 18:02:34,122::stompReactor::16
>> 3::yajsonrpc.StompServer::(send) Sending response
>> JsonRpc (StompReactor)::DEBUG::2017-05-02 18:02:34,128::stompReactor::98
>> ::Broker.StompAdapter::(handle_frame) Handling message <StompFrame
>> command='SEND'>
>> JsonRpcServer::DEBUG::2017-05-02 18:02:34,129::__init__::506::j
>> sonrpc.JsonRpcServer::(serve_requests) Waiting for request
>> Thread-9376::DEBUG::2017-05-02 18:02:34,132::stompReactor::16
>> 3::yajsonrpc.StompServer::(send) Sending response
>>
>> _______________________________________________
>> Users mailing list
>> Users at ovirt.org
>> http://lists.ovirt.org/mailman/listinfo/users
>>
>>
>