On Thu, Jan 12, 2017 at 12:02 PM, Mark Greenall <m.greenall(a)iontrading.com>
wrote:
Firstly, thanks @Yaniv and thanks @Nir for your responses.
@Yaniv, in answer to this:
>> Why do you have 1 SD per VM?
It's a combination of performance and ease of management. We ran some IO
tests with various configurations and settled on this one as a balance
between reduced IO contention and ease of management. If there is a better
recommended way of handling these, then I'm all ears. If you believe that
having a large number of storage domains adds to the problem then we can
also review the setup.
I don't see how it can improve performance. Having several iSCSI
connections to a (single!) target may help, but certainly not by much.
Just from looking at your /var/log/messages:
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection1:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-37a238a33-4e21185c70857594-uk1-amd-cluster2-template-dstore01,
portal: 10.100.214.77,3260] through [iface: bond1.10] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection2:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-37a238a33-4e21185c70857594-uk1-amd-cluster2-template-dstore01,
portal: 10.100.214.77,3260] through [iface: default] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection3:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-192238a33-1f71185c70b57598-cuuk1ionhurap02-dstore01,
portal: 10.100.214.77,3260] through [iface: bond1.10] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection4:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-192238a33-1f71185c70b57598-cuuk1ionhurap02-dstore01,
portal: 10.100.214.77,3260] through [iface: default] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection5:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-223238a33-7301185c70e57598-cuuk1ionhurdb02-dstore01,
portal: 10.100.214.77,3260] through [iface: bond1.10] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection6:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-223238a33-7301185c70e57598-cuuk1ionhurdb02-dstore01,
portal: 10.100.214.77,3260] through [iface: default] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection7:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-212238a33-2a61185c719576bd-lnd-ion-anv-test-lin-64-dstore01,
portal: 10.100.214.77,3260] through [iface: bond1.10] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection8:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-212238a33-2a61185c719576bd-lnd-ion-anv-test-lin-64-dstore01,
portal: 10.100.214.77,3260] through [iface: default] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection9:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-ad4238a33-1b31185c75157c7e-lnd-ion-lindev-14-dstore01,
portal: 10.100.214.77,3260] through [iface: bond1.10] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection10:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-ad4238a33-1b31185c75157c7e-lnd-ion-lindev-14-dstore01,
portal: 10.100.214.77,3260] through [iface: default] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection11:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-b99479033-9a788b6aa6857d3b-lnd-anv-sup-03-dstore01,
portal: 10.100.214.77,3260] through [iface: bond1.10] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection12:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-b99479033-9a788b6aa6857d3b-lnd-anv-sup-03-dstore01,
portal: 10.100.214.77,3260] through [iface: default] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection13:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-cd9479033-ffc88b6aa6b57d3b-lnd-linsup-02-dstore01,
portal: 10.100.214.77,3260] through [iface: bond1.10] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection14:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-cd9479033-ffc88b6aa6b57d3b-lnd-linsup-02-dstore01,
portal: 10.100.214.77,3260] through [iface: default] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection15:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-db8479033-96f88b6aa6e57d3b-lnd-linsup-03-dstore01,
portal: 10.100.214.77,3260] through [iface: bond1.10] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection16:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-db8479033-96f88b6aa6e57d3b-lnd-linsup-03-dstore01,
portal: 10.100.214.77,3260] through [iface: default] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection17:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-eae479033-f6588b6aa7157d3b-lnd-linsup-04-dstore01,
portal: 10.100.214.77,3260] through [iface: bond1.10] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection18:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-eae479033-f6588b6aa7157d3b-lnd-linsup-04-dstore01,
portal: 10.100.214.77,3260] through [iface: default] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection19:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-fac479033-bf888b6aa7757d3b-lnd-linsup-u01-dstore01,
portal: 10.100.214.77,3260] through [iface: bond1.10] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection20:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-fac479033-bf888b6aa7757d3b-lnd-linsup-u01-dstore01,
portal: 10.100.214.77,3260] through [iface: default] is operational now
1. There is no point in having so many connections.
2. They certainly should not all go through the same portal - you really should have more portals.
3. Note that some go via bond1 and some via the 'default' interface. Is that intended?
4. Your multipath.conf is using rr_min_io, where it should most likely use rr_min_io_rq (see the sketch below).
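For example, a device section along these lines (the values are illustrative only - keep your existing EqualLogic entries and just swap rr_min_io for rr_min_io_rq):

device {
    vendor                 "EQLOGIC"
    product                "100E-00"
    path_grouping_policy   multibus
    path_selector          "round-robin 0"
    path_checker           tur
    rr_min_io_rq           10
    failback               immediate
}

You can then check what multipath actually applied with 'multipath -ll', and see which iface each iSCSI session uses with 'iscsiadm -m session -P 1'.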
Unrelated, your engine.log is quite flooded with:
2017-01-11 15:07:46,085 WARN
[org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerObjectsBuilder]
(DefaultQuartzScheduler9) [31a71bf5] Invalid or unknown guest architecture
type '' received from guest agent
Any idea what kind of guest you are running?
You have a lot of host devices - we have patches to improve their enumeration (coming in 4.0.7).
Y.
>> Can you try and disable (mask) the lvmetad service on the hosts and see if it improves matters?
Disabled and masked the lvmetad service and tried again this morning. The
load seemed lower and the initial activation of the host was quicker, but
the end result was still the same. Just under 10 minutes later the node
went non-operational and the cycle began again. By 09:27 we had the high
CPU load and the repeating LVM cycle.
Host Activation: 09:06
Host Up: 09:08
Non-Operational: 09:16
LVM Load: 09:27
Host Reboot: 09:30
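For reference, this is roughly what I ran to disable and mask lvmetad (a sketch - it assumes the EL7 lvm2-lvmetad units, and that use_lvmetad is also turned off in lvm.conf):

systemctl stop lvm2-lvmetad.service lvm2-lvmetad.socket
systemctl mask lvm2-lvmetad.service lvm2-lvmetad.socket
# plus, in /etc/lvm/lvm.conf:
#   use_lvmetad = 0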
From yesterday and today I've also attached the messages, sanlock.log and
multipath.conf files, although I'm not sure the messages file will be of
much use, as it looks like log rate limiting kicked in and suppressed
messages for the duration of the process. I'm booted off the kernel with
debugging, but maybe that's generating too much info? Let me know if you
want me to change anything here to get additional information.
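If it would help, I can relax the rate limiting for the next run along these lines (a sketch - it assumes the suppression is coming from systemd-journald rather than rsyslog):

In /etc/systemd/journald.conf:
RateLimitInterval=0
RateLimitBurst=0

then:
systemctl restart systemd-journald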
As additional configuration information, we also have the following
settings from the Equallogic and Linux install guide:
/etc/sysctl.conf:
# Prevent ARP Flux for multiple NICs on the same subnet:
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2
# Loosen RP Filter to allow multiple iSCSI connections
net.ipv4.conf.all.rp_filter = 2
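These are picked up from /etc/sysctl.conf at boot; to apply and spot-check them on a running host I do roughly the following (sketch):

sysctl -p /etc/sysctl.conf
sysctl net.ipv4.conf.all.arp_ignore net.ipv4.conf.all.arp_announce net.ipv4.conf.all.rp_filter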
And the following /lib/udev/rules.d/99-eqlsd.rules:
#------------------------------------------------------------------------------
# Copyright (c) 2010-2012 by Dell, Inc.
#
# All rights reserved. This software may not be copied, disclosed,
# transferred, or used except in accordance with a license granted
# by Dell, Inc. This software embodies proprietary information
# and trade secrets of Dell, Inc.
#
#------------------------------------------------------------------------------
#
# Various Settings for Dell Equallogic disks based on Dell Optimizing SAN Environment for Linux Guide
#
# Modify disk scheduler mode to noop
ACTION=="add|change", SUBSYSTEM=="block",
ATTRS{vendor}=="EQLOGIC",
RUN+="/bin/sh -c 'echo noop > /sys/${DEVPATH}/queue/scheduler'"
# Modify disk timeout value to 60 seconds
ACTION!="remove", SUBSYSTEM=="block",
ATTRS{vendor}=="EQLOGIC",
RUN+="/bin/sh -c 'echo 60 > /sys/%p/device/timeout'"
# Modify read ahead value to 1024
ACTION!="remove", SUBSYSTEM=="block",
ATTRS{vendor}=="EQLOGIC",
RUN+="/bin/sh -c 'echo 1024 > /sys/${DEVPATH}/bdi/read_ahead_kb'"
Many Thanks,
Mark