Firstly, thanks @Yaniv and thanks @Nir for your responses.
@Yaniv, in answer to this:
>> Why do you have 1 SD per VM?
It's a combination of performance and ease of management. We ran some IO tests with various configurations and settled on one storage domain per VM as a balance between reduced IO contention and ease of management. If there is a better recommended way of handling this then I'm all ears. If you believe having a large number of storage domains adds to the problem then we can also review the setup.
>> Can you try and disable (mask) the lvmetad service on the hosts and see if it improves matters?
I disabled and masked the lvmetad service and tried again this morning. The initial activation of the host seemed quicker and put less load on the host, but the end result was still the same. Just under 10 minutes later the node went non-operational and the cycle began again. By 09:27 we had the high CPU load and the repeating LVM cycle.
Host Activation: 09:06
Host Up: 09:08
Non-Operational: 09:16
LVM Load: 09:27
Host Reboot: 09:30
I've also attached the messages, sanlock.log and multipath.conf files from yesterday and today. I'm not sure the messages file will be of much use, though, as it looks like log rate limiting kicked in and suppressed messages for the duration of the process. I'm booted off the kernel with debugging but maybe that's generating too much info? Let me know if you want me to change anything here to get additional information.
As added configuration information we also have the following settings from the Equallogic and Linux install guide:
/etc/sysctl.conf:
# Prevent ARP Flux for multiple NICs on the same subnet:
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2
# Loosen RP Filter to allow multiple iSCSI connections
net.ipv4.conf.all.rp_filter = 2
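In case it helps to confirm that these three values are actually live on a host (rather than only present in /etc/sysctl.conf), here is a minimal sketch. The function name check_sysctl and its optional root-directory argument are my own illustrative choices, not something from the Dell guide; it assumes the usual mapping of sysctl keys to files under /proc/sys.

```shell
# check_sysctl [ROOT]: compare the three expected values against the files
# under ROOT (defaults to /proc/sys). Hypothetical helper, for illustration only.
check_sysctl() {
    root="${1:-/proc/sys}"
    rc=0
    while read -r key expected; do
        # A sysctl key such as net.ipv4.conf.all.rp_filter maps to
        # ROOT/net/ipv4/conf/all/rp_filter.
        path="$root/$(printf '%s' "$key" | tr . /)"
        actual="$(cat "$path" 2>/dev/null)"
        if [ "$actual" = "$expected" ]; then
            echo "OK $key = $actual"
        else
            echo "MISMATCH $key: expected $expected, got ${actual:-unreadable}"
            rc=1
        fi
    done <<'EOF'
net.ipv4.conf.all.arp_ignore 1
net.ipv4.conf.all.arp_announce 2
net.ipv4.conf.all.rp_filter 2
EOF
    return "$rc"
}
```

Running check_sysctl with no argument on a host reads the live values and returns non-zero if any of them differ from the settings above.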
And the following /lib/udev/rules.d/99-eqlsd.rules:
#-----------------------------------------------------------------------------
# Copyright (c) 2010-2012 by Dell, Inc.
#
# All rights reserved. This software may not be copied, disclosed,
# transferred, or used except in accordance with a license granted
# by Dell, Inc. This software embodies proprietary information
# and trade secrets of Dell, Inc.
#
#-----------------------------------------------------------------------------
#
# Various Settings for Dell Equallogic disks based on Dell Optimizing SAN Environment for Linux Guide
#
# Modify disk scheduler mode to noop
ACTION=="add|change", SUBSYSTEM=="block", ATTRS{vendor}=="EQLOGIC", RUN+="/bin/sh -c 'echo noop > /sys/${DEVPATH}/queue/scheduler'"
# Modify disk timeout value to 60 seconds
ACTION!="remove", SUBSYSTEM=="block", ATTRS{vendor}=="EQLOGIC", RUN+="/bin/sh -c 'echo 60 > /sys/%p/device/timeout'"
# Modify read ahead value to 1024
ACTION!="remove", SUBSYSTEM=="block", ATTRS{vendor}=="EQLOGIC", RUN+="/bin/sh -c 'echo 1024 > /sys/${DEVPATH}/bdi/read_ahead_kb'"
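To check whether the rules above have actually taken effect on the EqualLogic devices, something along these lines could be used. This is only a sketch: report_eql_disks is a hypothetical helper name of mine, and it assumes the standard sysfs layout (device/vendor, queue/scheduler, device/timeout and bdi/read_ahead_kb under each /sys/block entry).

```shell
# report_eql_disks [SYSFS_ROOT]: print the active scheduler, SCSI timeout and
# read-ahead for every block device whose vendor string starts with EQLOGIC.
# Hypothetical helper, for illustration only; SYSFS_ROOT defaults to /sys.
report_eql_disks() {
    sysfs="${1:-/sys}"
    for dev in "$sysfs"/block/*; do
        # Skip devices without a SCSI vendor attribute (dm-*, loop*, ...).
        [ -r "$dev/device/vendor" ] || continue
        case "$(cat "$dev/device/vendor")" in
            EQLOGIC*) ;;
            *) continue ;;
        esac
        # The active scheduler is the one shown in square brackets.
        sched="$(sed 's/.*\[\(.*\)\].*/\1/' "$dev/queue/scheduler")"
        timeout="$(cat "$dev/device/timeout")"
        ra="$(cat "$dev/bdi/read_ahead_kb")"
        echo "${dev##*/} scheduler=$sched timeout=$timeout read_ahead_kb=$ra"
    done
}
```

If the rules are being applied, each EqualLogic path should report scheduler=noop, timeout=60 and read_ahead_kb=1024; any other values would suggest the rules file is not matching those devices.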
Many Thanks,
Mark