Hi,
To try to get a baseline here, I've reverted most of the changes we've made and am
running the host with just the following iSCSI-related configuration settings. The tweaks
had been made over time to try to alleviate several storage-related problems, but
it's possible that fixes in oVirt (we've gradually gone from early 3.x to 4.0.6)
make them redundant now and they simply compound the problem. I'll start with these
configuration settings and then move on to trying the vdsm patch.
/etc/multipath.conf (note: polling_interval and max_fds were not accepted in the
devices section; I think they are valid in the defaults section only):
# VDSM REVISION 1.3
# VDSM PRIVATE
blacklist {
devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
devnode "^hd[a-z]"
devnode "^sda$"
}
defaults {
deferred_remove yes
dev_loss_tmo 30
fast_io_fail_tmo 5
flush_on_last_del yes
max_fds 4096
no_path_retry fail
polling_interval 5
user_friendly_names no
}
devices {
device {
vendor "EQLOGIC"
product "100E-00"
# Ovirt defaults
deferred_remove yes
dev_loss_tmo 30
fast_io_fail_tmo 5
flush_on_last_del yes
# polling_interval 5
user_friendly_names no
# Local settings
# max_fds 8192
path_checker tur
path_grouping_policy multibus
path_selector "round-robin 0"
# Use 4 retries will provide additional 20 seconds gracetime when no
# path is available before the device is disabled. (assuming 5 seconds
# polling interval). This may prevent vms from pausing when there is
# short outage on the storage server or network.
no_path_retry 4
}
device {
# These settings overrides built-in devices settings. It does not apply
# to devices without built-in settings (these use the settings in the
# "defaults" section), or to devices defined in the "devices"
# section.
all_devs yes
no_path_retry fail
}
}
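As a quick sanity check on the no_path_retry comment in the EQLOGIC section above, here is a rough sketch of the arithmetic in plain Python. The function name grace_seconds is made up for illustration, and the model (retries multiplied by polling interval) is simply the one the config comment describes, not something taken from the multipathd source.

```python
def grace_seconds(no_path_retry, polling_interval):
    """Rough time (seconds) I/O is queued after the last path fails.

    no_path_retry may be an integer retry count or the string "fail"
    (no queueing at all). "queue" (retry forever) is not handled here.
    """
    if no_path_retry == "fail":
        return 0
    return int(no_path_retry) * polling_interval


# EQLOGIC device section above: no_path_retry 4, polling_interval 5
print(grace_seconds(4, 5))
# defaults section above: no_path_retry fail
print(grace_seconds("fail", 5))
```

With the values above, the EQLOGIC LUNs get about 20 seconds of grace, while anything falling back to the defaults section fails I/O as soon as the last path is gone.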
/etc/iscsi/iscsid.conf default apart from:
node.session.initial_login_retry_max = 12
node.session.cmds_max = 1024
node.session.queue_depth = 128
node.startup = manual
node.session.iscsi.FastAbort = No
The following settings have been commented out / removed:
/etc/sysctl.conf:
# For more information, see sysctl.conf(5) and sysctl.d(5).
# Prevent ARP Flux for multiple NICs on the same subnet:
#net.ipv4.conf.all.arp_ignore = 1
#net.ipv4.conf.all.arp_announce = 2
# Loosen RP Filter to allow multiple iSCSI connections
#net.ipv4.conf.all.rp_filter = 2
/lib/udev/rules.d:
# Various Settings for Dell Equallogic disks based on Dell Optimizing SAN Environment for
# Linux Guide
#
# Modify disk scheduler mode to noop
#ACTION=="add|change", SUBSYSTEM=="block", ATTRS{vendor}=="EQLOGIC", RUN+="/bin/sh -c 'echo noop > /sys/${DEVPATH}/queue/scheduler'"
# Modify disk timeout value to 60 seconds
#ACTION!="remove", SUBSYSTEM=="block", ATTRS{vendor}=="EQLOGIC", RUN+="/bin/sh -c 'echo 60 > /sys/%p/device/timeout'"
# Modify read ahead value to 1024
#ACTION!="remove", SUBSYSTEM=="block", ATTRS{vendor}=="EQLOGIC", RUN+="/bin/sh -c 'echo 1024 > /sys/${DEVPATH}/bdi/read_ahead_kb'"
I've also removed our defined iSCSI interfaces and have simply left the oVirt
'default' one.
Rebooted and 'Activated' host:
16:09 - Host Activated
16:10 - Non Operational saying it can't access storage domain 'Unknown'
16:12 - Host Activated again
16:12 - Host not responding goes 'Connecting'
16:15 - Can't access ALL the storage Domains. Host goes Non Operational again
16:17 - Host Activated again
16:18 - Can't access ALL the storage Domains. Host goes Non Operational again
16:20 - Host Autorecovers and goes Activating again
That cycle repeated until I started getting VDSM timeout messages, along with constant LVM
processes and high CPU load. At 16:30 I rebooted the host and set its status to
maintenance.
A second host activation attempt just resulted in the same cycle as above. The host now
doesn't come online at all.
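To put numbers on the flapping, here is a small sketch that measures how long the host stayed up after each activation in the timeline above. It is plain Python with the events pasted in by hand; the helper names (parse, minutes_up) are made up for this example and nothing here reads the engine or vdsm logs.

```python
from datetime import datetime

# Events copied from the timeline above (first activation attempt).
TIMELINE = """\
16:09 - Host Activated
16:10 - Non Operational saying it can't access storage domain 'Unknown'
16:12 - Host Activated again
16:12 - Host not responding goes 'Connecting'
16:15 - Can't access ALL the storage Domains. Host goes Non Operational again
16:17 - Host Activated again
16:18 - Can't access ALL the storage Domains. Host goes Non Operational again
16:20 - Host Autorecovers and goes Activating again
"""


def parse(text):
    """Split 'HH:MM - message' lines into (datetime, message) pairs."""
    events = []
    for line in text.splitlines():
        stamp, _, message = line.partition(" - ")
        events.append((datetime.strptime(stamp.strip(), "%H:%M"), message.strip()))
    return events


def minutes_up(events):
    """Minutes from each 'Activated' event to the next 'Non Operational' one."""
    durations = []
    activated_at = None
    for when, message in events:
        if "Activated" in message:
            activated_at = when
        elif "Non Operational" in message and activated_at is not None:
            durations.append(int((when - activated_at).total_seconds() // 60))
            activated_at = None
    return durations


print(minutes_up(parse(TIMELINE)))
```

Going by the minute-resolution timestamps, the host never stayed active for more than about three minutes before dropping back to Non Operational.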
Next step will be to try the vdsm patch.
Mark