[ovirt-users] Ovirt host activation and lvm looping with high CPU load trying to mount iSCSI storage
Nir Soffer
nsoffer at redhat.com
Tue Jan 17 09:37:59 UTC 2017
On Mon, Jan 16, 2017 at 7:19 PM, Mark Greenall
<m.greenall at iontrading.com> wrote:
> Hi,
>
> To try and get a baseline here I've reverted most of the changes we've made and am running the host with just the following iSCSI-related configuration settings. The tweaks had been made over time to try to alleviate several storage-related problems, but it's possible that fixes in Ovirt (we've gradually gone from early 3.x to 4.0.6) make them redundant now and they simply compound the problem. I'll start with these configuration settings and then move on to trying the vdsm patch.
>
> /etc/multipath.conf (note: polling_interval and max_fds would not get accepted in the devices section. I think they are valid in the defaults section only):
Right, my error.
>
> # VDSM REVISION 1.3
> # VDSM PRIVATE
>
> blacklist {
> devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
> devnode "^hd[a-z]"
> devnode "^sda$"
> }
>
> defaults {
> deferred_remove yes
> dev_loss_tmo 30
> fast_io_fail_tmo 5
> flush_on_last_del yes
> max_fds 4096
You can bump the number of fds here if needed (see the check after the config below).
> no_path_retry fail
> polling_interval 5
> user_friendly_names no
> }
>
> devices {
> device {
> vendor "EQLOGIC"
> product "100E-00"
>
> # Ovirt defaults
> deferred_remove yes
> dev_loss_tmo 30
> fast_io_fail_tmo 5
> flush_on_last_del yes
> # polling_interval 5
> user_friendly_names no
>
> # Local settings
> # max_fds 8192
> path_checker tur
> path_grouping_policy multibus
> path_selector "round-robin 0"
>
> # Using 4 retries provides an additional 20 seconds grace time when no
> # path is available before the device is disabled (assuming a 5 second
> # polling interval). This may prevent vms from pausing when there is a
> # short outage on the storage server or network.
> no_path_retry 4
> }
>
> device {
> # These settings override built-in device settings. They do not apply
> # to devices without built-in settings (those use the settings in the
> # "defaults" section), or to devices defined in the "devices" section.
> all_devs yes
> no_path_retry fail
> }
> }
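Once the host is up with this file, it is worth checking what multipathd actually applied - the built-in EQLOGIC settings, your device section, and the defaults (including max_fds, which is ignored inside device sections) get merged, and it is easy to end up with something different from what you expect. A quick sanity check, nothing here is oVirt-specific:

    # dump the effective, merged configuration
    multipathd show config

    # look at the EQLOGIC device section and the defaults that ended up in effect
    multipathd show config | grep -A 20 EQLOGIC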
>
>
> /etc/iscsi/iscsid.conf default apart from:
>
> node.session.initial_login_retry_max = 12
> node.session.cmds_max = 1024
> node.session.queue_depth = 128
> node.startup = manual
> node.session.iscsi.FastAbort = No
I don't know about these options; I would try the defaults first, unless
you can explain why they are needed.
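For reference, the stock iscsid.conf on EL7 ships with roughly these values for the options you changed (I am quoting from memory, so double-check against the default file from the iscsi-initiator-utils package):

    node.startup = automatic
    node.session.initial_login_retry_max = 8
    node.session.cmds_max = 128
    node.session.queue_depth = 32
    node.session.iscsi.FastAbort = Yes

You can also see what a live session actually negotiated with:

    iscsiadm -m session -P 3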
>
>
>
>
> The following settings have been commented out / removed:
>
> /etc/sysctl.conf:
>
> # For more information, see sysctl.conf(5) and sysctl.d(5).
> # Prevent ARP Flux for multiple NICs on the same subnet:
> #net.ipv4.conf.all.arp_ignore = 1
> #net.ipv4.conf.all.arp_announce = 2
> # Loosen RP Filter to allow multiple iSCSI connections
> #net.ipv4.conf.all.rp_filter = 2
You need these if you are connecting to two addresses on the same subnet.
Vdsm will do this automatically if needed, provided it is configured properly on
the engine side. Unfortunately I don't know how to configure this on the
engine side, but maybe other users with the same configuration know.
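If you want to see what is currently in effect on the host, you can read the values back with sysctl; note that these also exist per interface (net.ipv4.conf.<nic>.rp_filter and friends, where <nic> is a placeholder for your iSCSI nic), so check both:

    sysctl net.ipv4.conf.all.rp_filter
    sysctl -a 2>/dev/null | grep -E 'rp_filter|arp_ignore|arp_announce'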
>
>
> /lib/udev/rules.d:
>
> # Various Settings for Dell Equallogic disks based on Dell Optimizing SAN Environment for Linux Guide
> #
> # Modify disk scheduler mode to noop
> #ACTION=="add|change", SUBSYSTEM=="block", ATTRS{vendor}=="EQLOGIC", RUN+="/bin/sh -c 'echo noop > /sys/${DEVPATH}/queue/scheduler'"
> # Modify disk timeout value to 60 seconds
> #ACTION!="remove", SUBSYSTEM=="block", ATTRS{vendor}=="EQLOGIC", RUN+="/bin/sh -c 'echo 60 > /sys/%p/device/timeout'"
> # Modify read ahead value to 1024
> #ACTION!="remove", SUBSYSTEM=="block", ATTRS{vendor}=="EQLOGIC", RUN+="/bin/sh -c 'echo 1024 > /sys/${DEVPATH}/bdi/read_ahead_kb'"
>
> I've also removed our defined iSCSI interfaces and have simply left the Ovirt 'default'.
The default will probably use only a single path for each device, unless you
configure the engine to use both nics.
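To confirm how many paths you actually get per LUN with only the default iface, check the sessions and the multipath maps - with a single iface I would expect a single path per device:

    iscsiadm -m session     # one line per iSCSI session
    multipath -ll           # each LUN lists its paths; one path means no redundancy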
>
> Rebooted and 'Activated' host:
>
> 16:09 - Host Activated
> 16:10 - Non Operational saying it can't access storage domain 'Unknown'
This means a PV is not accessible, which smells like a connectivity issue with the storage (see the quick checks at the end).
> 16:12 - Host Activated again
> 16:12 - Host not responding goes 'Connecting'
> 16:15 - Can't access ALL the storage Domains. Host goes Non Operational again
Do you mean it cannot access any storage domain, or that it can access only some?
> 16:17 - Host Activated again
> 16:18 - Can't access ALL the storage Domains. Host goes Non Operational again
> 16:20 - Host Autorecovers and goes Activating again
> That cycle repeated until I started getting VDSM timeout messages and the constant LVM processes and high CPU load. @16:30 I rebooted the host and set the status to maintenance.
>
> Second host Activation attempt just resulted in the same cycle as above. Host now doesn't come online at all.
Do you have logs from this session?
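The useful ones, assuming default locations, are /var/log/vdsm/vdsm.log, /var/log/sanlock.log and /var/log/messages from the host covering 16:09-16:30, and /var/log/ovirt-engine/engine.log from the engine. When a domain goes inaccessible, these quick checks on the host usually show whether it is a connectivity problem:

    iscsiadm -m session                   # are all sessions still logged in?
    multipath -ll                         # any maps with failed or faulty paths?
    pvs -o pv_name,vg_name,pv_size        # can LVM still see the PVs backing the domains?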
>
> Next step will be to try the vdsm patch.
>
> Mark