[ovirt-users] Ovirt host activation and lvm looping with high CPU load trying to mount iSCSI storage

Mark Greenall m.greenall at iontrading.com
Wed Jan 11 17:35:55 UTC 2017


Hi Ovirt Champions,

I am pulling my hair out and in need of advice / help.

Host server: Dell PowerEdge R815 (40 cores and 768GB memory)
Storage: Dell Equallogic (Firmware V8.1.4)
OS: CentOS 7.3 (although the same thing happens on 7.2)
Ovirt: 4.0.6.3-1 (although also happens on 4.0.5)

I can't pinpoint exactly when this started, but it has certainly been happening with Ovirt 4.0.5 and CentOS 7.2. Today I updated the Hosted Engine and one host to Ovirt 4.0.6 and CentOS 7.3, but we still see the same problem. Our hosts are connected to Dell Equallogic iSCSI storage. We have one storage domain defined per VM guest, so we have quite a few LUNs presented to the cluster (around 45 in total).

Problem Description:

1) Reboot a host.
2) Activate the host in the Ovirt Admin GUI.
3) A few minutes later, the host is shown as activated.
4) Approximately 10-15 minutes later, the host goes offline, complaining that it can't connect to storage.
5) It then loops continuously through activating, non-operational, connecting and initialising, and the host ends up with a high CPU load and a large number of lvm commands in the process tree.
6) Multipath and iSCSI show all storage as available and logged in.
7) Equallogic shows the host connected, with no errors.
8) The Admin GUI ends up saying the host can't connect to storage 'UNKNOWN'.

The strange thing is that every now and again step 5 doesn't happen: the host activates again and then stays up. However, step 4 still takes the host offline first.
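To put a number on the "large number of lvm commands" in step 5, something like the minimal Python sketch below could count them on the affected host by scanning /proc. It assumes a Linux host and that the commands show up with the process name `lvm` (i.e. invoked through the `lvm` binary); `count_processes` is a hypothetical helper, not part of Ovirt or vdsm.

```python
import os


def count_processes(name: str, proc_root: str = "/proc") -> int:
    """Count processes whose command name (/proc/<pid>/comm) equals `name`.

    Hypothetical diagnostic helper; assumes a Linux-style /proc layout.
    """
    count = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() == name:
                    count += 1
        except OSError:
            # The process exited between listdir() and open(), or comm is unreadable.
            continue
    return count
```

Calling `count_processes("lvm")` a few times during the activate/non-operational loop would show whether the lvm processes are piling up or just churning.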

Expected Behaviour:

1) Reboot a host.
2) Activate the host in the Ovirt Admin GUI.
3) A few minutes later, the host is shown as activated.
4) Begin using the host with confidence.

I've attached the engine.log from Hosted Engine and vdsm.log from the host. The following is a timeline of the latest event.

Host Activation: 15:07
Host Up: 15:10
Non-Operational: 15:17
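The intervals in that timeline can be worked out with a short Python sketch like the one below: 3 minutes from activation to Up, 7 minutes from Up to Non-Operational, so 10 minutes in total, at the low end of the 10-15 minute window in step 4. It assumes all three times fall on the same day; `minutes_between` is a hypothetical helper written for this illustration.

```python
from datetime import datetime

# Timeline of the latest event, as reported above (assumed to be the same day).
events = [("Host Activation", "15:07"), ("Host Up", "15:10"), ("Non-Operational", "15:17")]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from `start` to `end`, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Print each consecutive interval, then the total from activation to failure.
for (prev_name, prev_t), (name, t) in zip(events, events[1:]):
    print(f"{prev_name} -> {name}: {minutes_between(prev_t, t)} min")
print(f"Host Activation -> Non-Operational: {minutes_between(events[0][1], events[-1][1])} min")
```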

Seriously hoping someone can spot something obvious as this is making the clusters somewhat unstable and unreliable.

Many Thanks,
Mark
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Ovirt Debug.zip
Type: application/x-zip-compressed
Size: 699169 bytes
Desc: Ovirt Debug.zip
URL: <http://lists.ovirt.org/pipermail/users/attachments/20170111/3169bd85/attachment-0001.bin>

