On Wed, Jan 11, 2017 at 9:23 PM, Nir Soffer <nsoffer(a)redhat.com> wrote:
On Wed, Jan 11, 2017 at 7:35 PM, Mark Greenall
<m.greenall(a)iontrading.com> wrote:
> Hi Ovirt Champions,
>
>
>
> I am pulling my hair out and in need of advice / help.
>
>
>
> Host server: Dell PowerEdge R815 (40 cores and 768GB memory)
>
> Storage: Dell Equallogic (Firmware V8.1.4)
>
> OS: CentOS 7.3 (although the same thing happens on 7.2)
>
> Ovirt: 4.0.6.3-1 (although also happens on 4.0.5)
>
>
>
> I can’t pinpoint exactly when this started happening, but it’s certainly been
> happening with Ovirt 4.0.5 and CentOS 7.2. Today I updated Hosted Engine and
> one host to 4.0.6 and CentOS 7.3 but we still see the same problem. Our
> hosts are connected to Dell iSCSI Equallogic storage. We have one storage
> domain defined per VM guest, so we do have quite a few LUNs presented to the
> cluster (around 45 in total).
>
>
>
> Problem Description:
>
> 1) Reboot a host.
>
> 2) Activate a host in Ovirt Admin GUI.
>
> 3) A few minutes later host is shown as activated.
>
> 4) Approx 10-15 mins later host goes offline complaining that it can’t
> connect to storage.
>
> 5) The host then loops constantly (activating, non operational,
> connecting, initialising) and ends up with a high CPU load and a
> large number of lvm commands in the process tree.
>
> 6) Multipath and iscsi show all storage is available and logged in.
>
> 7) Equallogic shows host connected and no errors.
>
> 8) Admin GUI ends up saying the host can’t connect to storage
> ‘UNKNOWN’.
>
>
>
> The strange thing is that every now and again step 5 doesn’t happen and the
> host will actually activate again and then stay up. However, the host still
> goes offline first, as in step 4.
>
>
>
> Expected Behaviour:
>
> 1) Reboot a host.
>
> 2) Activate a host in Ovirt Admin GUI.
>
> 3) A few minutes later host is shown as activated.
>
> 4) Begin using host with confidence.
>
>
>
> I’ve attached the engine.log from Hosted Engine and vdsm.log from the host.
> The following is a timeline of the latest event.
>
>
>
> Host Activation : 15:07
>
> Host Up: 15:10
>
> Non-Operational: 15:17
>
>
>
> Seriously hoping someone can spot something obvious as this is making the
> clusters somewhat unstable and unreliable.
Can you share /var/log/messages and /var/log/sanlock.log?
And /etc/multipath.conf?
Nir
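
One way to gather the three requested files into a single attachment is
sketched below in Python. The function name bundle_files and the archive
name are hypothetical, and the sketch assumes it is run as root on the
affected host so that the logs are readable.

```python
import tarfile
from pathlib import Path

# Paths as requested in this thread; adjust if your host keeps them elsewhere.
REQUESTED = [
    "/var/log/messages",
    "/var/log/sanlock.log",
    "/etc/multipath.conf",
]


def bundle_files(paths, archive_path):
    """Add every existing regular file in `paths` to a gzip-compressed tarball.

    Returns a tuple (added, missing) of path strings, so that absent files
    are reported instead of being skipped silently.
    """
    added, missing = [], []
    with tarfile.open(archive_path, "w:gz") as tar:
        for p in map(Path, paths):
            if p.is_file():
                # Store under the bare file name to keep the archive flat.
                tar.add(str(p), arcname=p.name)
                added.append(str(p))
            else:
                missing.append(str(p))
    return added, missing

# Example call (hypothetical archive name):
#   added, missing = bundle_files(REQUESTED, "ovirt-host-logs.tar.gz")
```

Anything returned in the second list was not found on the host, which is
worth mentioning when sending the archive.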