On Wed, Jan 11, 2017 at 7:35 PM, Mark Greenall <m.greenall@iontrading.com> wrote:

Hi oVirt Champions,

I am pulling my hair out and in need of advice / help.

Host server: Dell PowerEdge R815 (40 cores and 768GB memory)
Storage: Dell EqualLogic (Firmware V8.1.4)
OS: CentOS 7.3 (although the same thing happens on 7.2)
oVirt: 4.0.6.3-1 (although it also happens on 4.0.5)

I can’t pinpoint exactly when this started happening, but it has certainly been happening with oVirt 4.0.5 and CentOS 7.2. Today I updated the Hosted Engine and one host to 4.0.6 and CentOS 7.3, but we still see the same problem. Our hosts are connected to Dell EqualLogic iSCSI storage. We have one storage domain defined per VM guest, so we have quite a few LUNs presented to the cluster (around 45 in total).


Why do you have 1 SD per VM?

Can you try disabling (masking) the lvmetad service on the hosts to see if it improves matters?
Also /var/log/messages from the host may give us some clues.
TIA,
Y.
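
For anyone following along, here is a minimal sketch of what "disable (mask) lvmetad" could look like on a host, written as a small Python helper. The unit names (lvm2-lvmetad.service and lvm2-lvmetad.socket) are what CentOS 7 normally ships, but treat them as an assumption and check with `systemctl list-unit-files | grep lvmetad` first. The helper functions are hypothetical and only print the commands unless dry_run is turned off.

```python
import subprocess

# Assumed unit names for lvmetad on CentOS 7; verify them on your host.
LVMETAD_UNITS = ["lvm2-lvmetad.service", "lvm2-lvmetad.socket"]


def lvmetad_mask_commands(units=LVMETAD_UNITS):
    """Return the systemctl invocations that stop and then mask lvmetad."""
    return [
        ["systemctl", "stop", *units],
        ["systemctl", "mask", *units],
    ]


def apply(commands, dry_run=True):
    """Print each command, or run it as root when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


# Example: show what would be run, without changing anything.
# apply(lvmetad_mask_commands())
```

It is probably also worth setting use_lvmetad = 0 in /etc/lvm/lvm.conf, otherwise the LVM tools may warn that they cannot connect to lvmetad before falling back to scanning devices.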
 

 

Problem Description:

1) Reboot a host.
2) Activate the host in the oVirt Admin GUI.
3) A few minutes later the host is shown as activated.
4) Approx. 10-15 mins later the host goes offline, complaining that it can’t connect to storage.
5) The host then loops constantly (activating, non-operational, connecting, initialising) and ends up with a high CPU load and a large number of lvm commands in the process tree.
6) Multipath and iSCSI show all storage as available and logged in.
7) The EqualLogic shows the host connected and no errors.
8) The Admin GUI ends up saying the host can’t connect to storage ‘UNKNOWN’.

The strange thing is that every now and again step 5 doesn’t happen and the host will actually activate again and then stays up.  However, it still takes step 4 to take the host offline first.
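
To put a number on the "large number of lvm commands" in step 5, something like the following rough sketch could be run on the host while it is looping. It assumes Python 3 is available (CentOS 7 defaults to Python 2), and the set of process names counted as LVM commands is my own guess, not an exhaustive list.

```python
import subprocess
from collections import Counter

# Process names treated as "lvm commands". This set is an assumption;
# extend it if other LVM tools show up in your process tree.
LVM_COMMANDS = {"lvm", "pvs", "vgs", "lvs", "vgck",
                "lvchange", "vgchange", "pvscan", "vgscan"}


def count_lvm_processes(ps_output):
    """Count LVM-related process names in the output of `ps -eo comm=`."""
    names = (line.strip() for line in ps_output.splitlines())
    return Counter(name for name in names if name in LVM_COMMANDS)


def snapshot():
    """Take one sample from the live process table."""
    out = subprocess.run(["ps", "-eo", "comm="], check=True,
                         stdout=subprocess.PIPE,
                         universal_newlines=True).stdout
    return count_lvm_processes(out)


# Example: print(snapshot()) every few seconds and watch whether the
# count keeps growing after the host goes non-operational.
```

Comparing a few samples taken before and after step 4 should show whether the lvm processes are piling up or simply slow to finish.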

 

Expected Behaviour:

1) Reboot a host.
2) Activate the host in the oVirt Admin GUI.
3) A few minutes later the host is shown as activated.
4) Begin using the host with confidence.

I’ve attached the engine.log from the Hosted Engine and the vdsm.log from the host. The following is a timeline of the latest event:

Host Activation: 15:07
Host Up: 15:10
Non-Operational: 15:17

I am seriously hoping someone can spot something obvious, as this is making the clusters somewhat unstable and unreliable.

Many thanks,
Mark

