[ovirt-users] problems with power management using idrac7 on r620

Eli Mesika emesika at redhat.com
Wed Jun 17 09:19:21 UTC 2015



----- Original Message -----
> From: "Jason Keltz" <jason.keltz at gmail.com>
> To: "Marek marx Grac" <mgrac at redhat.com>
> Cc: "Eli Mesika" <emesika at redhat.com>, "users" <users at ovirt.org>
> Sent: Wednesday, June 17, 2015 12:02:48 PM
> Subject: Re: problems with power management using idrac7 on r620
> 
> Hi Marek.
> 
> Actually it's the iDRAC that I believe has the memory leak.  Dell wants to
> know how often oVirt is querying the iDRAC for status and whether the delay
> is configurable.

Well, oVirt does not query the status automatically by default.
There is a feature that enables that:
http://www.ovirt.org/Features/PMHealthCheck
Basically, this feature depends on 2 configuration values:

PMHealthCheckEnabled, which should be true if the feature is enabled
PMHealthCheckIntervalInSec, which defaults to 3600 seconds, so in that case the status is checked once an hour

So, first, please check whether this is enabled in your environment:

engine-config -g PMHealthCheckEnabled

engine-config -g PMHealthCheckIntervalInSec
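
If it is enabled and you want the iDRAC queried less often, you can raise the interval with engine-config and restart the engine, for example (the value below is only illustrative):

engine-config -s PMHealthCheckIntervalInSec=7200
service ovirt-engine restart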

The other scenario in which status is used is when a host becomes non-responsive:

After a grace period that depends on the host load and on whether it is the SPM, a soft-fence attempt (a vdsmd service restart, done over SSH; see the sketch below) is issued.
If the soft-fence attempt fails, we do a real fencing (if power management is configured correctly on the host and a proxy host is found):
We send a STOP command.
We send, by default, 18 status commands, one every 10 seconds, until we get an 'off' status from the agent.
We send a START command.
We send, by default, 18 status commands, one every 10 seconds, until we get an 'on' status from the agent.
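
For reference, the soft-fence attempt is essentially the engine restarting vdsmd over SSH, roughly like this (the host name is a placeholder):

ssh root@ovirt-host1.example.com 'service vdsmd restart'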

The status retries depend on the following configuration variables:

FenceStopStatusRetries - default 18
FenceStopStatusDelayBetweenRetriesInSec - default 10 
FenceStartStatusRetries - default 18
FenceStartStatusDelayBetweenRetriesInSec - default 10 
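
So with the defaults, the iDRAC sees up to 18 status queries, 10 seconds apart, for the stop phase and again for the start phase of a single fence operation. A rough shell equivalent of one polling phase, with placeholder iDRAC address and credentials:

for i in $(seq 1 18); do
    # one status query per iteration, like the engine's retry loop
    if ipmitool -I lanplus -H idrac1.example.com -U root -P secret chassis power status | grep -q 'is off'; then
        echo "host is off"; break
    fi
    sleep 10
done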

These can be changed using the engine-config tool (a restart of ovirt-engine is required for changes to take effect).
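
For example, to space out the status queries during fencing (the values are only illustrative):

engine-config -s FenceStopStatusDelayBetweenRetriesInSec=20
engine-config -s FenceStartStatusDelayBetweenRetriesInSec=20
service ovirt-engine restart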



> 
> Jason.
> On Jun 17, 2015 2:42 AM, "Marek "marx" Grac" <mgrac at redhat.com> wrote:
> 
> >
> >
> > On 06/16/2015 09:37 AM, Eli Mesika wrote:
> >
> >> CCing Marek Grac
> >>
> >> ----- Original Message -----
> >>
> >>> From: "Jason Keltz" <jason.keltz at gmail.com>
> >>> To: "users" <users at ovirt.org>
> >>> Cc: "Eli Mesika" <emesika at redhat.com>
> >>> Sent: Monday, June 15, 2015 11:08:35 PM
> >>> Subject: problems with power management using idrac7 on r620
> >>>
> >>> Hi.
> >>>
> >>> I've been having problems with power management using iDRAC 7 EXPRESS on
> >>> a Dell R620.  This uses a shared LOM as opposed to Enterprise that has a
> >>> dedicated one.   Every now and then, idrac simply stops responding to
> >>> ping, so it can't respond to status commands from the proxy.  If I send
> >>> a reboot with "ipmitool mc reset cold" command, the idrac reboots and
> >>> comes back, but after the problem has occurred, even after a reboot, it
> >>> responds to ping, but drops 80+% of packets.  The only way I can "solve"
> >>> the problem is to physically restart the server.    This isn't just
> >>> happening on  one R620 - it's happening on all of my ovirt hosts.  I
> >>> highly suspect it has to do with a memory leak, and being monitored by
> >>> engine causes the problem.    I had applied a recent firmware upgrade
> >>> that was supposed to "solve" this kind of problem, but it doesn't.  In
> >>> order to provide Dell with more details, can someone tell me how often
> >>> each host is being queried for status?  I can't seem to find that info.
> >>> The idrac on my file server doesn't seem to exhibit the same problem,
> >>> and I suspect that is because it isn't being queried.
> >>>
> >> Hi,
> >
> > The fence agent for IPMI is based on ipmitool. So if ping/ipmitool is not
> > working, there is not much we can do about it. I don't know enough about the
> > oVirt engine, but there is no real place where the fence agent could leak
> > memory, because it does not run as a daemon.
> >
> > m,
> >
> 


