[vdsm] Review Request: Add an option to create a watchdog device.

Doron Fediuck dfediuck at redhat.com
Mon Nov 26 17:35:41 UTC 2012



----- Original Message -----
> From: "Ryan Harper" <ryanh at us.ibm.com>
> To: "Doron Fediuck" <dfediuck at redhat.com>
> Cc: "Ryan Harper" <ryanh at us.ibm.com>, "Sheldon" <shaohef at linux.vnet.ibm.com>, arch at ovirt.org, "Zheng Sheng ZS Zhou"
> <zhshzhou at cn.ibm.com>, "Itamar Heim" <iheim at redhat.com>, agl at linux.vnet.ibm.com, "Shu Ming"
> <shuming at linux.vnet.ibm.com>, "Mark Wu" <wudxw at linux.vnet.ibm.com>, snmishra at us.ibm.com, danken at redhat.com
> Sent: Monday, November 26, 2012 5:50:34 PM
> Subject: Re: [vdsm] Review Request: Add an option to create a watchdog device.
> 
> * Doron Fediuck <dfediuck at redhat.com> [2012-11-26 09:20]:
> > ----- Original Message -----
> > > From: "Ryan Harper" <ryanh at us.ibm.com>
> > > To: "Doron Fediuck" <dfediuck at redhat.com>
> > > Cc: "Sheldon" <shaohef at linux.vnet.ibm.com>, arch at ovirt.org,
> > > "Zheng Sheng ZS Zhou" <zhshzhou at cn.ibm.com>, "Itamar
> > > Heim" <iheim at redhat.com>, agl at linux.vnet.ibm.com, "Shu Ming"
> > > <shuming at linux.vnet.ibm.com>, "Mark Wu"
> > > <wudxw at linux.vnet.ibm.com>, ryanh at us.ibm.com,
> > > snmishra at us.ibm.com, danken at redhat.com
> > > Sent: Monday, November 26, 2012 4:01:48 PM
> > > Subject: Re: [vdsm] Review Request: Add an option to create a
> > > watchdog device.
> > > 
> > > * Doron Fediuck <dfediuck at redhat.com> [2012-11-22 03:56]:
> > > > 
> > > > ----- Original Message -----
> > > > 
> > > > > From: "Sheldon" <shaohef at linux.vnet.ibm.com>
> > > > > To: "Doron Fediuck" <dfediuck at redhat.com>
> > > > > Cc: arch at ovirt.org, "Zheng Sheng ZS Zhou"
> > > > > <zhshzhou at cn.ibm.com>,
> > > > > "Itamar Heim" <iheim at redhat.com>, agl at linux.vnet.ibm.com,
> > > > > "Shu
> > > > > Ming"
> > > > > <shuming at linux.vnet.ibm.com>, "Mark Wu"
> > > > > <wudxw at linux.vnet.ibm.com>,
> > > > > ryanh at us.ibm.com, snmishra at us.ibm.com, danken at redhat.com
> > > > > Sent: Thursday, November 22, 2012 11:00:18 AM
> > > > > Subject: Re: [vdsm] Review Request: Add an option to create a
> > > > > watchdog device.
> > > > 
> > > > > On 11/21/2012 04:00 PM, Doron Fediuck wrote:
> > > > 
> > > > > > > Currently, we do not have any plans to implement the
> > > > > > > engine side of the feature.
> > > > > > 
> > > > > 
> > > > > > > But I will add a watchdog feature page to describe how the
> > > > > > > engine enables this feature. It's definitely great if any
> > > > > > > engine guy would like to take the engine part. I will be glad
> > > > > > > to provide help if needed.
> > > > > > 
> > > > > 
> > > > > > Hi Sheldon,
> > > > > 
> > > > > > Any news on the engine side?
> > > > > 
> > > > > > Currently the vdsm side is merged, while the engine side is
> > > > > > still missing.
> > > > > 
> > > > > > The wiki page also lacks the engine side. Can you please
> > > > > > handle it?
> > > > > 
> > > > 
> > > > > Hi Doron,
> > > > 
> > > > > I have updated the wiki page.
> > > > > http://wiki.ovirt.org/wiki/Add_an_option_to_create_a_watchdog_device
> > > > > And for the vdsm side, I should also add a new patch to report
> > > > > the watchdog event.
> > > > 
> > > > > I can add a flag to the VM's status, so the engine can poll the
> > > > > VM's status to check for the event, then notify the user and let
> > > > > the user take some action, such as restarting the guest or
> > > > > dumping it for analysis.
> > > > > An event report channel might be better, but I have not found
> > > > > any in vdsm, and adding an event registration mechanism to vdsm
> > > > > is a big piece of work.
> > > > 
> > > > > what's your suggestion?
> > > > 
> > > > > --
> > > > > Sheldon Feng <shaohef at linux.vnet.ibm.com>
> > > > > IBM Linux Technology Center
> > > > 
> > > > Hi Sheldon,
> > > > AFAIK, watchdog fires automatically, so no real need for user
> > > > interaction when an event happens. So I'd expect the user to set
> > > > the relevant action before starting the VM. Once the watchdog is
> > > > triggered, it will do whatever action he has set, and notify the
> > > > user.
> > > > 
> > > > So I'd expect the user to have a list of actions for the watchdog
> > > > device in the engine UI, with a default of none. The user should be
> > > > able to choose which action to set when starting or editing the VM
> > > > (for next run).
> > > 
> > > I'd like to suggest we pick something other than none by default
> > > since we've gone through the trouble of configuring and enabling a
> > > watchdog. I think it's worth the discussion of what a better default
> > > behavior should be given access to a watchdog.
> > > 
> > > I'd suggest that a simple reboot mode would be most useful.
> > > 
> > 
> > Hi Ryan, good point.
> > The reason I asked for none is exactly because someone thought of it
> > when writing the device actions, i.e. otherwise a no-op makes no sense.
> > But as we all know, a no-op sometimes proves to be a much-needed
> > option, if not the default one.
> > In this context, a watchdog has quite an explosive potential for a VM.
> > So for the sake of all users I'd rather ask them to specify exactly
> > what should be done. Otherwise: Primum non nocere. I'm sure one day
> > someone will appreciate it.
> 
> While I understand what you're saying, I think it's worth actually
> walking through all of the actions and selecting the best here.  VDSM
> has a role to play here in how *best* to configure a VM.  I think that
> a watchdog can elevate the usefulness of a VM by ensuring that it stays
> running without user intervention.
> 

Ryan, you're mixing vdsm and engine.
My response was about the way the engine UI will present it to the user:

> > > > So I'd expect the user to have a list of actions for the watchdog
> > > > device in the engine UI, with a default of none. The user should be
> > > > able to choose which action to set when starting or editing the VM
> > > > (for next run).

So this is not about vdsm, but about engine UI.

As for VDSM's role in choosing the best VM configuration, I disagree.
What's best for your VM will not always be best for my VM, especially
when a reboot is being considered. So unless there's a 100% fool-proof
reason, do no harm.
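
To make the "default of none" idea concrete, here is a rough sketch. It
assumes the libvirt watchdog device element and its action attribute; the
helper name build_watchdog_xml and the choice of i6300esb as the default
model are made up for illustration and are not vdsm or engine code. The
action is always written out explicitly because, as far as I know, libvirt
treats a missing action as reset.

```python
import xml.etree.ElementTree as ET

# Actions that libvirt accepts for a watchdog device.
WATCHDOG_ACTIONS = (
    "reset", "shutdown", "poweroff", "pause", "none", "dump", "inject-nmi",
)


def build_watchdog_xml(action="none", model="i6300esb"):
    """Hypothetical helper: return the watchdog device XML for a VM.

    The default action is "none", so a watchdog that fires does nothing
    to the guest unless the user explicitly picked another action.
    """
    if action not in WATCHDOG_ACTIONS:
        raise ValueError("unknown watchdog action: %r" % (action,))
    # Emit the action explicitly instead of relying on a hypervisor default.
    device = ET.Element("watchdog", {"model": model, "action": action})
    return ET.tostring(device, encoding="unicode")


if __name__ == "__main__":
    print(build_watchdog_xml())          # safe default: action is "none"
    print(build_watchdog_xml("reset"))   # only when the user asks for it
```

With something like this, a VM only gets a rebooting watchdog when the
user has asked for one.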

> As you say, having an unexpected reboot when it's not wanted can cause
> an issue, so we have at least two areas to discuss:
> 
> 1) watchdog fidelity: does it do what it's supposed to do at the right
> time, without malfunctioning?  This requires testing and use to validate.
> Leaving the watchdog off by default will certainly reduce the amount of
> testing time.
> 
> 2) watchdog configuration.  What's the most reasonable and helpful
> configuration?  This includes the action as well as any variables
> associated with that specific action.  I think the best course here is
> to propose an initial configuration and start getting some test-time
> under the configuration for validation.
> 

Ryan, just a reminder that this is an engine UI thread.
As such, I'd be very careful about rebooting anything by default.
This is not an audio or VGA card, where you can fall back to a lower
resolution. This will kill your guest, with everything running in it.

> If we're unwilling to enable an action by default, I'd like to have a
> discussion around why that's the case.  The initial objection to
> always-on with action=reboot seems to be concern about the watchdog
> misfiring when it shouldn't.   Are there other concerns?
> 

Yes. Googling will turn up several watchdog-related cases, which
I can't quote here due to the copyrights of the relevant KBs. The general
idea is that one of 3 things causes the WD to fire:
1. watchdog driver issues
2. Guest OS low on resources (potentially swapping), but still running
3. Host issues, such as sockets exhausted, etc.

The main thing is that in none of the above cases will rebooting the VM
improve the situation. If anything, it will make it worse. By default...

> Another thought here is to think about the target guest OS type.  It may
> be the case that specific actions/configurations make sense for one OS,
> but not the other[1]
> 
> There was an engine-devel thread about libosinfo integration[2].
> 
> 

See my previous comment for relevant cases. As a default watchdog policy,
I'd rather be safe than sorry. Most KBs I saw tell you to stop
the watchdog service or remove the device to begin with. Then you get
a bug fix. But as you probably understand, for some users the damage
has already been done by then.

One more thing you need to consider is exporting and importing VMs, as
well as VM templates and pools. Here as well, you may get an unpleasant
surprise if you use a VM with a watchdog that will bite by default.

> [1] http://rwmj.wordpress.com/2010/03/03/what-is-a-watchdog/#comment-4959
> [2] http://lists.ovirt.org/pipermail/engine-devel/2012-September/002544.html
> 
> 
> --
> Ryan Harper
> Software Engineer; Linux Technology Center
> IBM Corp., Austin, Tx
> ryanh at us.ibm.com
> 
> 


