[Engine-devel] SSH Soft Fencing

Sun Jun 30 14:40:49 UTC 2013

On Thu, Jun 27, 2013 at 08:48:39AM -0400, Eli Mesika wrote:
> 
> 
> ----- Original Message -----
> > From: "Martin Perina" <mperina at redhat.com>
> > To: engine-devel at ovirt.org
> > Cc: "Yair Zaslavsky" <yzaslavs at redhat.com>, "Barak Azulay" <bazulay at redhat.com>, "Eli Mesika" <emesika at redhat.com>
> > Sent: Thursday, June 27, 2013 1:51:06 PM
> > Subject: SSH Soft Fencing
> > 
> > Hi,
> > 
> > SSH Soft Fencing is a new feature for 3.3 and it tries to restart VDSM
> > using SSH connection on non responsive hosts prior to real fencing.
> > More info can be found at
> > 
> > http://www.ovirt.org/Automatic_Fencing#Automatic_Fencing_in_oVirt_3.3
> > 
> > In current SSH Soft Fencing implementation the restart VDSM using SSH
> > command is part of standard fencing implementation in
> > VdsNotRespondingTreatmentCommand. But this command is executed only
> > if a host has a valid PM configuration. If host doesn't have a valid
> > PM configuration, the execution of the command is disabled and host
> > state is changed to Non Responsive.
> > 
> > So my questions are:
> > 
> > 1) Should SSH Soft Fencing be executed on hosts without valid PM
> >    configuration?
> 
> I think that the answer should be yes. The vdsm restart will solve most of problems

Would you enumerate the problems that would be solved by a vdsm restart
(on list, but on the feature page, too)?
I am aware of two issues, both are vdsm bugs:
- If libvirtd crashes, vdsm is not restarted unless there are
  running VMs
- Vdsm had several bugs in its soft prepareForShutdown process, getting
  itself stuck there in case of various background storage processes.

I think that solving these two issues would be safer and cleaner than
introducing an `ssh host service vdsmd restart` flow.

The first issue is only a matter of untangling some vdsm internal
ugliness: whenever a libvirtconnection is produced, it should be wrapped
so that it catches libvirt crashes. Unlike now, where only VM-related
libvirtconnections undergo this treatment.
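
To make the idea concrete, here is a minimal sketch of such a wrapper.
It is not Vdsm's actual code: wrap_connection, ConnectionLost and
FakeConnection are hypothetical names used only for illustration, and
the is_crash predicate stands in for whatever check would recognise a
dead libvirtd with the real bindings.

```python
class ConnectionLost(Exception):
    """Hypothetical error standing in for 'libvirtd went away'."""


class FakeConnection(object):
    """Stand-in for a libvirt connection, for illustration only."""

    def __init__(self, alive=True):
        self.alive = alive

    def ping(self):
        if not self.alive:
            raise ConnectionLost("libvirtd is gone")
        return "pong"


def wrap_connection(conn, is_crash, on_crash):
    """Return a proxy around conn that reports crashes.

    Every method call is forwarded to conn. If the call raises an
    exception for which is_crash(exc) is true, on_crash(exc) is invoked
    before the exception is re-raised to the caller.
    """

    class _Proxy(object):
        def __getattr__(self, name):
            attr = getattr(conn, name)
            if not callable(attr):
                # Plain attributes are passed through untouched.
                return attr

            def guarded(*args, **kwargs):
                try:
                    return attr(*args, **kwargs)
                except Exception as exc:
                    if is_crash(exc):
                        on_crash(exc)
                    raise

            return guarded

    return _Proxy()
```

If every producer of a connection went through something like
wrap_connection, the on_crash hook (for example, one that asks Vdsm to
restart itself) would fire regardless of whether any VM is running.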

The second issue can be avoided by vdsm resorting to kill-9-ing itself.
After all, this is what `service vdsmd restart` ends up doing after a
VERY short timeout (2-3 seconds, iirc).
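
A rough sketch of that self-kill, assuming a watchdog timer around the
soft shutdown. shutdown_with_deadline and its parameters are
hypothetical names, not existing Vdsm functions, and the timeout is
left to the caller. The kill argument exists only so the behaviour can
be exercised without actually killing the process.

```python
import os
import signal
import threading


def shutdown_with_deadline(prepare_for_shutdown, timeout, kill=None):
    """Run the soft shutdown, but SIGKILL ourselves if it gets stuck.

    prepare_for_shutdown: callable performing the soft shutdown.
    timeout: seconds to wait before giving up on it.
    kill: what to do on timeout; defaults to SIGKILL on this process.
    """
    if kill is None:
        def kill():
            os.kill(os.getpid(), signal.SIGKILL)

    watchdog = threading.Timer(timeout, kill)
    watchdog.daemon = True
    watchdog.start()
    try:
        prepare_for_shutdown()
    finally:
        # Soft shutdown returned (or raised): disarm the watchdog.
        watchdog.cancel()
```

With the default kill, a shutdown that hangs in some background storage
operation would end with the process being killed once the timeout
expires, instead of staying stuck indefinitely.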

I suppose that there are other reasons for a remote restart, but in
general, I think that it's better to have Vdsm "do the right thing" than
to expect Engine to control that remotely.

Regards,
Dan.
> , so why not using it whether a PM agent is defined or not.
> 
> > 
> > 2) Should VDSM restart using SSH command be reimplemented
> >    as standalone command to be usable also in other parts of engine?
> >    If 1) is true, I think it will have to be done anyway.
> 
> +1 
> 
> > 
> > 
> > Martin Perina
> > 