[ovirt-users] MoM is failing!!!

Piotr Kliczewski piotr.kliczewski at gmail.com
Mon Oct 16 15:27:40 UTC 2017


On Mon, Oct 16, 2017 at 4:51 PM, Erekle Magradze
<erekle.magradze at recogizer.de> wrote:
> That's the problem: nobody restarted the server at that time.

Please provide the engine log from this time so we can see whether the
restart was triggered by the engine.

>
> Is there any scenario in which the engine restarts the hypervisor?
>
> Cheers
>
> Erekle
>
>
>
> On 10/16/2017 04:45 PM, Piotr Kliczewski wrote:
>>
>> Erekle,
>>
>> For the time period you mentioned I do not see anything wrong on the vdsm
>> side except for a restart at 2017-10-15 16:28:50,993+0200. It looks
>> like a manual restart.
>> The engine log starts at 2017-10-16 03:49:04,092+02, so I am not able to
>> say whether there was anything else apart from the heartbeat issue caused
>> by the restart.
>>
>> The restart was the cause of "connection reset by peer" on the mom side.
>>
>> Thanks,
>> Piotr
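
As an illustration (not part of the original exchange), the reasoning above can be checked mechanically: the engine log cannot explain the restart because it begins after it. The sketch below is a minimal example that uses only the two timestamps quoted in this message; the helper name parse_log_ts is hypothetical, and the padding of the short "+02" offset to "+0200" is an assumption about how the engine log abbreviates its timezone.

```python
import re
from datetime import datetime


def parse_log_ts(text):
    """Parse 'YYYY-MM-DD HH:MM:SS,mmm+HH[MM]' into a timezone-aware datetime.

    The engine log in the thread writes the offset as '+02'; it is padded
    to '+0200' here so that strptime's %z directive accepts it.
    """
    match = re.fullmatch(r"(.+?)([+-])(\d{2})(\d{2})?", text.strip())
    if match is None:
        raise ValueError("unrecognised timestamp: %r" % text)
    body, sign, hours, minutes = match.groups()
    normalised = "%s%s%s%s" % (body, sign, hours, minutes or "00")
    return datetime.strptime(normalised, "%Y-%m-%d %H:%M:%S,%f%z")


# Timestamps copied from the message above.
restart = parse_log_ts("2017-10-15 16:28:50,993+0200")
engine_log_start = parse_log_ts("2017-10-16 03:49:04,092+02")

# True when the engine log begins only after the vdsm restart.
engine_log_misses_restart = engine_log_start > restart
gap = engine_log_start - restart

if __name__ == "__main__":
    print("engine log starts after the restart:", engine_log_misses_restart)
    print("uncovered interval:", gap)
```

If the comparison is true, an older engine log is needed to see what the engine did around the time of the restart.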
>>
>> On Mon, Oct 16, 2017 at 4:21 PM, Erekle Magradze
>> <erekle.magradze at recogizer.de> wrote:
>>>
>>> Hi Piotr,
>>>
>>> I've restarted the vdsm daemon on certain nodes several times; that
>>> could be the reason.
>>>
>>> The failure I mentioned happened yesterday between 15:00 and 17:00.
>>>
>>> Cheers
>>>
>>> Erekle
>>>
>>>
>>>
>>> On 10/16/2017 04:13 PM, Piotr Kliczewski wrote:
>>>>
>>>> Erekle,
>>>>
>>>> In the logs you provided I see:
>>>>
>>>> IOError: [Errno 5] _handleRequests._checkForMail - Could not read mailbox:
>>>> /rhev/data-center/6d52512e-1c02-4509-880a-bf57cbad4bdf/mastersd/dom_md/inbox
>>>>
>>>> and
>>>>
>>>> StorageDomainMasterError: Error validating master storage domain: ('MD read error',)
>>>>
>>>> which seems to be the reason vdsm was killed by sanlock, which in turn
>>>> caused the connection reset by peer.
>>>>
>>>> After the vdsm restart, storage looks good.
>>>>
>>>> @Nir can you take a look?
>>>>
>>>> Thanks,
>>>> Piotr
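
As an illustration (not part of the original exchange), the two error signatures named above can be counted in a vdsm log with a few lines of Python. This is a minimal sketch: the function and dictionary names are hypothetical, the sample lines are the ones quoted in this message, and the log path mentioned afterwards is an assumption about a default installation.

```python
# Substrings taken from the two errors quoted in the message above.
SIGNATURES = {
    "mailbox_read_error": "Could not read mailbox",
    "master_domain_error": "StorageDomainMasterError",
}


def scan_for_storage_errors(lines):
    """Count how many lines contain each storage-related error signature."""
    hits = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, needle in SIGNATURES.items():
            if needle in line:
                hits[name] += 1
    return hits


# Sample input built from the lines quoted in the thread.
sample = [
    "IOError: [Errno 5] _handleRequests._checkForMail - Could not read mailbox:",
    "StorageDomainMasterError: Error validating master storage domain: "
    "('MD read error',)",
    "an unrelated log line",
]

if __name__ == "__main__":
    print(scan_for_storage_errors(sample))
```

To run it against a real host, pass an open file object instead of the sample list, for example open("/var/log/vdsm/vdsm.log") if that is where vdsm logs on your system. Non-zero counts shortly before a vdsm restart would be consistent with the storage explanation given above, but they do not prove it.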
>>>>
>>>> On Mon, Oct 16, 2017 at 3:59 PM, Erekle Magradze
>>>> <erekle.magradze at recogizer.de> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> The issue is the following: after installing oVirt 4.1 on three nodes
>>>>> with GlusterFS as storage, the oVirt engine reported failed events
>>>>> with the following message:
>>>>>
>>>>> VDSM hostname command GetStatsVDS failed: Connection reset by peer
>>>>>
>>>>> After that, oVirt tried to fence the affected host and it was excluded
>>>>> from production. Luckily I am not running any VMs on it yet.
>>>>>
>>>>> The logs are attached; don't be surprised by the hostnames :)
>>>>>
>>>>> Thanks in advance
>>>>>
>>>>> Cheers
>>>>>
>>>>> Erekle
>>>>>
>>>>>
>>>>> On 10/16/2017 03:37 PM, Dafna Ron wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Can you please tell us what issue you are actually facing? :)
>>>>> It would be easier to debug an issue than an error message that can be
>>>>> caused by several things.
>>>>>
>>>>> Also, can you provide the engine and the vdsm logs?
>>>>>
>>>>> thank you,
>>>>> Dafna
>>>>>
>>>>>
>>>>> On 10/16/2017 02:30 PM, Erekle Magradze wrote:
>>>>>
>>>>> There was a typo in the failure message.
>>>>>
>>>>> This is what I was getting:
>>>>>
>>>>> VDSM hostname command GetStatsVDS failed: Connection reset by peer
>>>>>
>>>>>
>>>>> On 10/16/2017 03:21 PM, Erekle Magradze wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> It's getting clearer now; indeed, the momd service is disabled:
>>>>>
>>>>> ● momd.service - Memory Overcommitment Manager Daemon
>>>>>      Loaded: loaded (/usr/lib/systemd/system/momd.service; static; vendor preset: disabled)
>>>>>      Active: inactive (dead)
>>>>>
>>>>> mom-vdsm is enabled and running:
>>>>>
>>>>> ● mom-vdsm.service - MOM instance configured for VDSM purposes
>>>>>      Loaded: loaded (/usr/lib/systemd/system/mom-vdsm.service; enabled; vendor preset: enabled)
>>>>>      Active: active (running) since Mon 2017-10-16 15:14:35 CEST; 1min 3s ago
>>>>>    Main PID: 27638 (python)
>>>>>      CGroup: /system.slice/mom-vdsm.service
>>>>>              └─27638 python /usr/sbin/momd -c /etc/vdsm/mom.conf
>>>>>
>>>>> The reason I started digging into mom problems is the following
>>>>> failure:
>>>>>
>>>>> VDSM hostname command GetStatsVDSThanks failed: Connection reset by peer
>>>>>
>>>>> It is causing fencing of the node where the failure happens.
>>>>> What could be the reason for the GetStatsVDS failure?
>>>>>
>>>>> Best Regards
>>>>> Erekle
>>>>>
>>>>>
>>>>> On 10/16/2017 03:11 PM, Martin Sivak wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> how do you start MOM? MOM is supposed to talk to vdsm; we do not talk
>>>>> to libvirt directly. The line you posted comes from vdsm, and vdsm is
>>>>> telling you it can't talk to MOM.
>>>>>
>>>>> Which MOM service is enabled? There are two, momd and mom-vdsm, and
>>>>> the second one is the one that should be enabled.
>>>>>
>>>>> Best regards
>>>>>
>>>>> Martin Sivak
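
As an illustration (not part of the original exchange), the check described above can be scripted. This is a minimal sketch with hypothetical function names: unit_state wraps the standard "systemctl is-enabled" command, and mom_setup_ok encodes the rule from this message that mom-vdsm is the unit that should be enabled. Treating an enabled momd as a misconfiguration is an assumption of this sketch, not something the message states outright.

```python
import subprocess


def unit_state(unit):
    """Return what `systemctl is-enabled UNIT` prints, e.g. 'enabled' or 'static'."""
    result = subprocess.run(
        ["systemctl", "is-enabled", unit],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
    )
    return result.stdout.strip()


def mom_setup_ok(states):
    """Check a {unit name: state} mapping against the rule from the thread.

    mom-vdsm.service must be enabled. Rejecting an enabled momd.service is
    an assumption made for this sketch.
    """
    return (
        states.get("mom-vdsm.service") == "enabled"
        and states.get("momd.service") != "enabled"
    )


if __name__ == "__main__":
    units = ("momd.service", "mom-vdsm.service")
    current = {unit: unit_state(unit) for unit in units}
    print(current, "->", "ok" if mom_setup_ok(current) else "check MOM units")
```

A host where momd is "static" and mom-vdsm is "enabled" passes this check.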
>>>>>
>>>>>
>>>>> On Mon, Oct 16, 2017 at 3:04 PM, Erekle Magradze
>>>>> <erekle.magradze at recogizer.de> wrote:
>>>>>
>>>>> Hi Martin,
>>>>>
>>>>> Thanks for the answer. Unfortunately this warning message persists.
>>>>> Does it mean that mom cannot communicate with libvirt? How critical
>>>>> is it?
>>>>>
>>>>> Best
>>>>>
>>>>> Erekle
>>>>>
>>>>>
>>>>>
>>>>> On 10/16/2017 03:03 PM, Martin Sivak wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> it is just a warning, there is nothing you have to solve unless it
>>>>> does not resolve itself within a minute or so. If it happens only once
>>>>> or twice after vdsm or mom restart then you are fine.
>>>>>
>>>>> Best regards
>>>>>
>>>>> --
>>>>> Martin Sivak
>>>>> SLA / oVirt
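
As an illustration (not part of the original exchange), the rule of thumb above ("fine unless it does not resolve itself within a minute or so") can be turned into a rough check over journal lines. This is a minimal sketch under several assumptions: the function names and the 60-second threshold are hypothetical, the year is supplied by the caller because journal timestamps omit it, month names are parsed in an English locale, and vdsm throttles this warning, so the span between the first and last occurrence is only an approximation.

```python
from datetime import datetime

NEEDLE = "MOM not available"


def warning_span_seconds(lines, year=2017):
    """Seconds between the first and last 'MOM not available' line, or None.

    Expects journal-style lines that begin with 'Mon DD HH:MM:SS'.
    """
    stamps = []
    for line in lines:
        if NEEDLE in line:
            stamp = " ".join(line.split()[:3])
            stamps.append(
                datetime.strptime("%d %s" % (year, stamp), "%Y %b %d %H:%M:%S")
            )
    if not stamps:
        return None
    return (max(stamps) - min(stamps)).total_seconds()


def needs_attention(lines, threshold=60.0):
    """True when the warning keeps appearing for longer than the threshold."""
    span = warning_span_seconds(lines)
    return span is not None and span > threshold


# The three journal lines reported in the thread, joined back into single lines.
sample = [
    "Oct 16 14:26:52 hostname vdsmd[2392]: vdsm throttled WARN MOM not available.",
    "Oct 16 14:26:52 hostname vdsmd[2392]: vdsm throttled WARN MOM not available, "
    "KSM stats will be missing.",
    "Oct 16 14:26:57 hostname vdsmd[2392]: vdsm root WARN ping was deprecated in "
    "favor of ping2 and confirmConnectivity",
]

if __name__ == "__main__":
    print("span:", warning_span_seconds(sample))
    print("needs attention:", needs_attention(sample))
```

For the lines reported in this thread both warnings carry the same timestamp, so the check does not flag them.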
>>>>>
>>>>> On Mon, Oct 16, 2017 at 2:44 PM, Erekle Magradze
>>>>> <erekle.magradze at recogizer.de> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> after running systemctl status vdsm I see that it's running, with
>>>>> these messages at the end:
>>>>>
>>>>> Oct 16 14:26:52 hostname vdsmd[2392]: vdsm throttled WARN MOM not available.
>>>>> Oct 16 14:26:52 hostname vdsmd[2392]: vdsm throttled WARN MOM not available, KSM stats will be missing.
>>>>> Oct 16 14:26:57 hostname vdsmd[2392]: vdsm root WARN ping was deprecated in favor of ping2 and confirmConnectivity
>>>>>
>>>>> How critical is it, and how can I resolve that warning?
>>>>>
>>>>> I am using libvirt.
>>>>>
>>>>> Cheers
>>>>>
>>>>> _______________________________________________
>>>>> Users mailing list
>>>>> Users at ovirt.org
>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>>
>>>>>
>>>>> --
>>>>> Recogizer Group GmbH
>>>>>
>>>>> Dr.rer.nat. Erekle Magradze
>>>>> Lead Big Data Engineering & DevOps
>>>>> Rheinwerkallee 2, 53227 Bonn
>>>>> Tel: +49 228 29974555
>>>>>
>>>>> E-Mail erekle.magradze at recogizer.de
>>>>> Web: www.recogizer.com
>>>>>
>>>>> Recogizer on LinkedIn https://www.linkedin.com/company-beta/10039182/
>>>>> Follow us on Twitter https://twitter.com/recogizer
>>>>>
>>>>> -----------------------------------------------------------------
>>>>> Recogizer Group GmbH
>>>>> Managing directors: Oliver Habisch, Carsten Kreutze
>>>>> Commercial register: Amtsgericht Bonn HRB 20724
>>>>> Registered office: Bonn; VAT ID no.: DE294195993
>>>>>
>>>>> This e-mail contains confidential and/or legally protected
>>>>> information.
>>>>> If you are not the intended recipient or have received this e-mail
>>>>> in error, please notify the sender immediately and delete this e-mail.
>>>>> Unauthorized copying and unauthorized disclosure of this e-mail and
>>>>> the information it contains are not permitted.
>>>>>

