[Users] Excessive syslog logging from vdsm/sampling.py

Nir Soffer nsoffer at redhat.com
Thu Dec 19 15:35:14 EST 2013


----- Original Message -----
> From: "Nir Soffer" <nsoffer at redhat.com>
> To: "Sander Grendelman" <sander at grendelman.com>
> Cc: users at ovirt.org, "Michal Skrivanek" <mskrivan at redhat.com>
> Sent: Wednesday, December 18, 2013 5:46:37 PM
> Subject: Re: [Users] Excessive syslog logging from vdsm/sampling.py
> 
> ----- Original Message -----
> > From: "Sander Grendelman" <sander at grendelman.com>
> > To: "Nir Soffer" <nsoffer at redhat.com>
> > Cc: "Michal Skrivanek" <mskrivan at redhat.com>, users at ovirt.org, "Ayal Baron"
> > <abaron at redhat.com>
> > Sent: Wednesday, December 18, 2013 4:00:21 PM
> > Subject: Re: [Users] Excessive syslog logging from vdsm/sampling.py
> > 
> > On Wed, Dec 18, 2013 at 1:01 PM, Nir Soffer <nsoffer at redhat.com> wrote:
> > >> > We should check why vmDrive does not have the format attribute in your
> > >> > case.
> > >>
> > >> Probably because this was checked during migration it's a race between
> > >> migration finish and monitoring checking this.
> > >> You can see the error repeated in other cases as well (in bugzilla):
> > >> 988047,
> > >> 994534
> > >
> > > Then this patch should fix your problem:
> > > http://gerrit.ovirt.org/22518
> > 
> > I applied this patch and restarted vdsmd on both nodes
> > (maintenance->restart->activate).
> > The errors still occur, but it is limited to fewer machines ( see [1] ).
> > 
> > - Almost all VMs were imported with virt-v2v (not only the VMs with
> > errors).
> > - Almost all VMs were migrated to another storage domain (offline).
> > - Migration to another host was done with all VMs
> > 
> > The current vdsm logs for both are attached to this e-mail.
> > Please let me know if you need any additional information.
> > 
> > [1]: unique sampling.py errors on both nodes.
> > 
> > #### Node 1
> > ## Before patch
> > [root at gnkvm01 ~]# tail -n 2000 /var/log/messages-20131215 | awk
> > '/sampling.py/ {print $8}' | sort -u
> > vmId=`0ae3a3d7-ead9-4c0d-9df0-3901b6e6859c`::Stats
> > vmId=`22654002-cbef-454d-b001-7823da5f592f`::Stats
> > vmId=`3e481c73-57df-4dc3-8b1c-421a74308a5e`::Stats
> > vmId=`57dbe688-4e18-4358-aa3e-f3f6022ef9b3`::Stats
> > vmId=`66aa5555-2299-4d93-931d-b7a2e421b7e9`::Stats
> > vmId=`6df65698-4995-4c75-9433-75affe9b9c38`::Stats
> > vmId=`9260c69c-93a2-4f8a-b5e9-eaab5e4f4708`::Stats
> > vmId=`9edb3e08-f098-4633-a122-e5ba29ae12ea`::Stats
> > vmId=`c6f56584-1ccd-4c02-be94-897a4e747d34`::Stats
> > vmId=`d3dae626-279b-4bcf-afc4-7a3c198a3035`::Stats
> > 
> > ## After patch
> > [root at gnkvm01 ~]# tail -n 2000 /var/log/messages | awk '/sampling.py/
> > {print $8}' | sort -u
> > vmId=`007ca72e-d0d0-4477-87d4-fb60328cd882`::Stats
> > vmId=`1075a178-a4c6-4a8f-a199-56401cd0652f`::Stats
> > 
> > #### Node 2
> > ## Before patch
> > [root at gnkvm02 ~]# tail -n 2000 /var/log/messages-20131215 | awk
> > '/sampling.py/ {print $8}' | sort -u
> > vmId=`00317758-16fe-4ac6-b9fd-d522c9908861`::Stats
> > vmId=`007ca72e-d0d0-4477-87d4-fb60328cd882`::Stats
> > vmId=`06405f12-d763-4bd6-b5e5-997e3f6bb1f6`::Stats
> > vmId=`1075a178-a4c6-4a8f-a199-56401cd0652f`::Stats
> > vmId=`1bba8930-9c04-4c5c-8b15-c9fe14022cb5`::Stats
> > vmId=`2036c21d-e0a4-4d55-a9a7-4cd9dd9d250d`::Stats
> > vmId=`5fff0cc7-24e4-4e4a-b220-ba49f9145060`::Stats
> > vmId=`86708f62-fcc6-4d0f-978a-3788a61f9775`::Stats
> > vmId=`9b8e6d07-295c-404d-a672-efc94a24b6bc`::Stats
> > vmId=`aa0445b6-8ca5-4557-9f9b-ee543d6435df`::Stats
> > 
> > ## After patch
> > [root at gnkvm02 ~]# tail -n 2000 /var/log/messages | awk '/sampling.py/
> > {print $8}' | sort -u
> > vmId=`d3dae626-279b-4bcf-afc4-7a3c198a3035`::Stats
> 
> I think I understand your logs:
> 
> On node1 you started vdsm with the patch on 2013-12-18 13:33:18.
> 
> The error "object has no attribute 'format'" is gone with this patch.
> 
> On node2 I don't see this error at all, and I don't see a restart of vdsm.
> Maybe the entire log was generated with the patch?
> 
> It would be nice if you open a bug for this issue.
> 
> So one problem is gone, but you have other problems:
> 
> Thread-361::DEBUG::2013-12-18
> 14:41:42,512::libvirtconnection::108::libvirtconnection::(wrapper) Unknown
> libvirterror: ecode: 8 edom: 10 level: 2 message: invalid argument: invalid
> path
> /rhev/data-center/mnt/blockSD/b526b148-b810-47c6-9bdd-4fd8d8226855/images/c928f2c7-9ba0-496f-8be9-b8804fdc1a6d/a2da1070-8de7-4e1f-b736-4c88d089a5cc
> not assigned to domain
> 
> $ grep 'not assigned to domain' vdsm-node1.log | wc -l
> 11488
> $ grep 'not assigned to domain' vdsm-node2.log | wc -l
> 3326
> 
> $ grep 'not assigned to domain' vdsm-node1.log | awk '{print $6}' | sort -u
> /rhev/data-center/mnt/blockSD/b526b148-b810-47c6-9bdd-4fd8d8226855/images/347f2238-c018-4370-94df-bd1e81f8b854/9e5dad95-73ea-4e5c-aa13-522efd9bad11
> /rhev/data-center/mnt/blockSD/b526b148-b810-47c6-9bdd-4fd8d8226855/images/47015859-1995-4ce2-880c-a3c7068a67dd/341b32d6-4276-454d-b3f0-789b705c99cc
> /rhev/data-center/mnt/blockSD/b526b148-b810-47c6-9bdd-4fd8d8226855/images/c928f2c7-9ba0-496f-8be9-b8804fdc1a6d/a2da1070-8de7-4e1f-b736-4c88d089a5cc
> 
> $ grep 'not assigned to domain' vdsm-node2.log | awk '{print $6}' | sort -u
> /rhev/data-center/mnt/blockSD/b526b148-b810-47c6-9bdd-4fd8d8226855/images/347f2238-c018-4370-94df-bd1e81f8b854/9e5dad95-73ea-4e5c-aa13-522efd9bad11
> 
> This issue deserve its own bug.

The version you run has several problems:

1. Sometimes drive does not have a format attribute. We do not know yet why this 
   happens.

2. Sometimes sampling drive fails because the drive path is unknown to libvirt.
   This may be a non-existing drive that should have been removed. We do not 
   know yet why this happens.

3. When such error happen, and you have multiple disks on a vm, sampling stop
   and next disks in the disk list are skipped. This can explain why your disk
   stats are constant.

4. The errors spam your vdsm log and syslog

These 3 patches do not fix the underlying problems causing the errors, but 
improve error handling and decrease log spam.

1. http://gerrit.ovirt.org/22518 - vm: Avoid log spamming when drive format is undefined
   This is a cleaner replacement for the patch you already applied, and required by 
   the next two patches. This patch is likely to be merged to master

2. http://gerrit.ovirt.org/22575 - vm: Continue to sample after errors
   This patch fix error handling so failure to sample one disk does not effect sampling
   of other disks. You should see disk stats for "good" drives even if you have one "bad" 
   drive on a vm.

3. http://gerrit.ovirt.org/22576 - vm: Decrease log spam for expected errors
   This patch logs one line for expected errors instead of 15 lines traceback, 
   reducing log spam.

It will be nice if you test these patches and report the results.

You ping me on #vdsm if you like to discuss this or need more help.

Thanks,
Nir


More information about the Users mailing list