Silent errors in storage code on Vdsm startup

Hi, I'm trying to run Vdsm with Python 3, which exposes various Python 3 incompatibilities in Vdsm code. I'm currently semi-stuck on Vdsm crashing very shortly after startup with: Panic: Error initializing IRS Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/vdsm/vdsmd.py", line 95, in serve_clients irs = Dispatcher(HSM()) File "/usr/lib/python3.6/site-packages/vdsm/storage/dispatcher.py", line 46, in __init__ self._exposeFunctions(obj) File "/usr/lib/python3.6/site-packages/vdsm/storage/dispatcher.py", line 56, in _exposeFunctions if hasattr(funcObj, _EXPORTED_ATTRIBUTE) and callable(funcObj): File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 85, in __getattr__ raise se.StoragePoolNotConnected vdsm.storage.exception.StoragePoolNotConnected: Storage pool not connected: () That wouldn't be a problem by itself, errors are expected in Python 3 experiments. However the problem is that this is often the *only* error I receive. I've found out that HSM() call spawns threads, which apparently don't have chance to report their errors before the panic above kills Vdsm. For instance, there was a string-byte mismatch in running iscsiadm scanning commands. Only after I added time.sleep to the end of HSM.__init__, I got an error + traceback from that and could see what's wrong. That worries me a bit, because it might mean that if some initialization external commands are not fast enough, Vdsm may not be able to start. After fixing some problems, I get the panic again, even with the sleep at the end of HSM.__init__, and again without any other error. It's difficult to proceed in such a situation -- any tips what to look at or to look for are welcome. FWIW, my vdsm.log now ends with INFO (MainThread) [storage.check] Starting check service (check:91) DEBUG (check/loop) [root] START thread <Thread(check/loop, started daemon 140180434425600)> (func=<bound method EventLoop.run_forever of <EventLoop running=True closed=False at 0x140181096264704>>, args=(), kwargs={}) (concurrent:193) INFO (check/loop) [storage.asyncevent] Starting <EventLoop running=True closed=False at 0x140181096264704> (asyncevent:125) DEBUG (hsm/init) [storage.SamplingMethod] Returning last result (misc:386) DEBUG (hsm/init) [storage.SamplingMethod] Trying to enter sampling method (vdsm.storage.hba.<function rescan at 0x7f7e7a597e18>) (misc:376) DEBUG (hsm/init) [storage.SamplingMethod] Got in to sampling method (misc:379) DEBUG (hsm/init) [storage.HBA] Starting scan (hba:58) DEBUG (hsm/init) [storage.HBA] Scan finished (hba:64) DEBUG (hsm/init) [storage.SamplingMethod] Returning last result (misc:386) DEBUG (hsm/init) [root] /usr/bin/taskset --cpu-list 0-0 /sbin/udevadm settle --timeout=5 (cwd None) (commands:200) DEBUG (hsm/init) [root] SUCCESS: <err> = b''; <rc> = 0 (commands:221) DEBUG (hsm/init) [storage.SamplingMethod] Returning last result (misc:386) DEBUG (hsm/init) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-0 /usr/bin/sudo -n /sbin/lvm pvs --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["r|.*|"] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,size,vg_name,vg_uuid,pe_start,pe_count,pe_alloc_count,mda_count,dev_size,mda_used_count (cwd None) (commands:200) DEBUG (hsm/init) [storage.Misc.excCmd] SUCCESS: <err> = b''; <rc> = 0 (commands:221) DEBUG (hsm/init) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-0 /usr/bin/sudo -n /sbin/lvm vgs --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["r|.*|"] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name (cwd None) (commands:200) DEBUG (hsm/init) [storage.Misc.excCmd] SUCCESS: <err> = b''; <rc> = 0 (commands:221) DEBUG (hsm/init) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-0 /usr/bin/sudo -n /sbin/lvm lvs --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["r|.*|"] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,vg_name,attr,size,seg_start_pe,devices,tags (cwd None) (commands:200) DEBUG (hsm/init) [storage.Misc.excCmd] SUCCESS: <err> = b''; <rc> = 0 (commands:221) DEBUG (hsm/init) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-0 /usr/bin/sudo -n /sbin/lvm vgs --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["r|.*|"] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name (cwd None) (commands:200) DEBUG (hsm/init) [storage.Misc.excCmd] SUCCESS: <err> = b''; <rc> = 0 (commands:221) INFO (hsm/init) [storage.HSM] FINISH HSM init succeeded in 0.43 seconds (hsm:410) DEBUG (hsm/init) [storage.HSM] FINISH thread <Thread(hsm/init, stopped daemon 140180442818304)> (concurrent:196) and is completely silent after that. Thanks, Milan

On Tue, Feb 5, 2019 at 7:15 PM Milan Zamazal <mzamazal@redhat.com> wrote:
Hi, I'm trying to run Vdsm with Python 3, which exposes various Python 3 incompatibilities in Vdsm code. I'm currently semi-stuck on Vdsm crashing very shortly after startup with:
Panic: Error initializing IRS Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/vdsm/vdsmd.py", line 95, in serve_clients irs = Dispatcher(HSM()) File "/usr/lib/python3.6/site-packages/vdsm/storage/dispatcher.py", line 46, in __init__ self._exposeFunctions(obj) File "/usr/lib/python3.6/site-packages/vdsm/storage/dispatcher.py", line 56, in _exposeFunctions if hasattr(funcObj, _EXPORTED_ATTRIBUTE) and callable(funcObj): File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 85, in __getattr__ raise se.StoragePoolNotConnected vdsm.storage.exception.StoragePoolNotConnected: Storage pool not connected: ()
This should be fixed first. exposing HSM function should not raise this error. The best way would be to write tests for the dispatcher and run the tests with python 3. When the dispatcher is ported to python 3 you can try again to start vdsm. Even if you port entire vdsm to python 3, sanlock does not ship yet python 3 bindings, so you cannot have any storage except local storage, iso domain and export domain. That wouldn't be a problem by itself, errors are expected in Python 3
experiments. However the problem is that this is often the *only* error I receive.
I've found out that HSM() call spawns threads, which apparently don't have chance to report their errors before the panic above kills Vdsm. For instance, there was a string-byte mismatch in running iscsiadm scanning commands. Only after I added time.sleep to the end of HSM.__init__, I got an error + traceback from that and could see what's wrong. That worries me a bit, because it might mean that if some initialization external commands are not fast enough, Vdsm may not be able to start.
The real fix for this issue to eliminate the storage startup code. Storage should do the minimal initialization needed (e.g. start monitoring udev events) and wait for engine. When engine ask to connect to storage, vdsm should the needed scanning and lvm initialization that is done now during startup. Then if some storage code fail, only connecting to this kind of storage would fail and vdsm will not panic. As temporary solution you can disable the storage refresh thread started in HSM.__init__, and set HSM.ready to True immediately. This will have similar effect to the long term fix. After fixing some problems, I get the panic again, even with the sleep
at the end of HSM.__init__, and again without any other error. It's difficult to proceed in such a situation -- any tips what to look at or to look for are welcome.
You can disable irs in the config, but then you will have no storage. Maybe this is good enough at this point. Nir
FWIW, my vdsm.log now ends with
INFO (MainThread) [storage.check] Starting check service (check:91) DEBUG (check/loop) [root] START thread <Thread(check/loop, started daemon 140180434425600)> (func=<bound method EventLoop.run_forever of <EventLoop running=True closed=False at 0x140181096264704>>, args=(), kwargs={}) (concurrent:193) INFO (check/loop) [storage.asyncevent] Starting <EventLoop running=True closed=False at 0x140181096264704> (asyncevent:125) DEBUG (hsm/init) [storage.SamplingMethod] Returning last result (misc:386) DEBUG (hsm/init) [storage.SamplingMethod] Trying to enter sampling method (vdsm.storage.hba.<function rescan at 0x7f7e7a597e18>) (misc:376) DEBUG (hsm/init) [storage.SamplingMethod] Got in to sampling method (misc:379) DEBUG (hsm/init) [storage.HBA] Starting scan (hba:58) DEBUG (hsm/init) [storage.HBA] Scan finished (hba:64) DEBUG (hsm/init) [storage.SamplingMethod] Returning last result (misc:386) DEBUG (hsm/init) [root] /usr/bin/taskset --cpu-list 0-0 /sbin/udevadm settle --timeout=5 (cwd None) (commands:200) DEBUG (hsm/init) [root] SUCCESS: <err> = b''; <rc> = 0 (commands:221) DEBUG (hsm/init) [storage.SamplingMethod] Returning last result (misc:386) DEBUG (hsm/init) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-0 /usr/bin/sudo -n /sbin/lvm pvs --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["r|.*|"] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,size,vg_name,vg_uuid,pe_start,pe_count,pe_alloc_count,mda_count,dev_size,mda_used_count (cwd None) (commands:200) DEBUG (hsm/init) [storage.Misc.excCmd] SUCCESS: <err> = b''; <rc> = 0 (commands:221) DEBUG (hsm/init) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-0 /usr/bin/sudo -n /sbin/lvm vgs --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["r|.*|"] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name (cwd None) (commands:200) DEBUG (hsm/init) [storage.Misc.excCmd] SUCCESS: <err> = b''; <rc> = 0 (commands:221) DEBUG (hsm/init) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-0 /usr/bin/sudo -n /sbin/lvm lvs --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["r|.*|"] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,vg_name,attr,size,seg_start_pe,devices,tags (cwd None) (commands:200) DEBUG (hsm/init) [storage.Misc.excCmd] SUCCESS: <err> = b''; <rc> = 0 (commands:221) DEBUG (hsm/init) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-0 /usr/bin/sudo -n /sbin/lvm vgs --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["r|.*|"] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name (cwd None) (commands:200) DEBUG (hsm/init) [storage.Misc.excCmd] SUCCESS: <err> = b''; <rc> = 0 (commands:221) INFO (hsm/init) [storage.HSM] FINISH HSM init succeeded in 0.43 seconds (hsm:410) DEBUG (hsm/init) [storage.HSM] FINISH thread <Thread(hsm/init, stopped daemon 140180442818304)> (concurrent:196)
and is completely silent after that.
Thanks, Milan

On Tue, Feb 5, 2019 at 9:25 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Tue, Feb 5, 2019 at 7:15 PM Milan Zamazal <mzamazal@redhat.com> wrote:
Hi, I'm trying to run Vdsm with Python 3, which exposes various Python 3 incompatibilities in Vdsm code. I'm currently semi-stuck on Vdsm crashing very shortly after startup with:
Panic: Error initializing IRS Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/vdsm/vdsmd.py", line 95, in serve_clients irs = Dispatcher(HSM()) File "/usr/lib/python3.6/site-packages/vdsm/storage/dispatcher.py", line 46, in __init__ self._exposeFunctions(obj) File "/usr/lib/python3.6/site-packages/vdsm/storage/dispatcher.py", line 56, in _exposeFunctions if hasattr(funcObj, _EXPORTED_ATTRIBUTE) and callable(funcObj): File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 85, in __getattr__ raise se.StoragePoolNotConnected vdsm.storage.exception.StoragePoolNotConnected: Storage pool not connected: ()
This should be fixed first. exposing HSM function should not raise this error.
Can you check if https://gerrit.ovirt.org/c/97590/ fixes this issue?
The best way would be to write tests for the dispatcher and run the tests with python 3. When the dispatcher is ported to python 3 you can try again to start vdsm.
Even if you port entire vdsm to python 3, sanlock does not ship yet python 3 bindings, so you cannot have any storage except local storage, iso domain and export domain.
That wouldn't be a problem by itself, errors are expected in Python 3
experiments. However the problem is that this is often the *only* error I receive.
I've found out that HSM() call spawns threads, which apparently don't have chance to report their errors before the panic above kills Vdsm. For instance, there was a string-byte mismatch in running iscsiadm scanning commands. Only after I added time.sleep to the end of HSM.__init__, I got an error + traceback from that and could see what's wrong. That worries me a bit, because it might mean that if some initialization external commands are not fast enough, Vdsm may not be able to start.
The real fix for this issue to eliminate the storage startup code.
Storage should do the minimal initialization needed (e.g. start monitoring udev events) and wait for engine. When engine ask to connect to storage, vdsm should the needed scanning and lvm initialization that is done now during startup.
Then if some storage code fail, only connecting to this kind of storage would fail and vdsm will not panic.
As temporary solution you can disable the storage refresh thread started in HSM.__init__, and set HSM.ready to True immediately. This will have similar effect to the long term fix.
After fixing some problems, I get the panic again, even with the sleep
at the end of HSM.__init__, and again without any other error. It's difficult to proceed in such a situation -- any tips what to look at or to look for are welcome.
You can disable irs in the config, but then you will have no storage. Maybe this is good enough at this point.
Nir
FWIW, my vdsm.log now ends with
INFO (MainThread) [storage.check] Starting check service (check:91) DEBUG (check/loop) [root] START thread <Thread(check/loop, started daemon 140180434425600)> (func=<bound method EventLoop.run_forever of <EventLoop running=True closed=False at 0x140181096264704>>, args=(), kwargs={}) (concurrent:193) INFO (check/loop) [storage.asyncevent] Starting <EventLoop running=True closed=False at 0x140181096264704> (asyncevent:125) DEBUG (hsm/init) [storage.SamplingMethod] Returning last result (misc:386) DEBUG (hsm/init) [storage.SamplingMethod] Trying to enter sampling method (vdsm.storage.hba.<function rescan at 0x7f7e7a597e18>) (misc:376) DEBUG (hsm/init) [storage.SamplingMethod] Got in to sampling method (misc:379) DEBUG (hsm/init) [storage.HBA] Starting scan (hba:58) DEBUG (hsm/init) [storage.HBA] Scan finished (hba:64) DEBUG (hsm/init) [storage.SamplingMethod] Returning last result (misc:386) DEBUG (hsm/init) [root] /usr/bin/taskset --cpu-list 0-0 /sbin/udevadm settle --timeout=5 (cwd None) (commands:200) DEBUG (hsm/init) [root] SUCCESS: <err> = b''; <rc> = 0 (commands:221) DEBUG (hsm/init) [storage.SamplingMethod] Returning last result (misc:386) DEBUG (hsm/init) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-0 /usr/bin/sudo -n /sbin/lvm pvs --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["r|.*|"] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,size,vg_name,vg_uuid,pe_start,pe_count,pe_alloc_count,mda_count,dev_size,mda_used_count (cwd None) (commands:200) DEBUG (hsm/init) [storage.Misc.excCmd] SUCCESS: <err> = b''; <rc> = 0 (commands:221) DEBUG (hsm/init) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-0 /usr/bin/sudo -n /sbin/lvm vgs --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["r|.*|"] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name (cwd None) (commands:200) DEBUG (hsm/init) [storage.Misc.excCmd] SUCCESS: <err> = b''; <rc> = 0 (commands:221) DEBUG (hsm/init) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-0 /usr/bin/sudo -n /sbin/lvm lvs --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["r|.*|"] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,vg_name,attr,size,seg_start_pe,devices,tags (cwd None) (commands:200) DEBUG (hsm/init) [storage.Misc.excCmd] SUCCESS: <err> = b''; <rc> = 0 (commands:221) DEBUG (hsm/init) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-0 /usr/bin/sudo -n /sbin/lvm vgs --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["r|.*|"] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name (cwd None) (commands:200) DEBUG (hsm/init) [storage.Misc.excCmd] SUCCESS: <err> = b''; <rc> = 0 (commands:221) INFO (hsm/init) [storage.HSM] FINISH HSM init succeeded in 0.43 seconds (hsm:410) DEBUG (hsm/init) [storage.HSM] FINISH thread <Thread(hsm/init, stopped daemon 140180442818304)> (concurrent:196)
and is completely silent after that.
Thanks, Milan

Nir Soffer <nsoffer@redhat.com> writes:
Even if you port entire vdsm to python 3, sanlock does not ship yet python 3 bindings, so you cannot have any storage except local storage, iso domain and export domain.
I started working on porting sanlock Python bindings to Python 3 in order to be able to access storage. I have only my private dirty changes at the moment; if they prove to be working at least partially, I can make patches based on them and post them.
Can you check if https://gerrit.ovirt.org/c/97590/ fixes this issue?
Yes, Vdsm starts with that patch and I can go on with my Python 3 work. Thank you for explanation and help! Milan

On Wed, Feb 6, 2019 at 11:11 AM Milan Zamazal <mzamazal@redhat.com> wrote:
Nir Soffer <nsoffer@redhat.com> writes:
Even if you port entire vdsm to python 3, sanlock does not ship yet python 3 bindings, so you cannot have any storage except local storage, iso domain and export domain.
I started working on porting sanlock Python bindings to Python 3 in order to be able to access storage. I have only my private dirty changes at the moment; if they prove to be working at least partially, I can make patches based on them and post them.
Cool! Marian from the LVM team has started to work on this but did post any patches yet. Lets continue the discussion about sanlock for python 3 in sanlock-devel mailing list: https://lists.fedorahosted.org/archives/list/sanlock-devel@lists.fedorahoste...
Can you check if https://gerrit.ovirt.org/c/97590/ fixes this issue?
Yes, Vdsm starts with that patch and I can go on with my Python 3 work.
Thank you for explanation and help!
Milan
participants (2)
-
Milan Zamazal
-
Nir Soffer