On Tue, Feb 5, 2019 at 9:25 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Tue, Feb 5, 2019 at 7:15 PM Milan Zamazal <mzamazal@redhat.com> wrote:
Hi, I'm trying to run Vdsm with Python 3, which exposes various Python 3
incompatibilities in Vdsm code.  I'm currently semi-stuck on Vdsm
crashing very shortly after startup with:

  Panic: Error initializing IRS
  Traceback (most recent call last):
    File "/usr/lib/python3.6/site-packages/vdsm/vdsmd.py", line 95, in serve_clients
      irs = Dispatcher(HSM())
    File "/usr/lib/python3.6/site-packages/vdsm/storage/dispatcher.py", line 46, in __init__
      self._exposeFunctions(obj)
    File "/usr/lib/python3.6/site-packages/vdsm/storage/dispatcher.py", line 56, in _exposeFunctions
      if hasattr(funcObj, _EXPORTED_ATTRIBUTE) and callable(funcObj):
    File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 85, in __getattr__
      raise se.StoragePoolNotConnected
  vdsm.storage.exception.StoragePoolNotConnected: Storage pool not connected: ()

This should be fixed first. exposing HSM function should not raise this error.

Can you check if https://gerrit.ovirt.org/c/97590/ fixes this issue?


The best way would be to write tests for the dispatcher and run the tests with python 3.
When the dispatcher is ported to python 3 you can try again to start vdsm.

Even if you port entire vdsm to python 3, sanlock does not ship yet
python 3 bindings, so you cannot have any storage except local storage,
iso domain and export domain.

That wouldn't be a problem by itself, errors are expected in Python 3
experiments.  However the problem is that this is often the *only* error
I receive.

I've found out that HSM() call spawns threads, which apparently don't
have chance to report their errors before the panic above kills Vdsm.
For instance, there was a string-byte mismatch in running iscsiadm
scanning commands.  Only after I added time.sleep to the end of
HSM.__init__, I got an error + traceback from that and could see what's
wrong.  That worries me a bit, because it might mean that if some
initialization external commands are not fast enough, Vdsm may not be
able to start.

The real fix for this issue to eliminate the storage startup code.

Storage should do the minimal initialization needed (e.g. start monitoring
udev events) and wait for engine. When engine ask to connect to storage,
vdsm should the needed scanning and lvm initialization that is done now
during startup.

Then if some storage code fail, only connecting to this kind of storage 
would fail and vdsm will not panic.

As temporary solution you can disable the storage refresh thread
started in HSM.__init__, and set HSM.ready to True immediately.
This will have similar effect to the long term fix.

After fixing some problems, I get the panic again, even with the sleep
at the end of HSM.__init__, and again without any other error.  It's
difficult to proceed in such a situation -- any tips what to look at or
to look for are welcome.

You can disable irs in the config, but then you will have no storage.
Maybe this is good enough at this point.

Nir
 

FWIW, my vdsm.log now ends with

  INFO  (MainThread) [storage.check] Starting check service (check:91)
  DEBUG (check/loop) [root] START thread <Thread(check/loop, started daemon 140180434425600)> (func=<bound method EventLoop.run_forever of <EventLoop running=True closed=False at 0x140181096264704>>, args=(), kwargs={}) (concurrent:193)
  INFO  (check/loop) [storage.asyncevent] Starting <EventLoop running=True closed=False at 0x140181096264704> (asyncevent:125)
  DEBUG (hsm/init) [storage.SamplingMethod] Returning last result (misc:386)
  DEBUG (hsm/init) [storage.SamplingMethod] Trying to enter sampling method (vdsm.storage.hba.<function rescan at 0x7f7e7a597e18>) (misc:376)
  DEBUG (hsm/init) [storage.SamplingMethod] Got in to sampling method (misc:379)
  DEBUG (hsm/init) [storage.HBA] Starting scan (hba:58)
  DEBUG (hsm/init) [storage.HBA] Scan finished (hba:64)
  DEBUG (hsm/init) [storage.SamplingMethod] Returning last result (misc:386)
  DEBUG (hsm/init) [root] /usr/bin/taskset --cpu-list 0-0 /sbin/udevadm settle --timeout=5 (cwd None) (commands:200)
  DEBUG (hsm/init) [root] SUCCESS: <err> = b''; <rc> = 0 (commands:221)
  DEBUG (hsm/init) [storage.SamplingMethod] Returning last result (misc:386)
  DEBUG (hsm/init) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-0 /usr/bin/sudo -n /sbin/lvm pvs --config 'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["r|.*|"] } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 } backup {  retain_min=50  retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,size,vg_name,vg_uuid,pe_start,pe_count,pe_alloc_count,mda_count,dev_size,mda_used_count (cwd None) (commands:200)
  DEBUG (hsm/init) [storage.Misc.excCmd] SUCCESS: <err> = b''; <rc> = 0 (commands:221)
  DEBUG (hsm/init) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-0 /usr/bin/sudo -n /sbin/lvm vgs --config 'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["r|.*|"] } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 } backup {  retain_min=50  retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name (cwd None) (commands:200)
  DEBUG (hsm/init) [storage.Misc.excCmd] SUCCESS: <err> = b''; <rc> = 0 (commands:221)
  DEBUG (hsm/init) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-0 /usr/bin/sudo -n /sbin/lvm lvs --config 'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["r|.*|"] } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 } backup {  retain_min=50  retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,vg_name,attr,size,seg_start_pe,devices,tags (cwd None) (commands:200)
  DEBUG (hsm/init) [storage.Misc.excCmd] SUCCESS: <err> = b''; <rc> = 0 (commands:221)
  DEBUG (hsm/init) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-0 /usr/bin/sudo -n /sbin/lvm vgs --config 'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["r|.*|"] } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 } backup {  retain_min=50  retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name (cwd None) (commands:200)
  DEBUG (hsm/init) [storage.Misc.excCmd] SUCCESS: <err> = b''; <rc> = 0 (commands:221)
  INFO  (hsm/init) [storage.HSM] FINISH HSM init succeeded in 0.43 seconds (hsm:410)
  DEBUG (hsm/init) [storage.HSM] FINISH thread <Thread(hsm/init, stopped daemon 140180442818304)> (concurrent:196)

and is completely silent after that.

Thanks,
Milan