On Sun, Sep 1, 2019 at 8:23 PM Yedidyah Bar David <didi@redhat.com> wrote:
On Sun, Sep 1, 2019 at 3:37 PM Amit Bawer <abawer@redhat.com> wrote:
>
>
>
> On Sun, Sep 1, 2019 at 2:34 PM Yedidyah Bar David <didi@redhat.com> wrote:
>>
>> On Sun, Sep 1, 2019 at 1:20 PM Amit Bawer <abawer@redhat.com> wrote:
>> >
>> >
>> >
>> > On Sun, Sep 1, 2019 at 10:28 AM Yedidyah Bar David <didi@redhat.com> wrote:
>> >>
>> >> Hi all,
>> >>
>> >> That's a "sub-thread" of "unicode sandwich in otopi/engine-setup".
>> >>
>> >> I was recommended to use 'six.text_type() over "u''". I did read [1],
>> >> and eventually decided that my own preference is to just add "u"
>> >> prefix. Reasoning is inside [1].
>> >>
>> >> Do people have different preferences/reasoning they want to share?
>> >>
>> >> Do people think we should have project-wide policy re this?
>> >
>> >
>> > Since our code is currently transitioning from py2 to py2/py3, and not from py3 to py3/py2, it is fair to assume that most
>> > existing string literals in it contain only ascii symbols, unless explicitly stated otherwise;
>> > so IMO it would only make sense to enforce 'u' on newly added literals that involve non-ascii symbols, as long as py2 is still alive.
>>
>> Not exactly.
>>
>> Suppose (mostly correctly) that the code didn't employ the "unicode
>> sandwich" technique so far. Meaning, much was handled as python2 str
>> objects containing utf-8-encoded strings, and converted to unicode
>> objects mainly as needed/noted/considered. Suppose that x is a
>> variable that used to contain such an str, usually ascii-only, but
>> sometimes perhaps utf-8. Now, this:
>>
>> 'x: {}'.format(x)
>>
>> would work, and replace {} with the contents of x, and return a
>> python2 str, utf-8-encoded if x is utf-8. But if now x contains a
>> unicode object (because we decided to follow the sandwich approach,
>> and decode all utf-8 input to unicode), it would fail, if x is not
>> ascii-only. Adding u to 'x: {}' solves this.
>
>
> utf-8 is an ascii extension, meaning that the first 128 ordinals agree for both encodings, so the unicode sandwich has no negative effect on your example.
> It would only be a problem if the input for x originally had a non-ascii character in it, but that would have been an issue for py2 in the first place, regardless of py3 sandwiches.

Let me clarify:

Thanks, now I see where I was wrong.


In python2:

If I start with:

x='א'

py2: x is a 2-byte str: '\xd7\x90'
py3: x is a unicode str with the single code point '\u05d0'


'{}'.format(x)

Works. 

py2: both the format string and x are byte strings, so nothing is decoded and it's fine.
py3: x is already a unicode str by default, so it's fine.
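
To make that concrete, a minimal sketch of this first case (py2 semantics; the escaped bytes below are just the utf-8 encoding of 'א'):

# py2: both the format string and x are byte strings (str),
# so .format() only copies bytes and never tries to decode anything.
x = '\xd7\x90'
s = '{}'.format(x)
assert s == '\xd7\x90'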
 

If I then employ the sandwich, and therefore effectively change the code to be:

x=u'א'

Now py2 and py3 agree on the contents of x, so sandwiching seems like the right choice to make sure they treat x the same way.


'{}'.format(x)

Fails (in py2), with a UnicodeEncodeError.
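
Roughly, a minimal py2 sketch of the failure (same 'א', now as a unicode escape):

# py2: the format string is a byte string, so the unicode argument must be
# encoded back to bytes, and the implicit ascii codec can't handle U+05D0.
x = u'\u05d0'
'{}'.format(x)    # py2: raises UnicodeEncodeError; py3: works fine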

To fix, I can change it to:

u'{}'.format(x)

Seems like a legitimate option to bridge the gap between py2's default byte strings and py3's default unicode strings.
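
A minimal sketch of that fix, for reference:

# With a unicode format string the result is unicode as well,
# so nothing gets encoded behind our back and py2 behaves like py3.
x = u'\u05d0'
s = u'{}'.format(x)
assert s == u'\u05d0'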


Or, to import unicode_literals and keep the existing code line(s).

Both work.
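
And a sketch of the unicode_literals variant (a standalone example, not actual otopi/engine-setup code):

# With unicode_literals imported, the plain '{}' below is already a unicode
# literal, so the existing format line works unchanged on py2 and py3.
from __future__ import unicode_literals

x = u'\u05d0'
s = '{}'.format(x)
assert s == u'\u05d0'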

In actual code, the assignment to x will likely be in a different
module, and/or come from user input rather than a literal, but '{}' _will_
be a literal.

Do people have preferences? Can people share their reasoning for their
preferences? Do you think we should have policies, or should it be up to each
git repo, or even each patch author+maintainers/reviewers, to decide?

As discussed in the original [1], both have pros and cons. Personally
I prefer "u''". But not strongly, because we try to keep our modules
rather small, so it's not like you add a single import line that
changes the semantics of hundreds or thousands of lines. Usually, it's
rather easy to decide that such an import is ok. Ideally, we'd have
full code coverage in our tests, including utf-8 everywhere, but I
think we are quite far from that, for now.

Thanks and best regards,

>
>>
>> So I also have to handle all such existing literals, at least those
>> that would now have to handle unicode vars.
>>
>> >
>> >>
>> >>
>> >> Personally, I do not see the big advantage of adding "six.text_type()"
>> >> (15 chars) instead of a single "u". I do see where it can be useful,
>> >> but not as a very long replacement, IMO, for "u", or for
>> >> unicode_literals.
>> >
>> >
>> > Once py2 is officially terminated, probably neither option mentioned above will be meaningful, as unicode is py3's default string type;
>> > however, IMO for literals an explicit 'u' is a more native approach, and makes the programmer's intentions clearer than
>> > a global switch in the form of importing unicode_literals. Using six.text_type() is probably a good solution nowadays for variables rather than literals,
>> > and will probably have to die off some day after py2 does the same.
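
(For reference, a minimal sketch of what six.text_type() usage for a variable might look like; the helper below is made up for illustration, not something from our code:)

import six

def as_text(value, encoding='utf-8'):
    # Hypothetical helper: decode bytes explicitly, pass everything else
    # through six.text_type (unicode on py2, str on py3).
    if isinstance(value, bytes):
        return value.decode(encoding)
    return six.text_type(value)

assert as_text(b'\xd7\x90') == u'\u05d0'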
>> >
>> >>
>> >>
>> >> Thanks and best regards,
>> >>
>> >> [1] http://python-future.org/unicode_literals.html
>> >> --
>> >> Didi
>>
>>
>>
>> --
>> Didi



--
Didi