unicode_literals vs "u''" vs six.text_type

Yedidyah Bar David

1 Sep 2019

9:26 a.m.

Hi all,

That's a "sub-thread" of "unicode sandwich in otopi/engine-setup".

I was advised to use six.text_type() rather than u''. I did read [1], and eventually decided that my own preference is to just add the "u" prefix. The reasoning is in [1].

Do people have different preferences/reasoning they want to share?

Do people think we should have a project-wide policy on this?

Personally, I do not see a big advantage in writing "six.text_type()" (15 chars) instead of a single "u". I do see where it can be useful, but not as a very long replacement, IMO, for "u", or for unicode_literals.

Thanks and best regards,

[1] http://python-future.org/unicode_literals.html

-- Didi
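As a rough illustration of the options weighed above, here is a minimal sketch. The name text_type is a local stand-in for six.text_type (an assumption, so that the snippet needs no third-party import), and the byte string at the end only models what an unprefixed non-ASCII py2 literal holds in a UTF-8 source file.

```python
import sys

# Local stand-in for six.text_type (assumption: avoids importing six here).
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821

# Option 1: explicit prefix; a text literal on both py2 and py3.
prefixed = u'x: {}'

# Option 2: wrap the literal; on py2 this converts a byte str at run time.
wrapped = text_type('x: {}')

# Option 3 (not executed here): putting
#     from __future__ import unicode_literals
# at the top of the module makes every unprefixed literal in it text.

same_type = type(prefixed) is text_type and type(wrapped) is text_type

# On py2, text_type('...') decodes the byte literal with the default ASCII
# codec, so wrapping only works for ASCII-only literals. Modelled with bytes:
py2_literal_bytes = b'\xd7\x90'  # UTF-8 bytes of U+05D0
try:
    py2_literal_bytes.decode('ascii')
    wrap_non_ascii_ok = True
except UnicodeDecodeError:
    wrap_non_ascii_ok = False
```

For ASCII-only literals all three give the same text object; the difference is mostly one of verbosity and of how visible the choice is at each literal.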

Amit Bawer

1 Sep

12:20 p.m.

On Sun, Sep 1, 2019 at 10:28 AM Yedidyah Bar David <didi@redhat.com> wrote:

Hi all,

That's a "sub-thread" of "unicode sandwich in otopi/engine-setup".

I was recommended to use 'six.text_type() over "u''". I did read [1], and eventually decided that my own preference is to just add "u" prefix. Reasoning is inside [1].

Do people have different preferences/reasoning they want to share?

Do people think we should have project-wide policy re this?

Since our code is currently transitioning from py2 to py2/py3, and not from py3 to py3/py2, it is fair to assume that most existing string literals in it are ASCII-only, unless explicitly stated otherwise; so IMO it would only make sense to enforce 'u' on newly added literals that contain non-ASCII symbols, as long as py2 is still alive.

Personally, I do not see the big advantage of adding "six.text_type()" (15 chars) instead of a single "u". I do see where it can be useful, but not as a very long replacement, IMO, for "u", or for unicode_literals.

Once py2 is officially terminated, probably neither option mentioned above will be meaningful, since py3 strings are unicode by default; however, IMO for literals an explicit 'u' is the more native approach, and makes the programmer's intent clearer than a global switch in the form of importing unicode_literals. Using six.text_type() is probably a good solution nowadays for variables rather than literals, and will probably have to die off some day after py2 does.
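To make the "variables, not literals" point concrete, here is a small sketch. The helper to_text is hypothetical, and text_type is again a local stand-in for six.text_type rather than the real import.

```python
import sys

# Local stand-in for six.text_type (assumption: avoids importing six here).
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def to_text(value, encoding='utf-8'):
    """Hypothetical helper: return `value` as text on both py2 and py3."""
    if isinstance(value, bytes):
        # Bytes need an explicit decode; text_type(b'...') would not
        # decode them as UTF-8.
        return value.decode(encoding)
    return text_type(value)


port = 443
label = u'port: ' + to_text(port)   # non-string variable converted to text
name = to_text(b'\xd7\x90')         # UTF-8 bytes decoded to text
```

A prefix cannot be attached to a variable, which is where a conversion call earns its length; for a literal, the prefix says the same thing in one character.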

Thanks and best regards,

[1] http://python-future.org/unicode_literals.html -- Didi _______________________________________________ Devel mailing list -- devel@ovirt.org To unsubscribe send an email to devel-leave@ovirt.org Privacy Statement: https://www.ovirt.org/site/privacy-policy/ oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ List Archives: https://lists.ovirt.org/archives/list/devel@ovirt.org/message/SW3P4VOGBP43N5...

Yedidyah Bar David

1:33 p.m.

On Sun, Sep 1, 2019 at 1:20 PM Amit Bawer <abawer@redhat.com> wrote:

Since our code is currently transitioning from py2 to py2/py3, and not from py3 to py3/py2, it would be fair to assume that most already existing string literals in it contain ascii symbols, unless explicitly stated otherwise; so IMO it would only make sense to enforce 'u' over newly added literals which involve non-ascii symbols as long as py2 is still alive.

Not exactly.

Suppose (mostly correctly) that the code hasn't employed the "unicode sandwich" technique so far. Meaning, much was handled as python2 str objects containing utf-8-encoded strings, and converted to unicode objects mainly as needed/noted/considered. Suppose that x is a variable that used to contain such a str, usually ascii-only, but sometimes perhaps utf-8. Now, this:

'x: {}'.format(x)

would work, replacing {} with the contents of x and returning a python2 str, utf-8-encoded if x is utf-8. But if x now contains a unicode object (because we decided to follow the sandwich approach, and decode all utf-8 during input), it would fail if x is not ascii-only. Adding u to 'x: {}' solves this.

So I also have to handle all existing such literals, at least those that would now need to handle unicode vars.
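The failure described above can be modelled in code that also runs on py3. This is only a sketch: py2_like_str_format is a hypothetical helper that imitates how a py2 byte-string template treats its argument, not an actual py2 API.

```python
def py2_like_str_format(template, value):
    """Hypothetical model of py2 `str.format` with a single '{}' field.

    `template` is a byte string, as an unprefixed py2 literal would be.
    """
    if isinstance(value, bytes):
        rendered = value  # byte str in, bytes copied as-is
    else:
        # py2 implicitly encodes text arguments with the ASCII codec here.
        rendered = value.encode('ascii')
    return template.replace(b'{}', rendered)


x_text = u'\u05d0'                 # x after decoding at input
x_bytes = x_text.encode('utf-8')   # x as an old-style utf-8 byte str

# Old code path: byte template, byte value.
old_result = py2_like_str_format(b'x: {}', x_bytes)

# After the sandwich: byte template, text value.
try:
    py2_like_str_format(b'x: {}', x_text)
    mixed_fails = False
except UnicodeEncodeError:
    mixed_fails = True

# With the u prefix the template is text too, and nothing is encoded.
fixed_result = u'x: {}'.format(x_text)
```

The point of the model is that the template's type, not the value's, decides whether an implicit ASCII encode happens.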

-- Didi

Amit Bawer

2:36 p.m.

On Sun, Sep 1, 2019 at 2:34 PM Yedidyah Bar David <didi@redhat.com> wrote:

Not exactly.

Suppose (mostly correctly) that the code didn't employ the "unicode sandwich" technique so far. Meaning, much was handled as python2 str objects containing utf-8-encoded strings, and converted to unicode objects mainly as needed/noted/considered. Suppose that x is a variable that used to contain such an str, usually ascii-only, but sometimes perhaps utf-8. Now, this:

'x: {}'.format(x)

would work, replacing {} with the contents of x and returning a python2 str, utf-8-encoded if x is utf-8. But if x now contains a unicode object (because we decided to follow the sandwich approach, and decode all utf-8 during input), it would fail if x is not ascii-only. Adding u to 'x: {}' solves this.

utf-8 is an extension of ascii, meaning that the first 128 code points are encoded identically in both, so the unicode sandwich has no negative effect on your example. It would only be a problem if the input for x originally had a non-ascii character in it, but that should have been an issue for py2 in the first place, regardless of py3 sandwiches.

Yedidyah Bar David

7:23 p.m.

On Sun, Sep 1, 2019 at 3:37 PM Amit Bawer <abawer@redhat.com> wrote:

utf-8 is an extension of ascii, meaning that the first 128 code points are encoded identically in both, so the unicode sandwich has no negative effect on your example. It would only be a problem if the input for x originally had a non-ascii character in it, but that should have been an issue for py2 in the first place, regardless of py3 sandwiches.

Let me clarify:

In python2:

If I start with:

x='א'

'{}'.format(x)

Works.

If I then employ the sandwich, and therefore effectively change the code to be:

x=u'א'

'{}'.format(x)

Fails.

To fix, I can change it to:

u'{}'.format(x)

Or import unicode_literals and keep the existing code line(s).

Both work.

In actual code, the assignment to x will/might be in a different module, and/or not contain a literal but user input, but '{}' _will_ be a literal.

Do people have preferences? Can people share their reasoning for their preferences? Do you think we should have policies, or is it up to each git repo, or even each patch author+maintainers/reviewers, to decide?

As discussed in the original [1], both have pros and cons. Personally I prefer "u''". But not strongly, because we try to keep our modules rather small, so it's not like you add a single import line that changes the semantics of hundreds or thousands of lines. Usually, it's rather easy to decide that such an import is ok. Ideally, we'd have full code coverage in our tests, including utf-8 everywhere, but I think we are quite far from that, for now.

Thanks and best regards,
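For the case where x comes from input rather than from a literal, the sandwich can be sketched roughly as follows. The function names and the in-memory streams are hypothetical stand-ins, not otopi or engine-setup code.

```python
import io


def read_name(raw_stream, encoding='utf-8'):
    """Input edge of the sandwich: decode bytes to text exactly once."""
    return raw_stream.readline().decode(encoding).rstrip(u'\n')


def greet(name):
    """Middle of the sandwich: text only, so the template has the u prefix."""
    return u'x: {}'.format(name)


def write_line(raw_stream, text, encoding='utf-8'):
    """Output edge of the sandwich: encode text back to bytes."""
    raw_stream.write((text + u'\n').encode(encoding))


source = io.BytesIO(b'\xd7\x90\n')   # stands in for user input
sink = io.BytesIO()                  # stands in for a log file or terminal
write_line(sink, greet(read_name(source)))
```

Only the two edge functions know about encodings; every literal in between has to be text, which is exactly where the "u" prefix or unicode_literals comes in.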

-- Didi

Amit Bawer

2 Sep

9:25 a.m.

On Sun, Sep 1, 2019 at 8:23 PM Yedidyah Bar David <didi@redhat.com> wrote:

Let me clarify:

Thanks, now I see where I was wrong.

In python2:

If I start with:

x='א'

py2: x is 2 bytes: '\xd7\x90'

py3: x is a unicode str with a single symbol, '\u05d0'

'{}'.format(x)

Works.

py2: both the template and x are byte strings, so the two bytes are copied as-is with no implicit decoding, so it's fine.

py3: both are unicode strings by default, so it's fine.
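A quick check of the two representations of x, as a sketch that runs on either interpreter (the variable names are illustrative only):

```python
x_text = u'\u05d0'                 # the single code point
x_bytes = x_text.encode('utf-8')   # its two-byte UTF-8 encoding

byte_values = list(bytearray(x_bytes))   # integer value of each byte
all_ascii = all(value < 128 for value in byte_values)
```

Both byte values fall outside the ASCII range, which is why any implicit ASCII conversion of this x raises an error.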

If I then employ the sandwich, and therefore effectively change the code to be:

x=u'א'

Now py2 and py3 agree on the contents of x, so sandwiching seems like the right choice to make sure they treat x the same way.

'{}'.format(x)

Fails.

To fix, I can change it to:

u'{}'.format(x)

Seems like a legit option to bridge the gap between the default string types of py2 and py3.

Or, to import unicode_literals and keep the existing code line(s).

Both work.

In actual code, the assignment to x will/might be in a different module, and/or not contain a literal but user input, but '{}' _will_ be a literal.

Do people have preferences? Can people share their reasoning for their preferences? Do you think we should have policies, or it's up to each git repo, or even each patch author+maintainers/reviewers to decide?

As discussed in the original [1], both have pros and cons. Personally I prefer "u''". But not strongly, because we try to keep our modules rather small, so it's not like you add a single import line that changes the semantics of hundreds or thousands of lines. Usually, it's rather easy to decide that such an import is ok. Ideally, we'd have full code coverage in our tests, including utf-8 everywhere, but I think we are quite far from that, for now.

Thanks and best regards,
