
Python - converting a non-English UTF-8-encoded string into a list of characters

I have a UTF-8 encoded string containing both English and non-English characters. I am trying to convert this string to a list of single characters. When I just use list(), some of the non-English letters are split in the middle. For example:

In [200]: s = "abאב"

In [201]: print s
abאב

In [202]: l = list(s)

In [203]: print l
['a', 'b', '\xd7', '\x90', '\xd7', '\x91']

In [204]: print l[2]
�

In [205]: print l[2]+l[3]
א

l[2] prints gibberish because the UTF-8 encoding of the letter א is the two bytes \xd7\x90, not \xd7 alone. How can I split the string correctly?

Thanks.

I assume you are running Python 2.7.

If you work with UTF-8 a lot, you should consider switching to Python 3, where strings are Unicode by default and list() works as you would expect:

>>> s = "abאב"
>>> l = list(s)
>>> print(l)
['a', 'b', 'א', 'ב']
>>> print(l[2])
א
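
To illustrate why Python 3 behaves this way, here is a minimal sketch (the variable names are only illustrative). A Python 3 str is a sequence of Unicode code points, and the two-byte sequences from the question only appear once the string is explicitly encoded to bytes:

```python
s = "abאב"                     # Python 3 str: a sequence of code points
chars = list(s)                # one list element per character
encoded = s.encode("utf-8")    # explicit conversion to a bytes object

print(chars)                   # ['a', 'b', 'א', 'ב']
print(len(s), len(encoded))    # 4 6  (each Hebrew letter takes two bytes)
print(encoded)                 # b'ab\xd7\x90\xd7\x91'
```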

I assume you are using Python 2. Decode the byte string into a unicode object before splitting it:

>>> list(s.decode('utf8'))
[u'a', u'b', u'\u05d0', u'\u05d1']
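
If the same code has to cope with both raw bytes and already-decoded text, the decode step can be wrapped in a small helper. This is only a sketch: to_chars is a hypothetical name, not a standard function, and it assumes the input really is valid UTF-8 (or whatever encoding is passed in).

```python
def to_chars(value, encoding="utf-8"):
    """Return a list of characters, decoding first if given raw bytes."""
    # In Python 2, bytes is an alias of str, so this check covers both versions.
    if isinstance(value, bytes):
        value = value.decode(encoding)
    return list(value)

raw = b"ab\xd7\x90\xd7\x91"    # the UTF-8 bytes of "abאב" from the question
chars = to_chars(raw)

print(len(raw), len(chars))    # 6 4
```

Text that is already decoded passes through unchanged, so the helper can be called on either kind of input.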
