How to convert UTF-8 notation to python unicode notation

Using Python 3.8, I would like to convert Unicode code point notation to Python escape notation:

s = 'U+00A0'
result = s.lower() # output  'u+00a0'

I want to replace u+ with \u:

result = s.lower().replace('u+','\u') 

But I get the error:

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape

How can I convert the notation U+00A0 to \u00a0?

EDIT:

The reason I wanted to get \u00a0 is to then use the encode method to get b'\xc2\xa0'.

My question: given a string in the notation U+00A0, I would like to convert it to the bytes b'\xc2\xa0'.

You are struggling with the representation of something versus its value.

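import re
# s is the lower-cased string from the question, e.g. 'u+00a0'
# Raw string so that \+ reaches the regex engine unchanged
re.sub(r"u\+([0-9a-f]{4})", lambda m: chr(int(m.group(1), 16)), s)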

For u+00a0 this returns a one-character string whose repr is '\xa0'.

The same happens with the literal \u00a0:

s = "\u00a0"
print(repr(s))

Once you have the proper value as a Unicode string, you can encode it to UTF-8:

s = "\xa0"
print(s.encode('utf8'))
# b'\xc2\xa0'

So the final answer is:

import re
s = "u+00a0"
# Raw string so that \+ reaches the regex engine unchanged
s2 = re.sub(r"u\+([0-9a-f]{4})", lambda m: chr(int(m.group(1), 16)), s)
s_bytes = s2.encode('utf8')  # b'\xc2\xa0'
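
The pattern above only matches lower-case input with exactly four hex digits. A minimal sketch of a more tolerant variant follows; the helper name codepoint_to_utf8 and the 4 to 6 digit range are assumptions made for illustration, not part of the answer. Because the match is greedy, a hex letter directly after the code point would be absorbed into it.

```python
import re

# Hypothetical helper; the name and the 4-6 hex digit range are assumptions.
_CODEPOINT = re.compile(r"u\+([0-9a-f]{4,6})", re.IGNORECASE)

def codepoint_to_utf8(text: str) -> bytes:
    """Replace every U+XXXX token in text with its character, then encode as UTF-8."""
    decoded = _CODEPOINT.sub(lambda m: chr(int(m.group(1), 16)), text)
    return decoded.encode("utf-8")

print(codepoint_to_utf8("U+00A0"))   # b'\xc2\xa0'
print(codepoint_to_utf8("U+1F600"))  # b'\xf0\x9f\x98\x80'
```

Values above U+10FFFF are not valid code points, so chr raises ValueError for them.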

You can also use this:

>>> s = 'U+00A0'
>>> s = s.replace('U+', '\\u').encode().decode('unicode_escape').encode()
>>> s
b'\xc2\xa0'
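
If the string holds a single code point, the prefix can also be sliced off directly, which avoids the unicode_escape round trip (where \u accepts exactly four hex digits). A minimal sketch, assuming the input always starts with a two-character U+ or u+ prefix:

```python
s = 'U+00A0'

# Drop the two-character "U+" prefix, parse the rest as hexadecimal,
# turn the code point into a character and encode it as UTF-8.
char = chr(int(s[2:], 16))
s_bytes = char.encode('utf-8')

print(repr(char))  # '\xa0'
print(s_bytes)     # b'\xc2\xa0'
```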

You need to escape the \ in replace with a second \:

result = s.lower().replace('u+','\\u') 
print(result)

will give you

\u00a0
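
Note that the result of this replace is the six-character text of an escape sequence, not the single non-breaking space character, so encoding it does not yet produce b'\xc2\xa0'. A short check that illustrates the difference (the variable names escaped and actual are chosen here for illustration):

```python
s = 'U+00A0'

escaped = s.lower().replace('u+', '\\u')  # text of an escape sequence
actual = '\u00a0'                         # the character itself

print(len(escaped), len(actual))  # 6 1
print(escaped.encode('utf-8'))    # b'\\u00a0'
print(actual.encode('utf-8'))     # b'\xc2\xa0'
```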
