python convert unicode code value to string, without '\u'

Question

In the code below,

text = "\u54c8\u54c8\u54c8\u54c8"

Is there a way to convert the Unicode escapes above so that only the code values are kept, with the "\u" prefix removed?

import re

pattern = re.compile('[\u0000-\uFFFF]')

matches = pattern.finditer(text)

for match in matches:
    print(match)

Answer 1

You can use a regular list comprehension to map over the 4 characters in text, use ord to get the ordinal (integer) of each code point, then hex() to convert it to hexadecimal. The [2:] slice is required to strip the 0x prefix that hex() would otherwise add.

>>> text = "\u54c8\u54c8\u54c8\u54c8"
>>> text
'哈哈哈哈'
>>> [hex(ord(c))[2:] for c in text]
['54c8', '54c8', '54c8', '54c8']
>>>

You can then use e.g. "".join() if you need a single string.

(Another way to write the comprehension would be to use an f-string and the x hex format:

>>> [f'{ord(c):x}' for c in text]
['54c8', '54c8', '54c8', '54c8']

)
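
As a minimal sketch of the "".join() step mentioned above, assuming plain concatenation is the desired result. The helper name to_hex_codes and its sep parameter are hypothetical, introduced only for illustration:

```python
def to_hex_codes(text, sep=""):
    """Return the hex code point of each character in text, joined by sep.

    Hypothetical helper, not part of the original answer.
    """
    return sep.join(f"{ord(c):x}" for c in text)


text = "\u54c8\u54c8\u54c8\u54c8"
joined = to_hex_codes(text)           # no separator
spaced = to_hex_codes(text, sep=" ")  # space-separated

print(joined)  # 54c854c854c854c8
print(spaced)  # 54c8 54c8 54c8 54c8
```

Note that the x format does not zero-pad, so code points below U+1000 produce fewer than four digits; a format such as 04x may be preferable if a fixed width matters.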

If you actually have the literal string \u54c8\u54c8\u54c8\u54c8, i.e. "backslash, u, five, four, c, eight" repeated 4 times, you'll need to first decode the backslash escape sequences to get the 4-codepoint string:

>>> import codecs
>>> text = r"\u54c8\u54c8\u54c8\u54c8"
>>> codecs.decode(text, "unicode_escape")
'哈哈哈哈'
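
One caveat with the unicode_escape approach: when the input already contains non-ASCII characters alongside the literal escapes, those characters tend to come out mangled, because the codec treats the underlying bytes as Latin-1. A possible alternative, sketched here under the assumption that only four-digit \uXXXX escapes need handling (no surrogate pairs, no doubled backslashes), is a regular-expression substitution. The names _ESCAPE and decode_u_escapes are hypothetical:

```python
import re

# Matches a literal backslash, "u", and exactly four hex digits.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_u_escapes(s):
    """Replace each literal \\uXXXX escape with its character.

    Hypothetical helper; characters that are not part of an escape
    are left untouched.
    """
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), s)


raw = r"\u54c8\u54c8 means 哈哈"  # literal escapes mixed with real characters
decoded = decode_u_escapes(raw)
print(decoded)  # 哈哈 means 哈哈
```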

Answer 2

You can encode the string: either to UTF-8, or to ASCII while ignoring or replacing the non-ASCII characters:

text = "\u54c8\u54c8\u54c8\u54c8"
utf8string = text.encode("utf-8")
asciistring1 = text.encode("ascii", 'ignore')
asciistring2 = text.encode("ascii", 'replace')

You can refer to https://www.oreilly.com/library/view/python-cookbook/0596001673/ch03s18.html
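
To make the difference between these encodings concrete, here is a small sketch that prints what each call returns for the sample string. The backslashreplace line and the final stripping step are additions that are not part of the original answer; they are shown only as one possible route back to the bare code values:

```python
text = "\u54c8\u54c8\u54c8\u54c8"

utf8_bytes = text.encode("utf-8")               # 3 bytes per character here
ascii_ignored = text.encode("ascii", "ignore")   # non-ASCII characters dropped
ascii_replaced = text.encode("ascii", "replace") # non-ASCII characters become "?"

# Addition, not from the original answer: keep the escapes as ASCII text.
ascii_escaped = text.encode("ascii", "backslashreplace")

# Strip the "\u" prefix from the escaped form to leave only the hex digits.
codes = ascii_escaped.decode("ascii").replace("\\u", "")

print(utf8_bytes)      # b'\xe5\x93\x88\xe5\x93\x88\xe5\x93\x88\xe5\x93\x88'
print(ascii_ignored)   # b''
print(ascii_replaced)  # b'????'
print(ascii_escaped)   # b'\\u54c8\\u54c8\\u54c8\\u54c8'
print(codes)           # 54c854c854c854c8
```

All of the encode calls return bytes rather than str, and the ignore and replace handlers discard the original characters, so only the UTF-8 and backslashreplace results can be turned back into the original text.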

2 answers

solution1
1 ACCEPTED 2021-05-31 12:30:56

solution2
0 2021-05-31 12:52:45
