
Python - double-escaped 4-digit Unicode escape sequences

I downloaded the source code of a website. After downloading it and converting it into a string, many of the characters (such as single quotes ('), double quotes ("), angle brackets (<, >), and forward slashes (/)) are now double-escaped.

Example:

s = '\\u2018this \\/ that\\u2019'

The text as shown on the website, and how I want it to appear when printed, is:

‘this / that’

My first instinct was to use a regex to find every instance of 2 backslashes and replace it with a single backslash, then use str.encode('utf-8').decode('utf-8') to convert the 4-digit escaped Unicode characters into their actual characters:

import re
text = '\\u2018this \\/ that\\u2019'
pattern = r'(\\)\\\1'
double_escapes_removed = re.sub(pattern, '', text)
final_text = text.encode('utf-8').decode('utf-8')

print(final_text) should print ‘this / that’, but the string appears to be completely unaltered: \\u2018this \\/ that\\u2019.

I tested the pattern on its own with re.findall(pattern, text), and it successfully found the 3 instances of double backslashes. Beyond that, I have no idea what is going wrong.

This turns out to be a bit difficult. A big part of the issue is that although '\\u2018' is 6 characters, '\u2018' is a representation of a single character, so you can't just replace '\\u' with '\u' and have it work.
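To make the length difference concrete, here is a small sketch; the variable names escaped and actual are chosen only for illustration:

```python
# Six separate characters: a backslash, 'u', '2', '0', '1', '8'.
escaped = '\\u2018'
# One character: the parser resolves the escape while reading the literal.
actual = '\u2018'

print(len(escaped))  # 6
print(len(actual))   # 1

# A bare '\u' is not a valid string literal on its own (Python rejects it as
# a truncated escape), so str.replace('\\u', '\u') cannot even be written.
```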

This gets you most of the way there without having to manually iterate over escapes with regex:

>>> s.encode('ascii').decode('unicode-escape')
<<< '‘this \\/ that’'

Python 3 does output a warning about '\\/' being an invalid escape sequence, so you'd probably want to take care of those first, for example by replacing '\\/' with '/' before decoding.
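One way to handle the escaped slashes before decoding might look like the sketch below. The helper name unescape is hypothetical, and it assumes the input is pure ASCII and that '\\/' is the only escape the unicode-escape codec does not recognize:

```python
def unescape(s: str) -> str:
    # Replace the escaped forward slash first, because backslash + '/' is
    # not an escape sequence that the unicode-escape codec understands.
    s = s.replace('\\/', '/')
    # Assumption: s is pure ASCII. Non-ASCII input would raise
    # UnicodeEncodeError here and needs different handling.
    return s.encode('ascii').decode('unicode-escape')


s = '\\u2018this \\/ that\\u2019'
print(unescape(s))  # ‘this / that’
```

If the downloaded page can contain characters outside ASCII, the encode step would need to change, so treat this as a starting point rather than a general solution.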
