I downloaded the source code of a website. After downloading it and converting it into a string, many of the characters (like single quotes ('), double quotes ("), angle brackets (<, >), and forward slashes (/)) are now double-escaped.
Example:
s = '\\u2018this \\/ that\\u2019'
The text as it appears on the website, and how I want it to appear when printed out, is:
this / that
My first instinct was to use regex to find all instances of two backslashes and replace each with a single backslash, then use str.encode('utf-8').decode('utf-8') to convert the 4-digit escaped Unicode characters into their actual characters:
import re
text = '\\u2018this \\/ that\\u2019'
pattern = r'(\\)\\\1'
double_escapes_removed = re.sub(pattern, '', text)
final_text = text.encode('utf-8').decode('utf-8')
print(final_text)
This should return this / that, but the returned string appears to be completely unaltered: \\u2018this \\/ that\\u2019.
I tested the pattern individually with re.findall(pattern, text), and it successfully found the 3 instances of double backslashes. Beyond that, I have no idea what is going wrong.
This turns out to be a bit difficult. A big part of the issue is that although '\\u2018' is 6 characters, '\u2018' is a representation of a single character, so you can't just replace '\\u' with '\u' and have it work.
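To make the distinction concrete, here is a small sketch (the variable names escaped and decoded are just for illustration). It shows that the escaped form is ordinary text made of six characters, while the real escape is one character, and that a plain backslash replacement leaves the escaped text untouched:

```python
escaped = '\\u2018'   # six characters: backslash, u, 2, 0, 1, 8
decoded = '\u2018'    # one character: LEFT SINGLE QUOTATION MARK

print(len(escaped))   # 6
print(len(decoded))   # 1

# The string holds only ONE backslash, so there is no pair of backslashes
# to collapse; this replacement is a no-op.
collapsed = escaped.replace('\\\\', '\\')
print(collapsed == escaped)   # True
```

Escape sequences are only interpreted when Python parses a string literal or when a decoder runs over the data, which is why a text substitution alone cannot produce the single character.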
This gets you most of the way there without having to manually iterate over escapes with regex:
>>> s.encode('ascii').decode('unicode-escape')
<<< '‘this \\/ that’'
Python 3 does emit a warning about '\\/' being an invalid escape sequence, so you'd probably want to take care of those first.
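One way to take care of the '\\/' sequences first is to replace them with a plain slash before decoding. The following is only a sketch under some assumptions: unescape is a hypothetical helper name, the input is assumed to contain only ASCII characters (otherwise encode('ascii') raises UnicodeEncodeError), and '\\/' is assumed to be the only escape that the unicode-escape codec does not recognise.

```python
def unescape(text):
    """Hypothetical helper: decode \\uXXXX escapes after fixing escaped slashes."""
    # Step 1: turn the two-character sequence backslash + slash into "/",
    # so the decoder never sees the invalid escape.
    text = text.replace('\\/', '/')
    # Step 2: let the unicode-escape codec interpret the remaining escapes.
    # Assumes the text is pure ASCII at this point.
    return text.encode('ascii').decode('unicode-escape')


s = '\\u2018this \\/ that\\u2019'
print(unescape(s))
```

With the sample string from the question, this prints the text with real curly quotes and a plain slash. Any other unrecognised backslash sequences in the real page source would still trigger the warning and would need their own replacement step.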