
Python: convert strings containing unicode code point back into normal characters

I'm working with the requests module to scrape text from a website and store it into a txt file using a method like below:

r = requests.get(url)
with open("file.txt","w") as filename:
        filename.write(r.text)

With this method, if "送分200000" were the only string that requests got from the URL, it would be stored in file.txt in escaped form, like below.

\u9001\u5206200000

When I read the string back from file.txt later on, it doesn't convert back to "送分200000"; it stays as "\u9001\u5206200000" when I try to print it. For example:


with open("file.txt", "r") as filename:
        mystring = filename.readline()
        print(mystring)

Output:
"\u9001\u5206200000"

Is there a way for me to convert this string and others like it back to their original strings with unicode characters?

It's better to use the io module for that and pass the encoding explicitly (in Python 3, io.open is the same function as the built-in open). Try adapting the following code to your problem.

import io
with io.open(filename,'r',encoding='utf8') as f:
    text = f.read()
# process Unicode text
with io.open(filename,'w',encoding='utf8') as f:
    f.write(text)

Taken from https://www.tutorialspoint.com/How-to-read-and-write-unicode-UTF-8-files-in-Python
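Before picking a fix, it may help to check what is actually stored in the file, since a file holding literal backslash escapes needs different handling from one holding real UTF-8 characters. The following is only a sketch: the two byte strings stand in for file contents read in "rb" mode, and has_literal_escapes is a hypothetical helper name, not part of any library.

```python
# Stand-ins for what open("file.txt", "rb").read() might return.
escaped = "\\u9001\\u5206200000".encode("utf-8")  # literal backslash-u text
real = "送分200000".encode("utf-8")               # actual UTF-8 characters


def has_literal_escapes(raw):
    """Return True if the raw bytes contain a literal backslash followed by 'u'."""
    return b"\\u" in raw


print(escaped)  # b'\\u9001\\u5206200000'
print(real)     # b'\xe9\x80\x81\xe5\x88\x86200000'
print(has_literal_escapes(escaped), has_literal_escapes(real))  # True False
```

If the first form is what the file contains, specifying an encoding on open() will not change the result by itself, because the escapes are ordinary ASCII characters.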

"convert this string and others like it back to their original strings with unicode characters?"

Yes. Suppose file.txt contains

\u9001\u5206200000

then

with open("file.txt","rb") as f:
    content = f.read()
text = content.decode("unicode_escape")
print(text)

output

送分200000

If you want to know more, read the Text Encodings section of the built-in codecs module docs.
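One caveat worth noting: the unicode_escape codec decodes any non-escape bytes as Latin-1, so a file that mixes literal \uXXXX sequences with real UTF-8 characters could come out mangled. As an alternative sketch, assuming the text is already a str and only four-digit \uXXXX escapes need converting, a regular expression can rewrite just those sequences. The name unescape_unicode is hypothetical, and surrogate pairs (characters outside the Basic Multilingual Plane written as two escapes) are not handled here.

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape_unicode(text):
    """Replace each literal \\uXXXX sequence with the character it names."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


print(unescape_unicode(r"\u9001\u5206200000"))  # 送分200000
```

Text without any escapes passes through unchanged, so the function can be applied to every line read from the file.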

I am guessing you are using Windows. When you open a file in text mode without specifying an encoding, Python uses the platform's default, which on Windows is typically Windows-1252 (cp1252). Specify the encoding when you open the file:

with open("file.txt","w", encoding="UTF-8") as filename:
        filename.write(r.text)
with open("file.txt", "r", encoding="UTF-8") as filename:
        mystring = filename.readline()
        print(mystring)

That works as you expect regardless of platform.
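A quick way to convince yourself is a round trip through a temporary file. This is a minimal sketch that does not use requests: the string literal stands in for r.text, and the temporary path is an assumption made so the example does not overwrite a real file.txt.

```python
import os
import tempfile

original = "送分200000"  # stand-in for r.text
path = os.path.join(tempfile.mkdtemp(), "file.txt")

# Write and read with the same explicit encoding.
with open(path, "w", encoding="utf-8") as f:
    f.write(original)

with open(path, "r", encoding="utf-8") as f:
    restored = f.readline()

print(restored == original)  # True
```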

