How to convert UTF-8 code to symbol characters in Python

Question

I crawled some webpages using Python's urllib.request API and saved the lines I read into a new file.

        f = open(docId + ".html", "w+")
        with urllib.request.urlopen('http://stackoverflow.com') as u:
              s = u.read()
              f.write(str(s))

But when I open the saved files, I see many strings such as \xe2\x86\x90 where the original page had an arrow symbol. It seems to be the UTF-8 code of the symbol, but how do I convert the code back to the symbol?

Answer 1

Your code is broken: u.read() returns a bytes object. str(bytes_object) returns a string representation of the object (what the bytes literal would look like), which is not what you want here:

>>> str(b'\xe2\x86\x90')
"b'\\xe2\\x86\\x90'"
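The difference between str() and decode() can be sketched as below. The last step also shows one possible way to repair text that was already saved through str(); it assumes the saved text is exactly a valid bytes literal, and the variable names are only illustrative.

```python
import ast

raw = b'\xe2\x86\x90'            # UTF-8 encoding of the left arrow (U+2190)

# decode() interprets the bytes as UTF-8 text
arrow = raw.decode('utf-8')

# str() only produces the repr of the bytes object, backslashes included
broken = str(raw)

# Repairing text that was saved via str(bytes): parse the literal back
# into bytes, then decode it (assumes the text is a valid bytes literal).
recovered = ast.literal_eval(broken).decode('utf-8')

print(arrow == recovered)
```

Re-downloading the page and saving it correctly is usually simpler than repairing the file afterwards.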

Either save the bytes to disk as is:

import urllib.request

urllib.request.urlretrieve('http://stackoverflow.com', 'so.html')

or open the file in binary mode ('wb') and save it manually:

import shutil
from urllib.request import urlopen

with urlopen('http://stackoverflow.com') as u, open('so.html', 'wb') as file:
    shutil.copyfileobj(u, file)

or convert bytes to Unicode and save them to disk using any encoding you like:

import io
import shutil
from urllib.request import urlopen

with urlopen('http://stackoverflow.com') as u, \
     open('so.html', 'w', encoding='utf-8', newline='') as file, \
     io.TextIOWrapper(u, encoding=u.headers.get_content_charset('utf-8'), newline='') as t:
    shutil.copyfileobj(t, file)
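If the page is small enough to hold in memory, the same idea can be written without the stream wrappers. This is only a sketch: save_as_utf8 is a hypothetical helper name, and falling back to UTF-8 when no charset is declared is an assumption.

```python
from typing import Optional


def save_as_utf8(raw: bytes, path: str, charset: Optional[str] = None) -> str:
    """Decode raw bytes with the given charset and write them as UTF-8 text."""
    # Assumption: fall back to UTF-8 when the server declared no charset
    text = raw.decode(charset or 'utf-8')
    # newline='' keeps the page's own line endings untouched
    with open(path, 'w', encoding='utf-8', newline='') as file:
        file.write(text)
    return text
```

With the response object from the snippet above, it could be called as save_as_utf8(u.read(), 'so.html', u.headers.get_content_charset()).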

Answer 2

Try (Python 2, since it uses urllib2):

import urllib2, io

with io.open("test.html", "w", encoding='utf8') as fout:
    s = urllib2.urlopen('http://stackoverflow.com').read()
    s = s.decode('utf8', 'ignore') # or s.decode('utf8', 'replace')
    fout.write(s)

See https://docs.python.org/2/howto/unicode.html
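The two error handlers mentioned in the comment behave differently on invalid input. A small sketch in Python 3, using a made-up byte string that ends in a byte that is not valid UTF-8:

```python
data = b'\xe2\x86\x90 ok \xff'   # valid arrow bytes followed by an invalid byte

ignored = data.decode('utf-8', 'ignore')    # invalid bytes are silently dropped
replaced = data.decode('utf-8', 'replace')  # invalid bytes become U+FFFD

try:
    data.decode('utf-8')                    # the default 'strict' handler raises
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

'ignore' hides data loss, so 'replace' or the default strict mode may be the safer choice when you need to notice a wrong charset.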

2 answers

Answer 1: 2 votes, accepted, 2015-01-23 14:04:08

Answer 2: 1 vote, 2015-01-23 07:38:14
