Decoding a UTF-8 string in Python

Question

I'm writing a web crawler in Python, and it involves taking headlines from websites.

One of the headlines should've read: And the Hip's coming, too

But instead it said: And the Hipâ€™s coming, too

What's going wrong here?

Answer 1

It's an encoding error, so if it's a unicode string, this ought to fix it:

text.encode("windows-1252").decode("utf-8")

If it's a plain (byte) string, you'll need an extra step:

text.decode("utf-8").encode("windows-1252").decode("utf-8")

Both of these will give you a unicode string.
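The one-liners in this answer are written for Python 2. In Python 3, where every str is already Unicode, the same round trip might look like the sketch below. The helper name fix_mojibake and the fallback of returning the input unchanged are assumptions made for illustration, not part of the original answer.

```python
# Python 3 sketch: repair text whose UTF-8 bytes were wrongly decoded as windows-1252.

# The mangled headline, written with escapes so the source file's own encoding
# cannot interfere: \u00e2 \u20ac \u2122 are the three characters shown as "â€™".
mangled = "And the Hip\u00e2\u20ac\u2122s coming, too"


def fix_mojibake(text):
    """Hypothetical helper: undo a windows-1252 mis-decoding of UTF-8 text.

    Encoding as windows-1252 recovers the original bytes; decoding those
    bytes as UTF-8 yields the intended characters. If either step fails,
    the text was probably not mangled this way, so return it unchanged.
    """
    try:
        return text.encode("windows-1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


repaired = fix_mojibake(mangled)
print(repaired)
```

Returning the input on failure keeps the helper safe to call on text that cannot be encoded as windows-1252 at all, although it cannot tell correct text from text that was mangled in some other way.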

By the way, to discover how a piece of text like this has been mangled due to encoding issues, you can use chardet:

>>> import chardet
>>> chardet.detect(u"And the Hipâ€™s coming, too")
{'confidence': 0.5, 'encoding': 'windows-1252'}

Answer 2

You need to properly decode the source text. Most likely the source text is in UTF-8 format, not ASCII.
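To make this concrete, here is a small Python 3 sketch of how the garbled headline can arise. The raw variable is a stand-in for bytes fetched from a site (the question does not show the crawler's code), and it assumes the original headline used a right single quotation mark (U+2019) rather than a plain ASCII apostrophe.

```python
# Stand-in for the bytes a crawler would receive from a UTF-8 encoded page.
raw = "And the Hip\u2019s coming, too".encode("utf-8")

# Decoding with the wrong codec turns the 3-byte UTF-8 sequence for U+2019
# (0xE2 0x80 0x99) into three separate windows-1252 characters.
wrong = raw.decode("windows-1252")

# Decoding with the codec the page actually used gives the intended text.
right = raw.decode("utf-8")

print(wrong)
print(right)
```

In a real crawler the correct codec would normally come from the response's Content-Type header or the page's meta charset declaration rather than being hard-coded.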

Because you do not provide any context or code for your question, it is not possible to give a direct answer.

I suggest you study how Unicode and character encodings are handled in Python:

http://docs.python.org/2/howto/unicode.html

Answer 1
29 2012-10-28 16:36:30

Answer 2
11 Accepted 2012-10-28 16:26:34