'utf8' codec can't decode byte 0xd0 in position 0: invalid continuation byte

Question

I have the following text in an HTML document:

<a href="#">�'ам интересна информация</a>

and I'm using the following expression to extract the text:

row.xpath("string(./td[@class='col2 td-tags']/h3/a/text())")

This expression works fine for plain English text, but for the above string it throws this error:

'utf8' codec can't decode byte 0xd0 in position 0: invalid continuation byte

Answer 1

In HTML, &#xxx; does NOT specify a byte in the document encoding; it's ALWAYS a Unicode code point.

Thus, you can't put UTF-8 bytes into HTML like that.
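To illustrate the point, here is a minimal Python 3 sketch. It assumes the reference in question is &#208; (chosen only because 208 is 0xd0, the byte from the error message; the original markup is not shown in the question) and uses the standard library's html.unescape to resolve it.

```python
import html

# "&#208;" is a numeric character reference: it names the code point U+00D0,
# not the raw byte 0xD0 in the document's encoding.
char = html.unescape("&#208;")

# Encoded as UTF-8, that code point takes two bytes, neither of which is 0xD0.
as_utf8 = char.encode("utf-8")

print(repr(char), as_utf8)
```

So writing each UTF-8 byte of a Cyrillic letter as its own &#...; reference produces several unrelated Latin-1 range characters rather than the letter.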

Answer 2

What encoding is the document in? What character starts the text in the <a>? It might be invalid UTF-8.
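One way to check is to try decoding the raw bytes and look at where the decoder gives up. This is only a sketch for Python 3: first_invalid_utf8 is a hypothetical helper, and the sample bytes are made up to mimic the question (a lone 0xd0 followed by an apostrophe).

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, byte value) of the first undecodable byte, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


# Hypothetical sample: 0xd0 starts a two-byte sequence, but the next byte
# (an ASCII apostrophe) is not a valid continuation byte.
sample = b'<a href="#">\xd0\'' + "ам".encode("utf-8") + b"</a>"

print(first_invalid_utf8(sample))
```

If the function returns None for the real page bytes, the document is valid UTF-8 and the problem lies elsewhere (for example, in a wrong declared encoding).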

Answer 3

I first decoded the page contents (which included the string <a href="#"> 'ам интересна информация</a> ), replacing any bytes that could not be decoded with the replacement character (displayed as a question mark), and it worked!

i.e. page_contents_string = page_contents_string.decode("utf-8", "replace")
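A small Python 3 sketch of the same idea, assuming the page contents are available as bytes; the raw value below is invented for illustration and is not the asker's actual page.

```python
# Hypothetical page fragment containing one stray 0xd0 byte before the text.
raw = b'<a href="#">\xd0\'' + "ам интересна информация".encode("utf-8") + b"</a>"

# errors="replace" swaps each undecodable byte for U+FFFD instead of raising.
text = raw.decode("utf-8", "replace")

print(text)
```

Note that this hides the damage rather than repairing it: the character that the bad byte was meant to be part of is lost.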

3 answers

Answer 1
6 2012-08-29 07:59:42

Answer 2
2 2012-08-29 08:11:24

Answer 3
1 2012-08-29 14:03:35
