I have the following text in an HTML document:
<a href="#">�'ам интересна информация</a>
and I'm using the following expression for extracting the text:
row.xpath("string(./td[@class='col2 td-tags']/h3/a/text())")
This expression works fine for plain English text, but for the above string it throws this error:
'utf8' codec can't decode byte 0xd0 in position 0: invalid continuation byte
In HTML, a numeric character reference such as &#xxx; does NOT specify a byte in the document encoding; it is ALWAYS a Unicode code point.
Thus, you can't put UTF-8 bytes into HTML like that.
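To illustrate the point, here is a minimal Python 3 sketch using only the standard library. The reference &#208; is an assumed example, chosen because 208 is 0xD0, the byte named in the error; the original markup may have used something different.

```python
import html

# A numeric character reference names a Unicode code point, not a byte.
# "&#208;" is an assumed example (208 == 0xD0, the byte in the error message).
ref = html.unescape("&#208;")
ref_bytes = ref.encode("utf-8")   # code point U+00D0 takes two bytes in UTF-8

# The Cyrillic capital letter Ve (U+0412) is what the UTF-8 byte pair
# 0xD0 0x92 actually encodes.
cyr = "\u0412"
cyr_bytes = cyr.encode("utf-8")

print(repr(ref), ref_bytes)
print(repr(cyr), cyr_bytes)
```

So writing one reference per UTF-8 byte does not reproduce the intended character; each reference is decoded as a separate code point.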
What encoding is the document in? What character starts the text in the <a>? It might be invalid UTF-8.
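One way to check is to try a strict decode of the raw bytes and see where it fails. This is a rough sketch in Python 3; the helper name first_invalid_utf8 and the sample bytes are made up for illustration.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, reason) for the first undecodable byte, or None if valid."""
    try:
        data.decode("utf-8")  # strict decoding raises on malformed input
    except UnicodeDecodeError as exc:
        return exc.start, exc.reason
    return None

# Hypothetical sample: lead byte 0xD0 followed by an ASCII apostrophe
# instead of a continuation byte.
bad = b"\xd0'\xd0\xb0\xd0\xbc"
good = "\u0412\u0430\u043c".encode("utf-8")

print(first_invalid_utf8(bad))
print(first_invalid_utf8(good))
```

If the helper reports offset 0 for the link text, the first byte of the text is the problem, which matches the error in the question.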
I first decoded the page contents (which included the string <a href="#"> 'ам интересна информация</a>) with the "replace" error handler, so that any bytes that cannot be decoded are replaced with the Unicode replacement character (U+FFFD, often displayed as a question mark), and it worked!
i.e. page_contents_string = page_contents_string.decode("utf-8", "replace")
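For reference, the same idea in Python 3 might look like the sketch below. The bytes in raw are a made-up stand-in for the downloaded page, not the actual document.

```python
# Hypothetical page fragment: 0xD0 followed by an apostrophe is not valid UTF-8.
raw = b"<a href=\"#\">\xd0'\xd0\xb0\xd0\xbc</a>"

# errors="replace" substitutes U+FFFD for each undecodable byte instead of raising.
text = raw.decode("utf-8", errors="replace")

print(text)
print("\ufffd" in text)
```

The resulting string can then be passed to the HTML parser. Note that the damaged character is lost rather than recovered, so this is a workaround and not a repair of the original text.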