'utf8' codec can't decode byte 0xd0 in position 0: invalid continuation byte

Question

I have the following text in an HTML document:

<a href="#">�'ам интересна информация</a>

and I'm using the following expression for extracting the text:

row.xpath("string(./td[@class='col2 td-tags']/h3/a/text())")

This expression works fine for plain English text, but for the above string it throws this error:

'utf8' codec can't decode byte 0xd0 in position 0: invalid continuation byte

Answer 1

In HTML, a numeric character reference such as &#xxx; does NOT specify a byte in the document encoding; it is ALWAYS a Unicode code point.

Thus, you can't put UTF-8 bytes into HTML one reference per byte like that.
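
To illustrate the point, here is a minimal Python 3 sketch using only the standard library's html.unescape. The letter U+0412 is chosen purely as an example (an assumption; the thread does not say which character was intended), and the variable names are made up for illustration.

```python
import html

# U+0412 (CYRILLIC CAPITAL LETTER VE) takes two bytes in UTF-8.
utf8_bytes = "\u0412".encode("utf-8")  # b'\xd0\x92'

# Correct: a single reference that carries the code point itself.
by_code_point = html.unescape("&#x412;")

# Incorrect: the first UTF-8 byte written as its own reference.
# It is read as code point U+00D0, not as the byte 0xD0.
by_byte = html.unescape("&#xd0;")

print(utf8_bytes, repr(by_code_point), repr(by_byte))
```

The second form never reassembles into the intended letter, because each reference is resolved to a character on its own.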

Answer 2

What encoding is the document in? What character starts the text in the <a>? It might be invalid UTF-8.
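
One way to check is to decode the raw bytes strictly and see where decoding fails. This is a Python 3 sketch; the byte string below is a hypothetical stand-in for the start of the link text, not data taken from the actual page.

```python
# Hypothetical bytes: a UTF-8 lead byte (0xD0) followed by an ASCII
# apostrophe instead of the continuation byte it requires.
raw = b"\xd0'"

try:
    raw.decode("utf-8")
    failure = None
except UnicodeDecodeError as exc:
    # exc.start is the offset of the offending byte,
    # exc.reason is the decoder's explanation.
    failure = (exc.start, exc.reason)

print(failure)
```

If the offset points at the first byte of the link text, the page itself contains a broken byte sequence rather than valid UTF-8.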

Answer 3

I first decoded the page contents (which included the string <a href="#"> 'ам интересна информация</a> ) with the "replace" error handler, so that any bytes that can't be decoded are substituted with the replacement character (U+FFFD), and it worked!

i.e. page_contents_string = page_contents_string.decode("utf-8", "replace")
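
The line above is Python 2 style, where the page contents are a byte string. A rough Python 3 equivalent might look like the sketch below; page_bytes is a hypothetical sample, not the real page.

```python
# Hypothetical sample: a stray 0xD0 lead byte, an apostrophe, then two
# valid UTF-8 encoded Cyrillic letters inside the link.
page_bytes = b"<a href=\"#\">\xd0'\xd0\xb0\xd0\xbc</a>"

# errors="replace" swaps each undecodable sequence for U+FFFD
# instead of raising UnicodeDecodeError.
page_text = page_bytes.decode("utf-8", errors="replace")

print(page_text)
```

Note that this hides the damage rather than repairing it: the original character behind the broken byte is lost.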
