UnicodeDecodeError when extracting comments from a web page using lxml and xpath

Question

Hi, I am trying to extract comments from a web page using lxml and XPath. Here is my code:

import requests
from lxml import html

pg = requests.get('https://www.makeupalley.com/product/showreview.asp/ItemId=164662/Sublime-Skin-BB-Cream-6-in-1/Yves-Rocher/BB-Cream', timeout=30)
tr_pg = html.fromstring(pg.content)

cm_pg = tr_pg.xpath('//p[@class="break-word"]/text()')
for cm in cm_pg:
    print cm

I get this error:

Traceback (most recent call last):
  File "/Users/ghozan/PycharmProjects/MakeupAlley/main.py", line 22, in <module>
    process_page('/product/showreview.asp/ItemId=164662/Sublime-Skin-BB-Cream-6-in-1/Yves-Rocher/BB-Cream')
  File "/Users/ghozan/PycharmProjects/MakeupAlley/main.py", line 10, in process_page
    cm_pg = tr_pg.xpath('//p[@class="break-word"]/text()')
  File "src/lxml/lxml.etree.pyx", line 1587, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:57884)
  File "src/lxml/xpath.pxi", line 307, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:166905)
  File "src/lxml/xpath.pxi", line 230, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:165893)
  File "src/lxml/extensions.pxi", line 623, in lxml.etree._unwrapXPathObject (src/lxml/lxml.etree.c:160088)
  File "src/lxml/extensions.pxi", line 657, in lxml.etree._createNodeSetResult (src/lxml/lxml.etree.c:160529)
  File "src/lxml/extensions.pxi", line 678, in lxml.etree._unpackNodeSetEntry (src/lxml/lxml.etree.c:160740)
  File "src/lxml/extensions.pxi", line 804, in lxml.etree._buildElementStringResult (src/lxml/lxml.etree.c:162214)
  File "src/lxml/apihelpers.pxi", line 1417, in lxml.etree.funicode (src/lxml/lxml.etree.c:29944)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 615: invalid continuation byte

I know that there is an invalid character in the comments. How do I solve this?

Answer 1

Can you ask Requests to attempt to decode it for you? Use response.text (a string) rather than response.content (bytes).

The encoding of the source is probably something other than UTF-8, which your XPath library might be assuming. response.encoding is Requests' best guess at what it is. Sometimes web servers or pages aren't configured to say explicitly which encoding they use, and then all you can do is guess.

It doesn't help that the encoding can be specified in an HTTP header, in a <meta> tag, or both. Websites can also lie about it, or mix encodings. Note that your target website can't even be validated because its encoding is wrong, and even apart from that it's rife with errors.
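As a rough sketch of that suggestion (not tested against the site in question), the snippet below hands lxml the already-decoded response.text instead of the raw bytes. It is written for Python 3, and the helper names extract_comments and fetch_comments are made up for illustration. Printing response.encoding next to response.apparent_encoding shows what Requests took from the headers versus what it guesses from the bytes.

```python
import requests
from lxml import html


def extract_comments(markup):
    """Return the text nodes of every <p class="break-word"> in decoded HTML."""
    tree = html.fromstring(markup)
    return tree.xpath('//p[@class="break-word"]/text()')


def fetch_comments(url):
    """Download a page and parse the text Requests decoded (hypothetical helper)."""
    response = requests.get(url, timeout=30)
    # response.encoding is what Requests inferred from the HTTP headers;
    # response.apparent_encoding is a guess made from the bytes themselves.
    print(response.encoding, response.apparent_encoding)
    # response.text is already a decoded string, so lxml no longer has to
    # interpret the raw bytes itself.
    return extract_comments(response.text)
```

If the two printed encodings disagree, assigning response.encoding before reading response.text makes Requests decode with the value you picked.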

Answer 2

The page has badly encoded characters.
For example:

Voil�! You will now have an airbrushed look.[...](� la Cover Girl!)

You can drop them by decoding manually and ignoring the invalid bytes:

>>> pg.content.decode('utf8', errors='ignore')
u'Voil! You will now have an airbrushed look.[...]( la Cover Girl!)'
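To make the trade-off concrete, here is a small Python 3 sketch comparing the error handlers on a hand-built byte string. The sample bytes are an assumption: they are reconstructed from the 0xe0 byte in the traceback, which is "à" in Latin-1, so the page may really be Latin-1 or Windows-1252, but that is a guess.

```python
# Hand-built sample: 0xe0 is "à" in Latin-1 but is not valid UTF-8 here.
raw = b"Voil\xe0! You will now have an airbrushed look."

# Strict UTF-8 decoding fails on the lone 0xe0 byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

ignored = raw.decode("utf-8", errors="ignore")    # silently drops the byte
replaced = raw.decode("utf-8", errors="replace")  # inserts U+FFFD instead
latin1 = raw.decode("latin-1")                    # maps 0xe0 to "à"

print(strict_ok)
print(ignored)
print(replaced)
print(latin1)
```

errors='ignore' loses the character entirely, while errors='replace' at least leaves a marker where something was dropped. If the page really is Latin-1, decoding with that codec keeps the original text.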
