简体   繁体   中英

UnicodeDecodeError when extracting comments from a web page using lxml and xpath

Hi I am trying to extract comments on a web page using lxml and xpath. Here is my code:

pg = requests.get('https://www.makeupalley.com/product/showreview.asp/ItemId=164662/Sublime-Skin-BB-Cream-6-in-1/Yves-Rocher/BB-Cream', timeout=30)
tr_pg = html.fromstring(pg.content)

cm_pg = tr_pg.xpath('//p[@class="break-word"]/text()')
for cm in cm_pg:
    print cm

I got this error

Traceback (most recent call last):
  File "/Users/ghozan/PycharmProjects/MakeupAlley/main.py", line 22, in <module>
    process_page('/product/showreview.asp/ItemId=164662/Sublime-Skin-BB-Cream-6-in-1/Yves-Rocher/BB-Cream')
  File "/Users/ghozan/PycharmProjects/MakeupAlley/main.py", line 10, in process_page
    cm_pg = tr_pg.xpath('//p[@class="break-word"]/text()')
  File "src/lxml/lxml.etree.pyx", line 1587, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:57884)
  File "src/lxml/xpath.pxi", line 307, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:166905)
  File "src/lxml/xpath.pxi", line 230, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:165893)
  File "src/lxml/extensions.pxi", line 623, in lxml.etree._unwrapXPathObject (src/lxml/lxml.etree.c:160088)
  File "src/lxml/extensions.pxi", line 657, in lxml.etree._createNodeSetResult (src/lxml/lxml.etree.c:160529)
  File "src/lxml/extensions.pxi", line 678, in lxml.etree._unpackNodeSetEntry (src/lxml/lxml.etree.c:160740)
  File "src/lxml/extensions.pxi", line 804, in lxml.etree._buildElementStringResult (src/lxml/lxml.etree.c:162214)
  File "src/lxml/apihelpers.pxi", line 1417, in lxml.etree.funicode (src/lxml/lxml.etree.c:29944)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 615: invalid continuation byte

I know that there is an invalid character in the comments. How do I solve this?

Can you ask Requests to attempt to decode it for you? Use response.text (a string) rather than response.content (bytes).

The encoding of the source is probably something other than UTF-8, which your XPath library might be assuming. response.encoding is Requests best guess at what it is. Sometimes web servers/pages aren't configured to explicitly say what encoding they're using then all you can do is guess.

Doesn't help that encoding can be specified in an HTTP header and/or in a <meta> tag. Or websites can lie. Or they might mixing encodings. Note that your target website can't even validate because the encoding is wrong, and even with that it's rife with errors.

The page have badly encoded characters.
Ex:

Voil�! You will now have an airbrushed look.[...](� la Cover Girl!)

You can avoid them by manually decoding:

>>> pg.content.decode('utf8', errors='ignore')
u'Voil! You will now have an airbrushed look.[...]( la Cover Girl!)'

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM