Python error: 'utf8' codec can't decode byte 0x92 in position 85: invalid start byte
I am using Python 2.7 and lxml. My code is below:
    import urllib
    from lxml import html

    def get_value(el):
        # get_text is not defined anywhere in the question
        return get_text(el, 'value') or el.text_content()

    # read() returns the raw bytes of the page (a byte string in Python 2)
    response = urllib.urlopen('http://www.edmunds.com/dealerships/Texas/Frisco/DavidMcDavidHondaofFrisco/fullsales-504210667.html').read()
    dom = html.fromstring(response)
    try:
        description = get_value(dom.xpath("//div[@class='description item vcard']")[0].xpath(".//p[@class='sales-review-paragraph loose-spacing']")[0])
    except IndexError, e:
        # only an empty xpath result is handled here
        description = ''
The code crashes inside the try block with this error:
UnicodeDecodeError at /
'utf8' codec can't decode byte 0x92 in position 85: invalid start byte
The string that could not be encoded/decoded was: ouldn t be
I have tried many techniques, including .encode('utf8'), but none of them solves the problem. I have two questions:
The page is being served up with charset=ISO-8859-1. Decode from that to unicode.
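As a minimal sketch of that advice, the snippet below decodes a made-up byte string (not the actual page content) that contains the 0x92 byte from the traceback. One caveat: in strict ISO-8859-1, 0x92 is an unprintable control character, while in Windows-1252 it is the right single quotation mark. Pages labelled ISO-8859-1 often carry Windows-1252 bytes, so `cp1252` may be the codec that yields the intended apostrophe. Whether that holds for this page is an assumption.

```python
# Hypothetical sample bytes; the real page content is not reproduced here.
raw = b"It couldn\x92t be better"

# Decoding as UTF-8 fails: 0x92 is a continuation byte, not a valid start byte.
utf8_error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    utf8_error = exc

# ISO-8859-1 maps every byte to a code point, so this never raises,
# but 0x92 becomes the control character U+0092.
text_latin1 = raw.decode("iso-8859-1")

# Windows-1252 maps 0x92 to U+2019, the right single quotation mark.
text_cp1252 = raw.decode("cp1252")
```

Applied to the question's code, the same `.decode(...)` call would go on `response` before it is passed to `html.fromstring`.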
Your except clause only handles exceptions of the IndexError type. The problem was a UnicodeDecodeError, which is not an IndexError, so the exception is not handled by that except clause.
It's also not clear what 'get_value' does, and that may well be where the actual problem is arising.
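To illustrate the point about the except clause, here is a small sketch. The function names and the `chunks` argument are hypothetical and stand in for the xpath lookup in the question. The first version mirrors the original handler and lets the decode error escape. The second lists both exception types.

```python
def first_text_index_only(chunks):
    """Mirror of the question's handler: only IndexError is caught."""
    try:
        return chunks[0].decode("utf-8")
    except IndexError:
        return u""


def first_text(chunks):
    """Catch both the empty-list case and the bad-bytes case."""
    try:
        return chunks[0].decode("utf-8")
    except (IndexError, UnicodeDecodeError):
        # IndexError: the list was empty.
        # UnicodeDecodeError: the bytes were not valid UTF-8.
        return u""
```

Calling `first_text_index_only([b"ouldn\x92t be"])` still raises UnicodeDecodeError, while `first_text` returns an empty string for the same input.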
Decode the response to unicode, handling decode errors (for example, by ignoring undecodable bytes), before parsing with html.fromstring.
Catch the UnicodeDecodeError, or all errors.
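The "ignore on error" suggestion can be sketched with the `errors` argument of `decode`. The helper name `to_unicode` and the sample bytes are assumptions made for illustration. Note that `"ignore"` silently drops the offending byte, so the apostrophe is lost, while `"replace"` leaves a visible U+FFFD marker in its place.

```python
def to_unicode(raw, encoding="utf-8", errors="ignore"):
    """Decode bytes to unicode without raising on undecodable bytes."""
    return raw.decode(encoding, errors)


# Hypothetical sample containing the byte from the traceback.
sample = b"ouldn\x92t be"

dropped = to_unicode(sample)                     # bad byte removed
replaced = to_unicode(sample, errors="replace")  # bad byte becomes U+FFFD
```

Here `dropped` is `u"ouldnt be"` and `replaced` is `u"ouldn\ufffdt be"`. Either result is a unicode string that could then be handed to the parser. Decoding with the page's declared charset is likely the better choice when it is known, since nothing is thrown away.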