[英]HTMLParser or urllib2 unicode issue
我正在嘗試使用HTMLParser和urllib2來獲取圖像文件
content = urllib2.urlopen( imgurl.encode('utf-8') ).read()
try:
p = MyHTMLParser( )
p.feed( content )
p.download_file( )
p.close()
except Exception,e:
print e
MyHTMLParser:
class MyHTMLParser(HTMLParser):
def __init__(self):
HTMLParser.__init__(self)
self.url=""
self.outfile = "some.png"
def download_file(self):
urllib.urlretrieve( self.url, self.outfile )
def handle_starttag(self, tag, attrs):
if tag == "a":
# after some manipulation here, self.url will have a img url
self.url = "http://somewhere.com/Fondue%C3%A0.png"
當我運行腳本時,我得到
Traceback (most recent call last):
File "test.py", line 59, in <module>
p.feed( data )
File "/usr/lib/python2.7/HTMLParser.py", line 114, in feed
self.goahead(0)
File "/usr/lib/python2.7/HTMLParser.py", line 158, in goahead
k = self.parse_starttag(i)
File "/usr/lib/python2.7/HTMLParser.py", line 305, in parse_starttag
attrvalue = self.unescape(attrvalue)
File "/usr/lib/python2.7/HTMLParser.py", line 472, in unescape
return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
File "/usr/lib/python2.7/re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 56: ordinal not in range(128)
使用發現的建議,我執行了.encode('utf-8')方法,但仍然給我錯誤。 如何解決呢? 謝謝
更換
content = urllib2.urlopen( url.encode('utf-8') ).read()
同
content = urllib2.urlopen(url).read().decode('utf-8')
將響應解碼為unicode。
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.