Scrapy python: unicode links error
link encoding
When scraping a site, Scrapy extracts links containing &amd and throws an exception: "Do not instantiate Link objects with unicode urls. Assuming utf-8 encoding (which could be wrong)". How can I fix this error?
I had the same problem with this character → inserted on some links. I found this related commit on GitHub and then used this advice to write a file link_extractors.py with:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.utils.response import get_base_url


class CustomLinkExtractor(SgmlLinkExtractor):
    """Need this to fix the encoding error."""

    def extract_links(self, response):
        base_url = None
        if self.restrict_xpaths:
            hxs = HtmlXPathSelector(response)
            base_url = get_base_url(response)
            body = u''.join(f for x in self.restrict_xpaths
                            for f in hxs.select(x).extract())
            try:
                body = body.encode(response.encoding)
            except UnicodeEncodeError:
                body = body.encode('utf-8')
        else:
            body = response.body

        links = self._extract_links(body, response.url, response.encoding, base_url)
        links = self._process_links(links)
        return links
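The key step in that extractor is the encode-with-fallback: the joined unicode body is encoded with the response's declared encoding, and UTF-8 is used only if that fails. Here is a minimal sketch of that one step, runnable without Scrapy. It assumes Python 3, and the function name encode_with_fallback and the sample strings are hypothetical, chosen only for illustration.

```python
def encode_with_fallback(text, encoding):
    """Encode text with the given encoding, falling back to UTF-8.

    Mirrors the try/except in CustomLinkExtractor.extract_links:
    a character such as U+2192 (the arrow) cannot be represented in
    latin-1, so the first attempt raises UnicodeEncodeError.
    """
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return text.encode('utf-8')


# Hypothetical sample data: one plain link text, one containing the arrow.
plain = u'next page'
with_arrow = u'next \u2192'

print(encode_with_fallback(plain, 'latin-1'))
print(encode_with_fallback(with_arrow, 'latin-1'))
```

The second call cannot be encoded as latin-1, so the bytes it returns are UTF-8. Either way the extractor ends up with a byte string instead of a unicode object.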
Afterwards I used the custom extractor in my spiders.py:
rules = (
    Rule(CustomLinkExtractor(allow=('/gp/offer-listing*', ),
                             restrict_xpaths=("//li[contains(@class,'a-last')]/a", )),
         callback='parse_start_url', follow=True,
         ),
)
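One detail about the allow argument: as far as I know, Scrapy's link extractors treat each entry as a regular expression searched for anywhere in the URL, not as a shell-style glob. The trailing * in '/gp/offer-listing*' therefore quantifies only the final g. A small sketch with the standard re module shows the effect. The helper is_allowed and the example.com URLs are hypothetical.

```python
import re

# The pattern from the rule above; the trailing "*" applies only to the final "g".
ALLOW = re.compile(r'/gp/offer-listing*')


def is_allowed(url):
    """Return True if the pattern is found anywhere in the URL (re.search)."""
    return ALLOW.search(url) is not None


# Hypothetical URLs, for illustration only.
for url in ('http://example.com/gp/offer-listing/B000TEST',
            'http://example.com/gp/offer-listin',
            'http://example.com/gp/product/B000TEST'):
    print(url, is_allowed(url))
```

If the intent is simply "URLs containing /gp/offer-listing", the pattern without the * expresses that more directly.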