How can I write my custom link extractor in Scrapy (Python)?
I want to write my own custom Scrapy link extractor to extract links.
The Scrapy documentation says it has two built-in extractors:
http://doc.scrapy.org/en/latest/topics/link-extractors.html
But I have not seen any code example of how to implement a custom link extractor. Can someone give some example of writing a custom extractor?
Here is an example of a custom link extractor:
import re
from urlparse import urljoin  # Python 2, matching the Scrapy 1.x era of this answer

from scrapy.link import Link
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from w3lib.html import remove_entities, replace_escape_chars, remove_tags

# linkre and clean_link are project-level helpers assumed by this snippet: a
# compiled regex capturing (href, other attrs, anchor text), and a stripper.
linkre = re.compile(r'<a\s+href="([^"]+)"([^>]*)>(.*?)</a>', re.I | re.S)
clean_link = lambda u: u.strip(' \t\r\n')

class RCP_RegexLinkExtractor(SgmlLinkExtractor):
    """High performant link extractor"""
    def _extract_links(self, response_text, response_url, response_encoding, base_url=None):
        if base_url is None:
            base_url = urljoin(response_url, self.base_url) if self.base_url else response_url
        clean_url = lambda u: urljoin(base_url, remove_entities(clean_link(u.decode(response_encoding))))
        clean_text = lambda t: replace_escape_chars(remove_tags(t.decode(response_encoding))).strip()
        links_text = linkre.findall(response_text)
        urlstext = set([(clean_url(url), clean_text(text)) for url, _, text in links_text])
        return [Link(url, text) for url, text in urlstext]
Usage:
rules = (
    Rule(
        RCP_RegexLinkExtractor(
            allow=(r"epolls/2012/president/[a-z]{2}/[a-z]+_romney_vs_obama-[0-9]{4}\.html"),
            # Regex explanation:
            #   [a-z]{2} - matches a two character state abbreviation
            #   [a-z]+   - matches a state name
            #   [0-9]{4} - matches a 4 number unique webpage identifier
            allow_domains=('realclearpolitics.com',),
        ),
        callback='parseStatePolls',
        # follow=None, # default
        process_links='processLinks',      # spider methods; sketched below
        process_request='processRequest',
    ),
)
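The Rule refers to processLinks and processRequest by name; Scrapy resolves these to methods on the spider. Their bodies are not part of the original example, so the following is only a hypothetical sketch of the signatures such methods need:

# Hypothetical spider methods for the two callbacks named in the Rule above.
# process_links receives the list of Link objects the extractor matched;
# process_request receives each Request built from them.
def processLinks(self, links):
    for link in links:
        self.log('extracted link: %s' % link.url)
    return links  # return the (possibly filtered) list

def processRequest(self, request):
    return request  # return the request unchanged, or None to drop it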
I had a hard time finding recent examples, so I decided to post my process of writing a custom link extractor.
I had a problem crawling a website whose href URLs contain spaces, tabs and line breaks, like this:
<a href="
/something/something.html
" />
Suppose the page containing this link is located at:
http://example.com/something/page.html
Instead of converting this href URL to:
http://example.com/something/something.html
Scrapy converted it into a malformed URL that still carried the percent-encoded whitespace and newlines in its path.
This caused an infinite loop, because the crawler followed those badly interpreted URLs deeper and deeper.
I tried using the process_value and process_links arguments of LxmlLinkExtractor, with no luck, so I decided to patch the method that handles relative URLs.
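The original post does not show those attempts, but a process_value attempt would look roughly like the sketch below. It does not help because, in Scrapy 1.0.3, process_value runs after the raw href value has already been joined against the base URL, so stripping at that point comes too late:

# Sketch of the kind of workaround that did not help: by the time
# process_value runs, the dirty href has already been urljoin'ed.
from scrapy.linkextractors import LxmlLinkExtractor

link_extractor = LxmlLinkExtractor(
    allow=['^https?\:\/\/example\.com\/something\/.*'],
    process_value=lambda value: value.strip(),  # whitespace already baked into the URL
)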
In the current version of Scrapy (1.0.3), the recommended link extractor is LxmlLinkExtractor.
If you want to extend LxmlLinkExtractor, you should check how the code behaves in the Scrapy version you are using.
You can open the location of the scrapy code you are currently using by running from the command line (on OS X):
open $(python -c 'import site; print site.getsitepackages()[0] + "/scrapy"')
In the version I use (1.0.3), the code of LxmlLinkExtractor is in:
scrapy/linkextractors/lxmlhtml.py
There I saw that the method I needed to adapt is _extract_links() inside LxmlParserLinkExtractor, which is then used by LxmlLinkExtractor.
So I extended LxmlLinkExtractor and LxmlParserLinkExtractor with slightly modified classes called CustomLinkExtractor and CustomParserLinkExtractor. The single line I modified is commented out.
# Import everything from the original lxmlhtml
from scrapy.linkextractors.lxmlhtml import *

_collect_string_content = etree.XPath("string()")

# Extend LxmlParserLinkExtractor
class CustomParserLinkExtractor(LxmlParserLinkExtractor):

    def _extract_links(self, selector, response_url, response_encoding, base_url):
        links = []
        for el, attr, attr_val in self._iter_links(selector._root):
            # Original method was:
            # attr_val = urljoin(base_url, attr_val)
            # So I just added a .strip()
            attr_val = urljoin(base_url, attr_val.strip())
            url = self.process_attr(attr_val)
            if url is None:
                continue
            if isinstance(url, unicode):
                url = url.encode(response_encoding)
            # to fix relative links after process_value
            url = urljoin(response_url, url)
            link = Link(url, _collect_string_content(el) or u'',
                        nofollow=True if el.get('rel') == 'nofollow' else False)
            links.append(link)
        return unique_list(links, key=lambda link: link.url) \
            if self.unique else links

# Extend LxmlLinkExtractor
class CustomLinkExtractor(LxmlLinkExtractor):

    def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                 tags=('a', 'area'), attrs=('href',), canonicalize=True,
                 unique=True, process_value=None, deny_extensions=None, restrict_css=()):
        tags, attrs = set(arg_to_iter(tags)), set(arg_to_iter(attrs))
        tag_func = lambda x: x in tags
        attr_func = lambda x: x in attrs
        # Here I replaced the original LxmlParserLinkExtractor with my CustomParserLinkExtractor
        lx = CustomParserLinkExtractor(tag=tag_func, attr=attr_func,
                                       unique=unique, process=process_value)

        super(LxmlLinkExtractor, self).__init__(lx, allow=allow, deny=deny,
                                                allow_domains=allow_domains, deny_domains=deny_domains,
                                                restrict_xpaths=restrict_xpaths, restrict_css=restrict_css,
                                                canonicalize=canonicalize, deny_extensions=deny_extensions)
When defining the rules, I use CustomLinkExtractor (a minimal spider putting it all together is sketched after the rules):
from scrapy.spiders import Rule

rules = (
    Rule(CustomLinkExtractor(canonicalize=False,
                             allow=[('^https?\:\/\/example\.com\/something\/.*'),]),
         callback='parse_item', follow=True),
)
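For context, such rules live inside a CrawlSpider; a minimal, hypothetical spider wiring everything together could look like this (the spider name and URLs are placeholders):

# Hypothetical CrawlSpider using CustomLinkExtractor; names and URLs are placeholders.
from scrapy.spiders import CrawlSpider, Rule

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/something/page.html']

    rules = (
        Rule(CustomLinkExtractor(canonicalize=False,
                                 allow=[('^https?\:\/\/example\.com\/something\/.*'),]),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.log('visited: %s' % response.url)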
I found LinkExtractor examples at https://github.com/geekan/scrapy-examples and https://github.com/mjhea0/Scrapy-Samples
(edited after people could not find the information they needed in the links above)
More precisely, at https://github.com/geekan/scrapy-examples/search?utf8=%E2%9C%93&q=linkextractors&type=Code and https://github.com/mjhea0/Scrapy-Samples/search?utf8=%E2%9C%93&q=linkextractors