
How can I write my custom link extractor in Scrapy Python?

I want to write my own custom Scrapy link extractor for extracting links.

The Scrapy documentation says it has two built-in extractors:

http://doc.scrapy.org/en/latest/topics/link-extractors.html

But I haven't seen any code examples of how to implement a custom link extractor. Can someone give an example of writing a custom extractor?

Here is an example of a custom link extractor:

# Imports below are my reconstruction for an older Scrapy release (pre-1.0), where
# SgmlLinkExtractor, linkre and clean_link still existed; see the linked repository
# for the exact original source.
from urlparse import urljoin

from w3lib.html import remove_entities, remove_tags, replace_escape_chars

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.linkextractors.regex import linkre, clean_link
from scrapy.link import Link


class RCP_RegexLinkExtractor(SgmlLinkExtractor):
    """High performant link extractor"""

    def _extract_links(self, response_text, response_url, response_encoding, base_url=None):
        if base_url is None:
            base_url = urljoin(response_url, self.base_url) if self.base_url else response_url

        # Resolve each href against the base URL and strip tags/escape chars from the anchor text
        clean_url = lambda u: urljoin(base_url, remove_entities(clean_link(u.decode(response_encoding))))
        clean_text = lambda t: replace_escape_chars(remove_tags(t.decode(response_encoding))).strip()

        # Find all <a ...>...</a> occurrences with a regex instead of parsing the DOM
        links_text = linkre.findall(response_text)
        urlstext = set([(clean_url(url), clean_text(text)) for url, _, text in links_text])

        return [Link(url, text) for url, text in urlstext]

Usage:

rules = (
    Rule(
        RCP_RegexLinkExtractor(
            allow=(r"epolls/2012/president/[a-z]{2}/[a-z]+_romney_vs_obama-[0-9]{4}\.html"),
            # Regex explanation:
            #     [a-z]{2} - matches a two character state abbreviation
            #     [a-z]+   - matches a state name
            #     [0-9]{4} - matches a 4 number unique webpage identifier

            allow_domains=('realclearpolitics.com',),
        ),
        callback='parseStatePolls',
        # follow=None, # default 
        process_links='processLinks',
        process_request='processRequest',
    ),
)
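For illustration, you can check the allow pattern against a hypothetical URL of the expected shape (the path below is made up just for this example and is not taken from the scraper itself):

import re

pattern = r"epolls/2012/president/[a-z]{2}/[a-z]+_romney_vs_obama-[0-9]{4}\.html"

# Hypothetical URL used only to illustrate the pattern
url = "http://www.realclearpolitics.com/epolls/2012/president/oh/ohio_romney_vs_obama-1234.html"

# True: "oh" is the state abbreviation, "ohio" the state name, "1234" the page id
print(bool(re.search(pattern, url)))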

Have a look here: https://github.com/jtfairbank/RCP-Poll-Scraper

I had a hard time finding recent examples of this, so I decided to post my own walkthrough of writing a custom link extractor.

Why I decided to create a custom link extractor

I had a problem crawling a website whose href URLs contained spaces, tabs and line breaks, like this:

<a href="
       /something/something.html
         " />

Let's say the page containing this link was located at:

http://example.com/something/page.html

Instead of turning this href URL into:

http://example.com/something/something.html

Scrapy turned it into:

http://example.com/something%0A%20%20%20%20%20%20%20/something/something.html%0A%20%20%20%20%20%20%20

This was causing an infinite loop, as the crawler went deeper and deeper into those badly interpreted URLs.

I tried the process_value and process_links parameters of LxmlLinkExtractor, with no luck here, so I decided to patch the method that handles relative URLs.
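For context, this is roughly the kind of process_value attempt that did not help (a sketch for illustration, not my actual spider code); as the original _extract_links() code shown further below makes clear, process_value is only applied after the urljoin() call that has already mangled the URL:

from scrapy.linkextractors import LxmlLinkExtractor

# Sketch of the process_value approach: strip whitespace from each extracted value.
# It does not help here, because LxmlParserLinkExtractor joins the raw attribute
# value with the base URL *before* process_value ever sees it.
link_extractor = LxmlLinkExtractor(
    allow=[r'^https?://example\.com/something/.*'],
    process_value=lambda value: value.strip(),
)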

Finding the original code

In the current version of Scrapy (1.0.3), the recommended link extractor is LxmlLinkExtractor.

If you want to extend LxmlLinkExtractor, you should check how the code works in the Scrapy version you are running.

You can open the location of your currently used Scrapy code by running this from the command line (on OS X):

open $(python -c 'import site; print site.getsitepackages()[0] + "/scrapy"')
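Alternatively, you can ask Python directly where the installed scrapy package lives, which should work on other platforms too:

python -c 'import os, scrapy; print(os.path.dirname(scrapy.__file__))'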

In the version I'm using (1.0.3), the code for LxmlLinkExtractor lives in:

scrapy/linkextractors/lxmlhtml.py

There I saw that the method I needed to adapt was _extract_links() inside LxmlParserLinkExtractor, which is then used by LxmlLinkExtractor.

So I extended LxmlLinkExtractor and LxmlParserLinkExtractor with slightly modified classes called CustomLinkExtractor and CustomParserLinkExtractor. The single line I modified is commented out below.

# Import everything from the original lxmlhtml
from scrapy.linkextractors.lxmlhtml import *
_collect_string_content = etree.XPath("string()")

# Extend LxmlParserLinkExtractor
class CustomParserLinkExtractor(LxmlParserLinkExtractor):

    def _extract_links(self, selector, response_url, response_encoding, base_url):
        links = []
        for el, attr, attr_val in self._iter_links(selector._root):

            # Original method was:
            # attr_val = urljoin(base_url, attr_val)
            # So I just added a .strip()

            attr_val = urljoin(base_url, attr_val.strip())

            url = self.process_attr(attr_val)
            if url is None:
                continue
            if isinstance(url, unicode):
                url = url.encode(response_encoding)
            # to fix relative links after process_value
            url = urljoin(response_url, url)
            link = Link(url, _collect_string_content(el) or u'',
                nofollow=True if el.get('rel') == 'nofollow' else False)
            links.append(link)

        return unique_list(links, key=lambda link: link.url) \
                if self.unique else links


# Extend LxmlLinkExtractor
class CustomLinkExtractor(LxmlLinkExtractor):

    def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                 tags=('a', 'area'), attrs=('href',), canonicalize=True,
                 unique=True, process_value=None, deny_extensions=None, restrict_css=()):
        tags, attrs = set(arg_to_iter(tags)), set(arg_to_iter(attrs))
        tag_func = lambda x: x in tags
        attr_func = lambda x: x in attrs

        # Here I replaced the original LxmlParserLinkExtractor with my CustomParserLinkExtractor
        lx = CustomParserLinkExtractor(tag=tag_func, attr=attr_func,
            unique=unique, process=process_value)

        super(LxmlLinkExtractor, self).__init__(lx, allow=allow, deny=deny,
            allow_domains=allow_domains, deny_domains=deny_domains,
            restrict_xpaths=restrict_xpaths, restrict_css=restrict_css,
            canonicalize=canonicalize, deny_extensions=deny_extensions)

And when defining the rules, I use CustomLinkExtractor:

from scrapy.spiders import Rule


rules = (
    Rule(
        CustomLinkExtractor(
            canonicalize=False,
            allow=[r'^https?\:\/\/example\.com\/something\/.*'],
        ),
        callback='parse_item',
        follow=True,
    ),
)
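As a quick sanity check (a minimal sketch, assuming the CustomLinkExtractor defined above), you can feed the extractor a small response with a whitespace-laden href and confirm that the cleaned URL comes out:

from scrapy.http import HtmlResponse

# Hypothetical snippet reproducing the problematic href with embedded whitespace
body = '<html><body><a href="\n       /something/something.html\n         ">link</a></body></html>'
response = HtmlResponse(url='http://example.com/something/page.html',
                        body=body, encoding='utf-8')

extractor = CustomLinkExtractor(canonicalize=False)
for link in extractor.extract_links(response):
    print(link.url)  # expected: http://example.com/something/something.html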
