
How can I write my custom link extractor in Scrapy Python?

I want to write my own custom Scrapy link extractor for extracting links.

The Scrapy documentation says it has two built-in extractors:

http://doc.scrapy.org/en/latest/topics/link-extractors.html

But I haven't seen any code examples of how to implement a custom link extractor. Can someone give an example of writing a custom extractor?

Here is an example of a custom link extractor:

# Imports below are my reconstruction for an older Scrapy release (pre-1.0), where
# SgmlLinkExtractor, linkre and clean_link still existed; see the linked repository
# for the exact original source.
from urlparse import urljoin

from w3lib.html import remove_entities, remove_tags, replace_escape_chars

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.linkextractors.regex import linkre, clean_link
from scrapy.link import Link


class RCP_RegexLinkExtractor(SgmlLinkExtractor):
    """High performant link extractor"""

    def _extract_links(self, response_text, response_url, response_encoding, base_url=None):
        if base_url is None:
            base_url = urljoin(response_url, self.base_url) if self.base_url else response_url

        # Resolve each href against the base URL and strip tags/escape chars from the anchor text
        clean_url = lambda u: urljoin(base_url, remove_entities(clean_link(u.decode(response_encoding))))
        clean_text = lambda t: replace_escape_chars(remove_tags(t.decode(response_encoding))).strip()

        # Find all <a ...>...</a> occurrences with a regex instead of parsing the DOM
        links_text = linkre.findall(response_text)
        urlstext = set([(clean_url(url), clean_text(text)) for url, _, text in links_text])

        return [Link(url, text) for url, text in urlstext]

Usage:

rules = (
    Rule(
        RCP_RegexLinkExtractor(
            allow=(r"epolls/2012/president/[a-z]{2}/[a-z]+_romney_vs_obama-[0-9]{4}\.html"),
            # Regex explanation:
            #     [a-z]{2} - matches a two character state abbreviation
            #     [a-z]+   - matches a state name
            #     [0-9]{4} - matches a 4 number unique webpage identifier

            allow_domains=('realclearpolitics.com',),
        ),
        callback='parseStatePolls',
        # follow=None, # default 
        process_links='processLinks',
        process_request='processRequest',
    ),
)
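For illustration, you can check the allow pattern against a hypothetical URL of the expected shape (the path below is made up just for this example and is not taken from the scraper itself):

import re

pattern = r"epolls/2012/president/[a-z]{2}/[a-z]+_romney_vs_obama-[0-9]{4}\.html"

# Hypothetical URL used only to illustrate the pattern
url = "http://www.realclearpolitics.com/epolls/2012/president/oh/ohio_romney_vs_obama-1234.html"

# True: "oh" is the state abbreviation, "ohio" the state name, "1234" the page id
print(bool(re.search(pattern, url)))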

Have a look here: https://github.com/jtfairbank/RCP-Poll-Scraper

I had a hard time finding recent examples of this, so I decided to post my own walkthrough of writing a custom link extractor.

Why I decided to create a custom link extractor

I had a problem crawling a website whose href URLs contained spaces, tabs and line breaks, like this:

<a href="
       /something/something.html
         " />

Let's say the page containing this link was located at:

http://example.com/something/page.html

Instead of turning this href URL into:

http://example.com/something/something.html

Scrapy turned it into:

http://example.com/something%0A%20%20%20%20%20%20%20/something/something.html%0A%20%20%20%20%20%20%20

This was causing an infinite loop, as the crawler went deeper and deeper into those badly interpreted URLs.

I tried the process_value and process_links parameters of LxmlLinkExtractor, with no luck here, so I decided to patch the method that handles relative URLs.
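For context, this is roughly the kind of process_value attempt that did not help (a sketch for illustration, not my actual spider code); as the original _extract_links() code shown further below makes clear, process_value is only applied after the urljoin() call that has already mangled the URL:

from scrapy.linkextractors import LxmlLinkExtractor

# Sketch of the process_value approach: strip whitespace from each extracted value.
# It does not help here, because LxmlParserLinkExtractor joins the raw attribute
# value with the base URL *before* process_value ever sees it.
link_extractor = LxmlLinkExtractor(
    allow=[r'^https?://example\.com/something/.*'],
    process_value=lambda value: value.strip(),
)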

Finding the original code

In the current version of Scrapy (1.0.3), the recommended link extractor is LxmlLinkExtractor.

If you want to extend LxmlLinkExtractor, you should check how the code works in the Scrapy version you are running.

You can open the location of your currently used Scrapy code by running this from the command line (on OS X):

open $(python -c 'import site; print site.getsitepackages()[0] + "/scrapy"')
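Alternatively, you can ask Python directly where the installed scrapy package lives, which should work on other platforms too:

python -c 'import os, scrapy; print(os.path.dirname(scrapy.__file__))'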

In the version I'm using (1.0.3), the code for LxmlLinkExtractor lives in:

scrapy/linkextractors/lxmlhtml.py

There I saw that the method I needed to adapt was _extract_links() inside LxmlParserLinkExtractor, which is then used by LxmlLinkExtractor.

So I extended LxmlLinkExtractor and LxmlParserLinkExtractor with slightly modified classes called CustomLinkExtractor and CustomParserLinkExtractor. The single line I modified is commented out below.

# Import everything from the original lxmlhtml
from scrapy.linkextractors.lxmlhtml import *
_collect_string_content = etree.XPath("string()")

# Extend LxmlParserLinkExtractor
class CustomParserLinkExtractor(LxmlParserLinkExtractor):

    def _extract_links(self, selector, response_url, response_encoding, base_url):
        links = []
        for el, attr, attr_val in self._iter_links(selector._root):

            # Original method was:
            # attr_val = urljoin(base_url, attr_val)
            # So I just added a .strip()

            attr_val = urljoin(base_url, attr_val.strip())

            url = self.process_attr(attr_val)
            if url is None:
                continue
            if isinstance(url, unicode):
                url = url.encode(response_encoding)
            # to fix relative links after process_value
            url = urljoin(response_url, url)
            link = Link(url, _collect_string_content(el) or u'',
                nofollow=True if el.get('rel') == 'nofollow' else False)
            links.append(link)

        return unique_list(links, key=lambda link: link.url) \
                if self.unique else links


# Extend LxmlLinkExtractor
class CustomLinkExtractor(LxmlLinkExtractor):

    def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                 tags=('a', 'area'), attrs=('href',), canonicalize=True,
                 unique=True, process_value=None, deny_extensions=None, restrict_css=()):
        tags, attrs = set(arg_to_iter(tags)), set(arg_to_iter(attrs))
        tag_func = lambda x: x in tags
        attr_func = lambda x: x in attrs

        # Here I replaced the original LxmlParserLinkExtractor with my CustomParserLinkExtractor
        lx = CustomParserLinkExtractor(tag=tag_func, attr=attr_func,
            unique=unique, process=process_value)

        super(LxmlLinkExtractor, self).__init__(lx, allow=allow, deny=deny,
            allow_domains=allow_domains, deny_domains=deny_domains,
            restrict_xpaths=restrict_xpaths, restrict_css=restrict_css,
            canonicalize=canonicalize, deny_extensions=deny_extensions)

And when defining the rules, I use CustomLinkExtractor:

from scrapy.spiders import Rule


rules = (
    Rule(
        CustomLinkExtractor(
            canonicalize=False,
            allow=[r'^https?\:\/\/example\.com\/something\/.*'],
        ),
        callback='parse_item',
        follow=True,
    ),
)
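As a quick sanity check (a minimal sketch, assuming the CustomLinkExtractor defined above), you can feed the extractor a small response with a whitespace-laden href and confirm that the cleaned URL comes out:

from scrapy.http import HtmlResponse

# Hypothetical snippet reproducing the problematic href with embedded whitespace
body = '<html><body><a href="\n       /something/something.html\n         ">link</a></body></html>'
response = HtmlResponse(url='http://example.com/something/page.html',
                        body=body, encoding='utf-8')

extractor = CustomLinkExtractor(canonicalize=False)
for link in extractor.extract_links(response):
    print(link.url)  # expected: http://example.com/something/something.html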
