Scrapy - get multiple urls from one link
Sample html:
<div id="foobar" foo="hello;world;bar;baz">blablabla</div>
I'm using LinkExtractor to get the attribute foo, namely the string hello;world;bar;baz. I wonder if it's possible to turn this string into multiple urls for the spider to follow, like hello.com, world.com, etc.

Any help is appreciated.
PS: the following might (or might not) be useful:

- the process_value argument of LxmlLinkExtractor
- the process_links argument of Rule
The problem is that, if you are using the built-in LinkExtractor, the process_value callable has to return a single link; it would fail here because, in your case, it returns a list of links.
You would have to have a custom parser link extractor which supports extracting multiple links per attribute, something like this (not tested):
class MyParserLinkExtractor(LxmlParserLinkExtractor):
    def _extract_links(self, selector, response_url, response_encoding, base_url):
        links = []
        # hacky way to get the underlying lxml parsed document
        for el, attr, attr_val in self._iter_links(selector.root):
            # pseudo lxml.html.HtmlElement.make_links_absolute(base_url)
            try:
                attr_val = urljoin(base_url, attr_val)
            except ValueError:
                continue  # skipping bogus links
            else:
                urls = self.process_attr(attr_val)
                if urls is None:
                    continue

            # urls here is a list returned by process_value
            for item in urls:
                if isinstance(item, unicode):
                    item = item.encode(response_encoding)
                url = urljoin(response_url, item)
                link = Link(url, _collect_string_content(el) or u'',
                            nofollow=rel_has_nofollow(el.get('rel')))
                links.append(link)

        return unique_list(links, key=lambda link: link.url) \
            if self.unique else links
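The absolute-URL resolution step above can be checked in isolation with the standard library, no Scrapy needed; the response URL and extracted values here are made-up examples:

```python
from urllib.parse import urljoin  # urlparse.urljoin on Python 2

response_url = "https://example.com/page"  # hypothetical response URL
extracted = ["https://hello.com", "relative/path"]

# already-absolute URLs pass through urljoin unchanged;
# relative ones are resolved against the response URL
absolute = [urljoin(response_url, u) for u in extracted]
print(absolute)
# → ['https://hello.com', 'https://example.com/relative/path']
```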
Then, based on it, define your actual Link Extractor:
class MyLinkExtractor(LxmlLinkExtractor):
    def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                 tags=('a', 'area'), attrs=('href',), canonicalize=True,
                 unique=True, process_value=None, deny_extensions=None, restrict_css=()):
        tags, attrs = set(arg_to_iter(tags)), set(arg_to_iter(attrs))
        tag_func = lambda x: x in tags
        attr_func = lambda x: x in attrs
        lx = MyParserLinkExtractor(tag=tag_func, attr=attr_func,
                                   unique=unique, process=process_value)

        # skip LxmlLinkExtractor.__init__ so the custom parser-level extractor is used
        super(LxmlLinkExtractor, self).__init__(lx, allow=allow, deny=deny,
            allow_domains=allow_domains, deny_domains=deny_domains,
            restrict_xpaths=restrict_xpaths, restrict_css=restrict_css,
            canonicalize=canonicalize, deny_extensions=deny_extensions)
You would then need to have tags, attrs and process_value defined:
MyLinkExtractor(tags=["div"], attrs=["foo"], process_value=extract_links)
where extract_links is defined as:
def extract_links(value):
return ["https://{}.com".format(part) for part in value.split(";")]
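For the sample attribute in the question, the callable splits the value into one URL per semicolon-separated part (repeating the definition here so the snippet stands alone; the .com scheme follows the question's example):

```python
def extract_links(value):
    # turn "hello;world;bar;baz" into one URL per part
    return ["https://{}.com".format(part) for part in value.split(";")]

print(extract_links("hello;world;bar;baz"))
# → ['https://hello.com', 'https://world.com', 'https://bar.com', 'https://baz.com']
```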
This will work for you:
def url_break(value):
    for url in value.split(';'):
        yield url

class MySpider(CrawlSpider):
    rules = [Rule(SgmlLinkExtractor(restrict_xpaths=YOUR_XPATH_LIST, process_value=url_break))]
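The generator itself can be exercised without Scrapy (repeating the definition so the snippet is self-contained):

```python
def url_break(value):
    # yield each semicolon-separated part as its own URL
    for url in value.split(';'):
        yield url

print(list(url_break("hello.com;world.com;bar.com")))
# → ['hello.com', 'world.com', 'bar.com']
```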