
Scrapy crawler to follow links containing keywords

I have a Scrapy web crawler that is working really well. However, I would like it to follow only links that contain a certain keyword or phrase. I thought I had it figured out, but my output was not correct.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from webcrawler.items import SitegraphItem


class GraphspiderSpider(CrawlSpider):
    name = "examplespider"
    custom_settings = {
        'DEPTH_LIMIT': 2,
    }
    allowed_domains = []
    start_urls = (
        'http://www.example.com/products/',
    )

    rules = (
        Rule(LinkExtractor(allow=r'/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = SitegraphItem()
        i['url'] = response.url
        # i['http_status'] = response.status
        llinks = []
        # Select the href of every anchor whose text is exactly "keyword".
        for href in response.xpath('//a[text()="keyword"]/@href').extract():
            if not href.lower().startswith("javascript"):
                llinks.append(response.urljoin(href))
        i['linkedurls'] = llinks
        return i

    def _response_downloaded(self, response):
        # Override CrawlSpider's internal hook so each page is also saved
        # to disk before the normal rule processing runs.
        filename = response.url.split("/")[-1] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        rule = self._rules[response.meta['rule']]
        return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

I added the "keyword" condition to the XPath expression, but that clearly was not correct. I'm not sure how to get the keyword matching to work right.

See if you can implement your link-filtering logic using LinkExtractor attributes.
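For example, a minimal sketch of that approach (the spider name and the keyword pattern are placeholders; restrict_text requires Scrapy 1.7 or later):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class KeywordCrawlSpider(CrawlSpider):
    name = "keywordcrawler"
    custom_settings = {'DEPTH_LIMIT': 2}
    start_urls = ['http://www.example.com/products/']

    rules = (
        # restrict_text keeps only links whose anchor text matches the
        # pattern; use allow=r'keyword' to match against the URL instead.
        Rule(
            LinkExtractor(restrict_text=r'keyword'),
            callback='parse_item',
            follow=True,
        ),
    )

    def parse_item(self, response):
        yield {'url': response.url}

With this in place, no XPath filtering is needed in the callback, because only matching links are ever followed.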

Otherwise, use Spider instead of CrawlSpider. CrawlSpider is only useful for the limited use cases it supports; Spider works for all use cases.
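A minimal sketch of the plain-Spider version, reusing the question's SitegraphItem (the spider name and the keyword pattern are again placeholders):

import scrapy
from webcrawler.items import SitegraphItem


class KeywordSpider(scrapy.Spider):
    name = "keywordspider"
    custom_settings = {'DEPTH_LIMIT': 2}
    start_urls = ['http://www.example.com/products/']

    def parse(self, response):
        i = SitegraphItem()
        i['url'] = response.url
        # Keep only links whose anchor text contains the keyword.
        hrefs = response.xpath('//a[contains(text(), "keyword")]/@href').extract()
        i['linkedurls'] = [
            response.urljoin(href) for href in hrefs
            if not href.lower().startswith('javascript')
        ]
        yield i
        # Follow the same filtered links; DEPTH_LIMIT still caps recursion.
        for url in i['linkedurls']:
            yield response.follow(url, callback=self.parse)

Here the filtering and the following are done explicitly in parse(), so you can use any XPath predicate you like rather than what LinkExtractor supports.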
