
Scrapy crawler to follow links containing keywords

I have a Scrapy web crawler that is working really well. However, I would like it to follow only links that contain a certain keyword or phrase. I thought I had it figured out, but my output was not correct.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from webcrawler.items import SitegraphItem


class GraphspiderSpider(CrawlSpider):
    name = "examplespider"
    custom_settings = {
        'DEPTH_LIMIT': 2,
    }
    allowed_domains = []
    start_urls = (
        'http://www.example.com/products/',
    )

    rules = (
        Rule(LinkExtractor(allow=r'/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = SitegraphItem()
        i['url'] = response.url
        # i['http_status'] = response.status
        llinks = []
        # Select the href of every anchor whose text is exactly "keyword".
        for href in response.xpath('//a[text()="keyword"]/@href').extract():
            if not href.lower().startswith("javascript"):
                llinks.append(response.urljoin(href))
        i['linkedurls'] = llinks
        return i

    def _response_downloaded(self, response):
        # Override CrawlSpider's internal hook so each page is also saved
        # to disk before the normal rule processing runs.
        filename = response.url.split("/")[-1] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        rule = self._rules[response.meta['rule']]
        return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

I added the "keyword" condition to the XPath expression, but that clearly was not correct. I'm not sure how to get the keyword matching to work right.

See if you can implement your link-filtering logic using LinkExtractor attributes.
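For example, a minimal sketch of that approach (the spider name and the keyword pattern are placeholders; restrict_text requires Scrapy 1.7 or later):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class KeywordCrawlSpider(CrawlSpider):
    name = "keywordcrawler"
    custom_settings = {'DEPTH_LIMIT': 2}
    start_urls = ['http://www.example.com/products/']

    rules = (
        # restrict_text keeps only links whose anchor text matches the
        # pattern; use allow=r'keyword' to match against the URL instead.
        Rule(
            LinkExtractor(restrict_text=r'keyword'),
            callback='parse_item',
            follow=True,
        ),
    )

    def parse_item(self, response):
        yield {'url': response.url}

With this in place, no XPath filtering is needed in the callback, because only matching links are ever followed.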

Otherwise, use Spider instead of CrawlSpider. CrawlSpider is only useful for the limited use cases it supports; Spider works for all use cases.
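A minimal sketch of the plain-Spider version, reusing the question's SitegraphItem (the spider name and the keyword pattern are again placeholders):

import scrapy
from webcrawler.items import SitegraphItem


class KeywordSpider(scrapy.Spider):
    name = "keywordspider"
    custom_settings = {'DEPTH_LIMIT': 2}
    start_urls = ['http://www.example.com/products/']

    def parse(self, response):
        i = SitegraphItem()
        i['url'] = response.url
        # Keep only links whose anchor text contains the keyword.
        hrefs = response.xpath('//a[contains(text(), "keyword")]/@href').extract()
        i['linkedurls'] = [
            response.urljoin(href) for href in hrefs
            if not href.lower().startswith('javascript')
        ]
        yield i
        # Follow the same filtered links; DEPTH_LIMIT still caps recursion.
        for url in i['linkedurls']:
            yield response.follow(url, callback=self.parse)

Here the filtering and the following are done explicitly in parse(), so you can use any XPath predicate you like rather than what LinkExtractor supports.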
