
Stop Scrapy crawling the same URLs

I've written a basic Scrapy spider to crawl a website. It seems to run fine, except that it doesn't want to stop: it keeps revisiting the same URLs and returning the same content, and I always end up having to kill it. I suspect it's going over the same URLs over and over again. Is there a rule that will stop this? Or is there something else I have to do? Maybe middleware?

The spider is as below:

# Imports assumed for the pre-1.0 Scrapy API used below (SgmlLinkExtractor, Join);
# the items module path is a guess based on the project layout.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader.processor import Join
from scrapy.selector import Selector

from lsbu.items import LsbuItem


class LsbuSpider(CrawlSpider):
    name = "lsbu6"
    allowed_domains = ["lsbu.ac.uk"]
    start_urls = [
        "http://www.lsbu.ac.uk"
    ]
    rules = [
        Rule(SgmlLinkExtractor(allow=['lsbu.ac.uk/business-and-partners/.+']), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        join = Join()
        sel = Selector(response)
        bits = sel.xpath('//*')
        scraped_bits = []
        for bit in bits:
            scraped_bit = LsbuItem()
            # query the selector, not the item
            scraped_bit['title'] = bit.xpath('//title/text()').extract()
            scraped_bit['desc'] = join(bit.xpath('//*[@id="main_content_main_column"]//text()').extract()).strip()
            scraped_bits.append(scraped_bit)

        return scraped_bits

My settings.py file looks like this:

BOT_NAME = 'lsbu6'
DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'
DUPEFILTER_DEBUG = True
SPIDER_MODULES = ['lsbu.spiders']
NEWSPIDER_MODULE = 'lsbu.spiders'

Any help, guidance, or instruction on stopping it from running continuously would be greatly appreciated...

As I'm a newbie to this, any comments on tidying the code up would also be helpful (or links to good instruction).

Thanks...

The DupeFilter is enabled by default (http://doc.scrapy.org/en/latest/topics/settings.html#dupefilter-class) and it's based on the request URL.
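To make concrete what that filter keys on, here is a minimal sketch (assuming Scrapy 1.0 and its request_fingerprint helper; the URLs are just examples): the default RFPDupeFilter hashes the canonicalised request, so requests that differ only in things like the URL fragment count as duplicates and the second one is dropped by the scheduler.

# Minimal sketch (assuming Scrapy >= 1.0) of how the default RFPDupeFilter
# decides two requests are "the same": it fingerprints the canonicalised
# request (method, URL, body), ignoring URL fragments by default.
from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

r1 = Request("http://www.lsbu.ac.uk/business-and-partners/business")
r2 = Request("http://www.lsbu.ac.uk/business-and-partners/business#contact")

# Same fingerprint, so the scheduler would drop the second request.
print(request_fingerprint(r1) == request_fingerprint(r2))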

I tried a simplified version of your spider in a fresh vanilla Scrapy project without any custom configuration. The dupefilter worked and the crawl stopped after a few requests. I'd say you have something wrong in your settings or in your Scrapy version. I'd suggest you upgrade to Scrapy 1.0, just to be sure :)

$ pip install scrapy --pre

The simplified spider I tested:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy import Item, Field

class LsbuItem(Item):
    title = Field()
    url = Field()

class LsbuSpider(CrawlSpider):
    name = "lsbu6"
    allowed_domains = ["lsbu.ac.uk"]

    start_urls = [
        "http://www.lsbu.ac.uk"
    ]    

    rules = [
        Rule(LinkExtractor(allow=['lsbu.ac.uk/business-and-partners/.+']), callback='parse_item', follow=True),
    ]    

    def parse_item(self, response):
        scraped_bit = LsbuItem()
        scraped_bit['url'] = response.url
        yield scraped_bit
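As a quick sanity check that the filter is actually dropping repeats, you can turn on the DUPEFILTER_DEBUG setting you already have (here via Scrapy's -s command-line override):

$ scrapy crawl lsbu6 -s DUPEFILTER_DEBUG=True

With that enabled, Scrapy should log a "Filtered duplicate request" line for each URL it refuses to revisit.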

Your design makes the crawl go in circles. For example, the page http://www.lsbu.ac.uk/business-and-partners/business contains a link to http://www.lsbu.ac.uk/business-and-partners/partners, and that one in turn contains a link back to the first one. Thus, you go in circles indefinitely.

In order to overcome this, you need to create better rules that eliminate the circular references. Also, you have two identical rules defined, which is not needed. If you want follow=True, you can always put it on the same rule; you don't need a new rule. A sketch of what that could look like is below.
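A rough sketch of a single combined rule (the deny pattern is purely illustrative and not taken from your site; the point is that one Rule can carry both the callback and follow=True):

rules = [
    Rule(
        LinkExtractor(
            allow=[r'lsbu\.ac\.uk/business-and-partners/.+'],
            # illustrative only: deny a sub-section that links back into
            # pages already covered, to break an obvious cycle
            deny=[r'/business-and-partners/partners$'],
        ),
        callback='parse_item',
        follow=True,
    ),
]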
