Scrapy spider that only crawls URLs once

I am writing a Scrapy spider that crawls a set of URLs once per day. However, some of these websites are very big, so I cannot crawl the full site daily, nor would I want to generate the massive traffic necessary to do so.

An old question (here) asked something similar. However, the upvoted response simply points to a code snippet (here), which seems to require something of the request instance, though that is not explained in the response, nor on the page containing the code snippet.

I'm trying to make sense of this but find middleware a bit confusing. A complete example of a scraper which can be run multiple times without rescraping URLs would be very useful, whether or not it uses the linked middleware.

I've posted code below to get the ball rolling but I don't necessarily need to use this middleware. Any Scrapy spider that can crawl daily and extract new URLs will do. Obviously one solution is to just write out a dictionary of scraped URLs and then check to confirm that each new URL is/isn't in the dictionary, but that seems very slow/inefficient.
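For what it's worth, the "dictionary of scraped URLs" idea doesn't have to be slow: an in-memory set (persisted to a flat file between runs) gives constant-time lookups, and a CrawlSpider Rule can filter links through it before any request is even scheduled. Below is a minimal sketch of that approach, independent of the linked middleware; the seen_urls.txt file name and the class name are just examples.

import os

from scrapy.contrib.spiders import CrawlSpider, Rule  # scrapy.spiders in Scrapy >= 1.0
from scrapy.contrib.linkextractors import LinkExtractor
from cnn_scrapy.items import NewspaperItem

SEEN_FILE = "seen_urls.txt"  # example path; one already-scraped URL per line


class SeenURLSpider(CrawlSpider):
    name = "newspaper_seen"
    allowed_domains = ["cnn.com"]
    start_urls = ["http://www.cnn.com/"]

    rules = (
        # process_links filters extracted links before requests are scheduled
        Rule(LinkExtractor(), callback="parse_item", follow=True,
             process_links="drop_seen_links"),
    )

    def __init__(self, *args, **kwargs):
        super(SeenURLSpider, self).__init__(*args, **kwargs)
        # Load URLs recorded by earlier runs; set membership checks are O(1).
        self.seen_urls = set()
        if os.path.exists(SEEN_FILE):
            with open(SEEN_FILE) as f:
                self.seen_urls = set(line.strip() for line in f)
        self.seen_file = open(SEEN_FILE, "a")

    def drop_seen_links(self, links):
        # Links crawled on a previous run never become requests at all.
        return [link for link in links if link.url not in self.seen_urls]

    def parse_item(self, response):
        self.seen_urls.add(response.url)
        self.seen_file.write(response.url + "\n")
        item = NewspaperItem()
        item["url"] = response.url
        yield item

    def closed(self, reason):
        # A spider method named 'closed' is automatically hooked to spider_closed.
        self.seen_file.close()

If you would rather not manage a file yourself, running the crawl with a persistent job directory (scrapy crawl newspaper -s JOBDIR=crawls/newspaper) should achieve much the same thing, since the built-in RFPDupeFilter reloads its set of seen request fingerprints from that directory on the next run (start_urls are still fetched, because start requests are not filtered).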

Spider

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from cnn_scrapy.items import NewspaperItem



class NewspaperSpider(CrawlSpider):
    name = "newspaper"
    allowed_domains = ["cnn.com"]
    start_urls = [
        "http://www.cnn.com/"
    ]

    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        self.log("Scraping: " + response.url)
        item = NewspaperItem()
        item["url"] = response.url
        yield item

Items

import scrapy


class NewspaperItem(scrapy.Item):
    url = scrapy.Field()
    visit_id = scrapy.Field()
    visit_status = scrapy.Field()

Middlewares (ignore.py)

from scrapy import log
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.request import request_fingerprint

from cnn_scrapy.items import NewspaperItem

class IgnoreVisitedItems(object):
    """Middleware to ignore re-visiting item pages if they were already visited
    before. The requests to be filtered by have a meta['filter_visited'] flag
    enabled and optionally define an id to use for identifying them, which
    defaults the request fingerprint, although you'd want to use the item id,
    if you already have it beforehand to make it more robust.
    """

    FILTER_VISITED = 'filter_visited'
    VISITED_ID = 'visited_id'
    CONTEXT_KEY = 'visited_ids'

    def process_spider_output(self, response, result, spider):
        context = getattr(spider, 'context', {})
        visited_ids = context.setdefault(self.CONTEXT_KEY, {})
        ret = []
        for x in result:
            visited = False
            if isinstance(x, Request):
                if self.FILTER_VISITED in x.meta:
                    visit_id = self._visited_id(x)
                    if visit_id in visited_ids:
                        log.msg("Ignoring already visited: %s" % x.url,
                                level=log.INFO, spider=spider)
                        visited = True
            elif isinstance(x, BaseItem):
                visit_id = self._visited_id(response.request)
                if visit_id:
                    visited_ids[visit_id] = True
                    x['visit_id'] = visit_id
                    x['visit_status'] = 'new'
            if visited:
                ret.append(NewspaperItem(visit_id=visit_id, visit_status='old'))
            else:
                ret.append(x)
        return ret

    def _visited_id(self, request):
        return request.meta.get(self.VISITED_ID) or request_fingerprint(request)
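As far as I can tell, the "something of the request instance" that this snippet expects is a meta['filter_visited'] flag on every request you want filtered, plus the middleware being enabled in SPIDER_MIDDLEWARES. A rough sketch of that wiring follows; the dotted path and the priority value are assumptions that depend on where ignore.py sits in the project.

# settings.py -- enable the spider middleware (path and priority are examples)
SPIDER_MIDDLEWARES = {
    "cnn_scrapy.ignore.IgnoreVisitedItems": 543,
}

# In the spider, tag the requests that should be filtered. With a CrawlSpider,
# the Rule's process_request hook is a convenient place to do it:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from cnn_scrapy.items import NewspaperItem


class NewspaperSpider(CrawlSpider):
    name = "newspaper"
    allowed_domains = ["cnn.com"]
    start_urls = ["http://www.cnn.com/"]

    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True,
             process_request="tag_request"),
    )

    def tag_request(self, request):
        # IgnoreVisitedItems only filters requests that carry this flag.
        # (Newer Scrapy versions also pass the response to this hook.)
        request.meta["filter_visited"] = True
        return request

    def parse_item(self, response):
        item = NewspaperItem()
        item["url"] = response.url
        yield item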

Answer

Here's the thing: what you want is to have one database that your scheduled/cron'd crawl works against. Dupe-filter middleware or not, you're still having to scrape the entire site regardless... and I feel that, even though the code provided obviously can't be the entire project, it is way too much code.

I'm not exactly sure what it is that you're scraping, but I'm going to assume right now that you have CNN as the project's URL and that you're scraping articles?

What I would do would be to use CNN's RSS feeds, or even the sitemap, given that those provide a date with the article meta, and, using the OS module:

- Define the date of each crawl instance
- Using regex, restrict the itemization by checking the crawler's defined date against the date each article was posted
- Deploy and schedule the crawl to/in Scrapinghub
- Use Scrapinghub's Python API client to iterate through the items (see the sketch below)
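As a rough illustration of that last step, iterating over stored items with the python-scrapinghub client looks something like the sketch below; the API key, project id and spider name are placeholders, and the exact calls depend on the client version.

from scrapinghub import ScrapinghubClient

client = ScrapinghubClient("YOUR_API_KEY")  # placeholder API key
project = client.get_project(123456)        # placeholder project id

seen_urls = set()
# Walk every finished job of the spider and collect the URLs already scraped.
for job_summary in project.jobs.iter(spider="newspaper", state="finished"):
    job = client.get_job(job_summary["key"])
    for item in job.items.iter():
        seen_urls.add(item.get("url"))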

You'd still be crawling the entire site's content, but an XMLFeedSpider or RSS spider class is perfect for parsing all that data more quickly... And now that the db is available in a "cloud", I feel the project could be more modular and scalable, with much easier portability/cross-compatibility.
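For the feed-based route, a minimal sketch with Scrapy's XMLFeedSpider might look like the following. The feed URL is just one example CNN RSS feed, and the "last 24 hours" cut-off stands in for however you record the date of the previous crawl.

from datetime import datetime, timedelta

from scrapy.contrib.spiders import XMLFeedSpider  # scrapy.spiders in Scrapy >= 1.0
from cnn_scrapy.items import NewspaperItem


class NewspaperFeedSpider(XMLFeedSpider):
    name = "newspaper_feed"
    allowed_domains = ["cnn.com"]
    start_urls = ["http://rss.cnn.com/rss/cnn_topstories.rss"]  # example feed
    iterator = "iternodes"
    itertag = "item"

    def parse_node(self, response, node):
        # RSS pubDate looks like "Mon, 02 Mar 2015 12:00:00 GMT"; drop the
        # trailing timezone token before parsing.
        pub_date = node.xpath("pubDate/text()").extract()[0]
        published = datetime.strptime(pub_date.rsplit(" ", 1)[0],
                                      "%a, %d %b %Y %H:%M:%S")
        # Skip anything older than the last daily run.
        if datetime.utcnow() - published > timedelta(days=1):
            return
        item = NewspaperItem()
        item["url"] = node.xpath("link/text()").extract()[0]
        yield item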

I'm sure the flow I'm describing would be subject to some tinkering, but the idea is straightforward.
