
How to fix Scrapy in-depth crawling not working

I'm currently trying to create a small web scraping prototype using Scrapy. My current issue is related to link extraction and following.

I'm trying to make Scrapy explore pages and find links to other pages (not images or other content for now), but I don't know how to configure it correctly.

This is the spider I'm using:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class DefaultSpider(CrawlSpider):

    name = "default"
    session_id = -1
    rules = [Rule(LinkExtractor(allow=()), callback='parse', follow=True)]

    def start_requests(self):
        # not relevant: code that builds the list of URLs to be crawled
        for url in listurl:
            # make scrapy follow only the current domain url
            self.rules[0].allow = url
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = Website(response.url, response.text)
        DBInterface.store(page)

The spider doesn't seem to find any links in the pages. I think I'm not doing it the proper way. I also tried using another function as the callback instead of the parse method (changing the rule's callback parameter accordingly):

def processlinks(self, response):
    page = Website(response.url, response.text)
    DBInterface.store(page)

Edit: updated code and title for proper understanding.

CrawlSpider is a special kind of spider that adds rules support to follow links (not to extract them, by the way).

For this spider to work, you can't override the start_requests and parse methods.
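As a reference point, here is a minimal sketch of a CrawlSpider that leaves both methods alone and lets the rules drive the crawl (the spider name, domain, and callback name are placeholder assumptions, not part of the original answer):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = "example"
    start_urls = ["https://example.com"]  # seed URLs go here instead of a custom start_requests
    # follow every extracted link; the callback must not be named 'parse'
    rules = [Rule(LinkExtractor(), callback="parse_item", follow=True)]

    def parse_item(self, response):
        # store or yield whatever you need for each crawled page
        yield {"url": response.url}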

About getting links, I would recommend using the LinkExtractor, which makes the extraction cleaner:

import logging

import scrapy
from scrapy.linkextractors import LinkExtractor

def find_links(self, response):
    for link in LinkExtractor().extract_links(response):
        logging.info('Extracting new url: ' + link.url)
        yield scrapy.Request(link.url, callback=self.insert_linkDB)

More details and up-to-date information about LinkExtractor are available in the documentation.
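For instance, a short sketch of how a LinkExtractor can be narrowed down inside a callback (the patterns, domain, and callback name are illustrative assumptions):

import scrapy
from scrapy.linkextractors import LinkExtractor

def find_article_links(self, response):
    extractor = LinkExtractor(
        allow=(r"/articles/",),          # only follow URLs matching these regexes
        deny=(r"/login",),               # skip URLs matching these regexes
        allow_domains=("example.com",),  # stay on this domain
    )
    for link in extractor.extract_links(response):
        yield scrapy.Request(link.url, callback=self.parse_results)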

It is somewhat tricky to make CrawlSpider handle the initial URLs the same way as the ones it subsequently extracts with LinkExtractor, which is what you want here. The problem is that you should not define a custom callback for any requests that you manually initiate, as that would prevent LinkExtractor from working. On the other hand, you want to perform some action for each crawled URL, including the initial ones. For the URLs extracted with LinkExtractor, you can provide the callback when defining the rule, but that obviously won't work for the initial URLs, which are not extracted using these rules. For this purpose, Scrapy provides another method, parse_start_url(response), which you can and should override. So in your case, the following would do what you want:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class DefaultSpider(CrawlSpider):

    name = "default"
    session_id = -1
    rules = [Rule(LinkExtractor(allow=()), callback='parse_results', follow=True)]

    def start_requests(self):
        # not relevant: code that builds the list of URLs to be crawled
        for url in listurl:
            # make scrapy follow only the current domain url
            self.rules[0].allow = url
            yield scrapy.Request(url=url)

    def parse_start_url(self, response):
        self.parse_results(response)

    def parse_results(self, response):
        page = Website(response.url, response.text)
        DBInterface.store(page)
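For completeness, one possible way to run the spider from a script and cap how deep it crawls (the DEPTH_LIMIT value is an illustrative assumption; running scrapy crawl default from the project directory works as well):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={"DEPTH_LIMIT": 3})  # optional cap on crawl depth (illustrative value)
process.crawl(DefaultSpider)
process.start()  # blocks until the crawl finishes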
