How to follow links with Scrapy with a depth of 2?

I am writing a scraper which should extract all links from an initial webpage if the page has any of a given set of keywords in its metadata, follow the links whose URLs contain 'http', and repeat the procedure on the resulting pages, so the depth of the scraping will be 2. This is my code:

from scrapy.spider import Spider
from scrapy import Selector
from socialmedia.items import SocialMediaItem
from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(Spider):
    name = 'smm'
    allowed_domains = ['*']
    start_urls = ['http://en.wikipedia.org/wiki/Social_media']
    rules = (
             Rule(SgmlLinkExtractor(allow=()), callback="parse_items", follow= True),
             )
    def parse_items(self, response):
        items = []
        #Define keywords present in metadata to scrap the webpage
        keywords = ['social media','social business','social networking','social marketing','online marketing','social selling',
            'social customer experience management','social cxm','social cem','social crm','google analytics','seo','sem',
            'digital marketing','social media manager','community manager']
        for link in response.xpath("//a"):
            item = SocialMediaItem()
            #Extract webpage keywords 
            metakeywords = link.xpath('//meta[@name="keywords"]').extract()
            #Compare keywords and extract if one of the defined keyboards is present in the metadata
            if (keywords in metaKW for metaKW in metakeywords):
                    item['SourceTitle'] = link.xpath('/html/head/title').extract()
                    item['TargetTitle'] = link.xpath('text()').extract()
                    item['link'] = link.xpath('@href').extract()
                    outbound = str(link.xpath('@href').extract())
                    if 'http' in outbound:
                        items.append(item)
        return items

But I get this error:

    Traceback (most recent call last):
      File "C:\Anaconda\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop
        self.runUntilCurrent()
      File "C:\Anaconda\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "C:\Anaconda\lib\site-packages\twisted\internet\defer.py", line 382, in callback
        self._startRunCallbacks(result)
      File "C:\Anaconda\lib\site-packages\twisted\internet\defer.py", line 490, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "C:\Anaconda\lib\site-packages\twisted\internet\defer.py", line 577, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "C:\Anaconda\lib\site-packages\scrapy\spider.py", line 56, in parse
        raise NotImplementedError
    exceptions.NotImplementedError: 

Can you help me follow the links containing http in their URLs? Thanks!

Dani

It is ignoring the rules for two main reasons here:

  • you need to use CrawlSpider, not a regular Spider: a plain Spider ignores the rules attribute entirely
  • with a plain Spider, every response is routed to the default parse() callback, which your class never defines; the Spider base class implementation raises the NotImplementedError you see in the traceback. Once you switch to CrawlSpider, the rule will route responses to your parse_items() instead.

In your code, change class MySpider(Spider): to class MySpider(CrawlSpider):
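
A minimal corrected sketch along those lines (untested against the asker's project; it assumes the same SocialMediaItem and the old scrapy.contrib module paths used in the question, which newer Scrapy releases replace with scrapy.spiders.CrawlSpider, scrapy.spiders.Rule and scrapy.linkextractors.LinkExtractor):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from socialmedia.items import SocialMediaItem

class MySpider(CrawlSpider):  # CrawlSpider, so the rules attribute is honoured
    name = 'smm'
    # allowed_domains is omitted on purpose: leaving it out allows every
    # domain, whereas ['*'] is not a valid value
    start_urls = ['http://en.wikipedia.org/wiki/Social_media']
    rules = (
        # follow every extracted link and hand each fetched page to parse_items()
        Rule(SgmlLinkExtractor(allow=()), callback='parse_items', follow=True),
    )

    def parse_items(self, response):
        # never override parse() in a CrawlSpider: it drives the rule machinery
        keywords = ['social media', 'social networking', 'digital marketing']  # trimmed list for brevity
        # read the page's meta keywords once per response, not once per link
        meta = ' '.join(response.xpath('//meta[@name="keywords"]/@content').extract()).lower()
        if any(kw in meta for kw in keywords):
            for link in response.xpath('//a'):
                href = ''.join(link.xpath('@href').extract())
                if 'http' in href:
                    item = SocialMediaItem()
                    item['SourceTitle'] = response.xpath('/html/head/title/text()').extract()
                    item['TargetTitle'] = link.xpath('text()').extract()
                    item['link'] = href
                    yield item

For the depth-of-2 requirement in the question, Scrapy's DEPTH_LIMIT setting caps how many link hops from the start URLs are followed, e.g. DEPTH_LIMIT = 2 in the project's settings.py.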
