How to follow links with Scrapy with a depth of 2?
I am writing a scraper which should extract all links from an initial webpage if it has any of the given keywords in its metadata; if a link contains 'http' in its URL, it should follow it and repeat the procedure twice, so the scraping depth will be 2. This is my code:
from scrapy.spider import Spider
from scrapy import Selector
from socialmedia.items import SocialMediaItem
from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(Spider):
    name = 'smm'
    allowed_domains = ['*']
    start_urls = ['http://en.wikipedia.org/wiki/Social_media']
    rules = (
        Rule(SgmlLinkExtractor(allow=()), callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        items = []
        # Define keywords that must be present in the metadata to scrape the webpage
        keywords = ['social media', 'social business', 'social networking', 'social marketing',
                    'online marketing', 'social selling', 'social customer experience management',
                    'social cxm', 'social cem', 'social crm', 'google analytics', 'seo', 'sem',
                    'digital marketing', 'social media manager', 'community manager']
        for link in response.xpath("//a"):
            item = SocialMediaItem()
            # Extract webpage keywords
            metakeywords = link.xpath('//meta[@name="keywords"]').extract()
            # Compare keywords and extract if one of the defined keywords is present in the metadata
            if (keywords in metaKW for metaKW in metakeywords):
                item['SourceTitle'] = link.xpath('/html/head/title').extract()
                item['TargetTitle'] = link.xpath('text()').extract()
                item['link'] = link.xpath('@href').extract()
                outbound = str(link.xpath('@href').extract())
                if 'http' in outbound:
                    items.append(item)
        return items
But I get this error:
Traceback (most recent call last):
File "C:\Anaconda\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop
self.runUntilCurrent()
File "C:\Anaconda\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "C:\Anaconda\lib\site-packages\twisted\internet\defer.py", line 382, in callback
self._startRunCallbacks(result)
File "C:\Anaconda\lib\site-packages\twisted\internet\defer.py", line 490, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "C:\Anaconda\lib\site-packages\twisted\internet\defer.py", line 577, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "C:\Anaconda\lib\site-packages\scrapy\spider.py", line 56, in parse
raise NotImplementedError
exceptions.NotImplementedError:
Can you help me follow the links containing 'http' in their URLs? Thanks!
Dani
It is ignoring the rule for two main reasons here:

1. Rules only work with a CrawlSpider, not a regular Spider.
2. Your callback parse() doesn't exist: a plain Spider calls parse() on every response, and since you never defined it, Scrapy raises NotImplementedError. Either rename parse_items() to parse() if you keep the plain Spider, or (so that the rules actually take effect) change class MySpider(Spider): to class MySpider(CrawlSpider):, which will invoke parse_items() as the rule's callback.