Scrapy needs to crawl all the "next" links on a website and move on to the next page
I need my Scrapy spider to move on to the next page. Please give me the correct rule — how should I write it?
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from delh.items import DelhItem

class criticspider(CrawlSpider):
    name = "delh"
    allowed_domains = ["consumercomplaints.in"]
    #start_urls = ["http://www.consumercomplaints.in/?search=delhivery&page=2","http://www.consumercomplaints.in/?search=delhivery&page=3","http://www.consumercomplaints.in/?search=delhivery&page=4","http://www.consumercomplaints.in/?search=delhivery&page=5","http://www.consumercomplaints.in/?search=delhivery&page=6","http://www.consumercomplaints.in/?search=delhivery&page=7","http://www.consumercomplaints.in/?search=delhivery&page=8","http://www.consumercomplaints.in/?search=delhivery&page=9","http://www.consumercomplaints.in/?search=delhivery&page=10","http://www.consumercomplaints.in/?search=delhivery&page=11"]
    start_urls = ["http://www.consumercomplaints.in/?search=delhivery"]
    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="pagelinks"]/a/@href',)),
             callback="parse_gen", follow=True),
    )

    def parse_gen(self, response):
        hxs = Selector(response)
        sites = hxs.select('//table[@width="100%"]')
        items = []
        for site in sites:
            item = DelhItem()
            item['title'] = site.select('.//td[@class="complaint"]/a/span/text()').extract()
            item['content'] = site.select('.//td[@class="compl-text"]/div/text()').extract()
            items.append(item)
        return items

spider = criticspider()
From my understanding you are trying to scrape two sorts of pages, so you should use two distinct rules:
Your rules should then look something like:
rules = (
    Rule(LinkExtractor(restrict_xpaths='{{ item selector }}'), callback='parse_gen'),
    Rule(LinkExtractor(restrict_xpaths='//div[@class="pagelinks"]/a[contains(text(), "Next")]')),
)
Explanations:

- The first rule extracts links to the item pages and uses parse_gen as its callback. The resulting responses do not go through these rules again.
- Notice: SgmlLinkExtractor is obsolete and you should use LxmlLinkExtractor (or its alias LinkExtractor) instead (source).
- A [contains(text(), "Next")] selector was added to the "pagelinks" rule. This way each "list page" gets requested exactly once.
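To illustrate what the [contains(text(), "Next")] predicate singles out, here is a stdlib-only sketch that scans an invented "pagelinks" div and keeps only anchors whose text contains "Next" (the sample HTML and hrefs are made up for illustration; Scrapy itself evaluates the XPath natively):

```python
# Stdlib-only demo: from a pagelinks div, collect only the href of
# anchors whose text contains "Next" -- the same links the XPath
# predicate [contains(text(), "Next")] would restrict extraction to.
from html.parser import HTMLParser


class NextLinkFinder(HTMLParser):
    """Collects the href of every <a> whose text contains 'Next'."""

    def __init__(self):
        super().__init__()
        self._in_anchor = False
        self._href = None
        self.next_hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Keep the link only when the anchor text mentions "Next".
        if self._in_anchor and "Next" in data and self._href:
            self.next_hrefs.append(self._href)

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False
            self._href = None


# Hypothetical pagination markup: numbered links plus a "Next" link.
sample = (
    '<div class="pagelinks">'
    '<a href="/page/1">1</a>'
    '<a href="/page/2">2</a>'
    '<a href="/page/2">Next</a>'
    "</div>"
)

parser = NextLinkFinder()
parser.feed(sample)
print(parser.next_hrefs)  # only the "Next" anchor's href survives
```

Because only the "Next" link is followed from each list page, the crawl walks the pagination one page at a time instead of re-requesting every numbered page from every list page.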