Scrapy needs to crawl all the "next" links on a website and move on to the next page
I need my Scrapy spider to move on to the next page. Please give me the correct code for the rule. How should it be written?
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from delh.items import DelhItem
class criticspider(CrawlSpider):
    name = "delh"
    allowed_domains = ["consumercomplaints.in"]
    #start_urls =["http://www.consumercomplaints.in/?search=delhivery&page=2","http://www.consumercomplaints.in/?search=delhivery&page=3","http://www.consumercomplaints.in/?search=delhivery&page=4","http://www.consumercomplaints.in/?search=delhivery&page=5","http://www.consumercomplaints.in/?search=delhivery&page=6","http://www.consumercomplaints.in/?search=delhivery&page=7","http://www.consumercomplaints.in/?search=delhivery&page=8","http://www.consumercomplaints.in/?search=delhivery&page=9","http://www.consumercomplaints.in/?search=delhivery&page=10","http://www.consumercomplaints.in/?search=delhivery&page=11"]
    start_urls = ["http://www.consumercomplaints.in/?search=delhivery"]
    rules = (Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="pagelinks"]/a/@href',)),
                  callback="parse_gen", follow=True),
             )

    def parse_gen(self, response):
        hxs = Selector(response)
        sites = hxs.select('//table[@width="100%"]')
        items = []
        for site in sites:
            item = DelhItem()
            item['title'] = site.select('.//td[@class="complaint"]/a/span/text()').extract()
            item['content'] = site.select('.//td[@class="compl-text"]/div/text()').extract()
            items.append(item)
        return items

spider = criticspider()
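As far as I understand, you are trying to scrape two kinds of pages (list pages and item pages), so you should use two different rules. Your rules should look something like this: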
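rules = (
    # Item links: parse each item page with parse_gen.
    Rule(LinkExtractor(restrict_xpaths='{{ item selector }}'), callback='parse_gen'),
    # Pagination: restrict_xpaths selects the region containing links (the <a> element), not its @href attribute.
    Rule(LinkExtractor(restrict_xpaths='//div[@class="pagelinks"]/a[contains(text(), "Next")]')),
)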
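Explanation:
The first rule matches the item links and uses your item parsing method (parse_gen) as the callback. The resulting responses do not go through these rules again.
The second rule matches the "pagelinks" and specifies no callback, so the resulting responses are followed and handled by these rules again.
Note:
SgmlLinkExtractor is obsolete; you should use LxmlLinkExtractor (or its alias LinkExtractor) instead (source).
The [contains(text(), "Next")] selector was added to the "pagelinks" rule. This way, each "list page" is requested exactly once.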
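To see what the "Next" filter changes, here is a minimal sketch that uses only Python's standard library instead of Scrapy. The markup in SNIPPET and the helper pagination_links are hypothetical and made up for illustration; the real page structure may differ. It only shows which links a pagination rule would pick up with and without the text filter.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination markup (an assumption, not copied from the real site).
SNIPPET = """
<div class="pagelinks">
  <a href="/?search=delhivery&amp;page=2">2</a>
  <a href="/?search=delhivery&amp;page=3">3</a>
  <a href="/?search=delhivery&amp;page=2">Next</a>
</div>
""".strip()


def pagination_links(div, next_only):
    """Return the hrefs of the pagination anchors, optionally only the "Next" one."""
    hrefs = []
    for anchor in div.findall("a"):
        # ElementTree has no XPath contains(), so the text check is done in Python.
        if next_only and "Next" not in (anchor.text or ""):
            continue
        hrefs.append(anchor.get("href"))
    return hrefs


root = ET.fromstring(SNIPPET)
all_links = pagination_links(root, next_only=False)
next_links = pagination_links(root, next_only=True)

print("all pagination links:", all_links)
print("Next link only:", next_links)
```

Without the filter, every list page would yield several links to other list pages, including a duplicate of the "Next" target. With the filter, each list page yields a single link to its successor.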