I need my Scrapy spider to move on to the next page. How do I write the rule for that correctly?
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from delh.items import DelhItem

class criticspider(CrawlSpider):
    name = "delh"
    allowed_domains = ["consumercomplaints.in"]
    #start_urls = ["http://www.consumercomplaints.in/?search=delhivery&page=2","http://www.consumercomplaints.in/?search=delhivery&page=3","http://www.consumercomplaints.in/?search=delhivery&page=4","http://www.consumercomplaints.in/?search=delhivery&page=5","http://www.consumercomplaints.in/?search=delhivery&page=6","http://www.consumercomplaints.in/?search=delhivery&page=7","http://www.consumercomplaints.in/?search=delhivery&page=8","http://www.consumercomplaints.in/?search=delhivery&page=9","http://www.consumercomplaints.in/?search=delhivery&page=10","http://www.consumercomplaints.in/?search=delhivery&page=11"]
    start_urls = ["http://www.consumercomplaints.in/?search=delhivery"]

    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="pagelinks"]/a/@href',)),
             callback="parse_gen", follow=True),
    )

    def parse_gen(self, response):
        hxs = Selector(response)
        sites = hxs.select('//table[@width="100%"]')
        items = []
        for site in sites:
            item = DelhItem()
            item['title'] = site.select('.//td[@class="complaint"]/a/span/text()').extract()
            item['content'] = site.select('.//td[@class="compl-text"]/div/text()').extract()
            items.append(item)
        return items

spider = criticspider()
From my understanding, you are trying to scrape two sorts of pages, so you should use two distinct rules.
Your rules should then look something like this:
rules = (
    Rule(LinkExtractor(restrict_xpaths='{{ item selector }}'), callback='parse_gen'),
    Rule(LinkExtractor(restrict_xpaths='//div[@class="pagelinks"]/a[contains(text(), "Next")]')),
)

Note that restrict_xpaths must select elements (the <a> tags themselves, or a region containing them), not the @href attribute, which is why the trailing /@href from your rule is gone.
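You can sanity-check that pagination XPath outside Scrapy with lxml directly. A small sketch, where the pager HTML is a hypothetical stand-in for the site's real markup:

```python
# Quick sanity check of the "Next"-link XPath outside Scrapy,
# using lxml directly. The pager HTML here is a hypothetical
# stand-in for the site's real markup.
from lxml import html

pager = html.fromstring("""
<div class="pagelinks">
  <a href="/?page=1">1</a>
  <a href="/?page=2">2</a>
  <a href="/?page=2">Next</a>
</div>
""")

# Restricting to the anchor whose text contains "Next" matches
# exactly one link per page, so no page gets requested twice.
next_links = pager.xpath('//div[@class="pagelinks"]/a[contains(text(), "Next")]')
print([a.get("href") for a in next_links])  # -> ['/?page=2']
```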
Explanations:

- The first rule requests the item pages and uses your parsing method (parse_gen) as callback. The resulting responses do not go through these rules again.
- The second rule requests the pagination links and has no callback, so its responses do go through the rules again; that is what moves the crawl from one list page to the next.
- Notice that SgmlLinkExtractor is obsolete and you should use LxmlLinkExtractor (or its alias LinkExtractor) instead ( source ).
- Notice the [contains(text(), "Next")] selector added to the "pagelinks" rule. This way each "list page" gets requested exactly once.