Why is Scrapy not iterating over all the links on the page even though the XPaths are correct?
This code runs without errors when I use extract()[0] or extract(), but it only gives me output for the first link it parsed. I can't understand why, because the same code worked perfectly well when I crawled other websites.

With this website it scrapes only the first link. If I change the index to extract()[1], it gives me the second link, and so on. Why doesn't the for loop iterate over all of them automatically?
import scrapy
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin


class CompItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    data = scrapy.Field()
    name = scrapy.Field()
    date = scrapy.Field()


class criticspider(BaseSpider):
    name = "mmt_mouth"
    allowed_domains = ["mouthshut.com"]
    start_urls = ["http://www.mouthshut.com/websites/makemytripcom-reviews-925031929"]
    # rules = (
    #     Rule(
    #         SgmlLinkExtractor(allow=("search=make-my-trip&page=1/+",)),
    #         callback="parse",
    #         follow=True),
    # )

    def parse(self, response):
        sites = response.xpath('//div[@id="allreviews"]')
        items = []

        for site in sites:
            item = CompItem()
            item['name'] = site.xpath('.//li[@class="profile"]/div/a/span/text()').extract()[0]
            item['title'] = site.xpath('.//div[@class="reviewtitle fl"]/strong/a/text()').extract()[0]
            item['date'] = site.xpath('.//div[@class="reviewrate"]//span[@class="datetime"]/span/span/span/text()').extract()[0]
            item['link'] = site.xpath('.//div[@class="reviewtitle fl"]/strong/a/@href').extract()[0]
            if item['link']:
                if 'http://' not in item['link']:
                    item['link'] = urljoin(response.url, item['link'])
                yield scrapy.Request(item['link'],
                                     meta={'item': item},
                                     callback=self.anchor_page)

            items.append(item)

    def anchor_page(self, response):
        old_item = response.request.meta['item']
        old_item['data'] = response.xpath('.//div[@itemprop="description"]/p/text()').extract()
        yield old_item
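Because on this website your for loop has only one element to iterate over: //div[@id="allreviews"] selects the single container div rather than the individual reviews, so the loop body runs once and extract()[0] returns only the first match inside that container. Change your statement

    sites = response.xpath('//div[@id="allreviews"]')

to

    sites = response.xpath('//div[@id="allreviews"]/ul/li')

Then your for loop can loop over the list elements, one review per iteration.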
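
To see the difference between the two selectors without running a crawl, here is a minimal sketch. It makes two assumptions: the HTML fragment is made up and only imitates the container/list structure of the review page, and the standard library's xml.etree.ElementTree (which supports a limited XPath subset) stands in for Scrapy's selectors. The helper first_links is hypothetical and simply mimics the spider's "take [0] per loop iteration" pattern.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the real review page.
PAGE = """
<html><body>
  <div id="allreviews">
    <ul>
      <li><a href="/review/1">First review</a></li>
      <li><a href="/review/2">Second review</a></li>
      <li><a href="/review/3">Third review</a></li>
    </ul>
  </div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)


def first_links(nodes):
    """Mimic the spider's loop: keep only the first link found under each node."""
    links = []
    for node in nodes:
        anchors = node.findall(".//a")
        if anchors:
            links.append(anchors[0].get("href"))
    return links


# Selecting the container yields one node, so the loop runs once.
containers = root.findall(".//div[@id='allreviews']")
# Selecting the list items yields one node per review.
reviews = root.findall(".//div[@id='allreviews']/ul/li")

container_links = first_links(containers)
review_links = first_links(reviews)

print(len(containers), container_links)
print(len(reviews), review_links)
```

With the container selector the loop sees a single node and therefore produces a single link; with the /ul/li selector it sees one node per review. The same reasoning should carry over to response.xpath(...) in the spider, provided the real page nests its reviews the way the answer describes.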