XPath with Scrapy: I get everything a hundred times
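I'm using Scrapy 1.2 with XPath (and Python 3.4, of course) to read the Hot 100 chart on billboard.com. When I use the second option in my code, I get all 100 titles for every song. I know this is because of the double slash (//), but I can't get the first option to work. How can I make sure each song gets only its own title?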
class MusicalSpider(scrapy.Spider):
    name = "musicalspider"
    allowed_domains = ["billboard.com"]
    start_urls = ['http://www.billboard.com/charts/hot-100/']

    def parse(self, response):
        songs = response.xpath('//div[@class="chart-data js-chart-data"]/div[@class="container"]/article')
        for song in songs:
            item = MusicItem()
            # first option:
            item['title'] = song.xpath('div[@class="chart-row__primary"]/div[@class="chart-row__main-display"]/div[@class="chart-row__container"]/div[@class="chart-row__title"]/h2[@class="chart-row__song"]').extract()
            # second option:
            item['title'] = song.xpath('//h2[@class="chart-row__song"]').extract()
            yield item
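This is a very common gotcha. An expression that starts with // searches the whole document, no matter which selector you call it on. Remember to start the XPath expressions inside the loop with a dot; that makes them relative to the context node (here, each song):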
for song in songs:
    item = MusicItem()
    # first option:
    item['title'] = song.xpath('.//div[@class="chart-row__primary"]/div[@class="chart-row__main-display"]/div[@class="chart-row__container"]/div[@class="chart-row__title"]/h2[@class="chart-row__song"]').extract()
    # second option:
    item['title'] = song.xpath('.//h2[@class="chart-row__song"]').extract()
    yield item
For more information, see the Scrapy documentation on working with relative XPaths.
Here is the spider that worked for me:
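import scrapy


# MusicItem was not shown in the original post; a minimal definition is assumed here.
class MusicItem(scrapy.Item):
    title = scrapy.Field()


class MusicalSpider(scrapy.Spider):
    name = "musicalspider"
    allowed_domains = ["billboard.com"]
    start_urls = ['http://www.billboard.com/charts/hot-100/']

    def parse(self, response):
        # One <article> element per chart row.
        songs = response.xpath('//div[@class="chart-data js-chart-data"]/div[@class="container"]/article')
        for song in songs:
            item = MusicItem()
            # The leading dot keeps the search inside this song; text() drops the <h2> tags.
            item['title'] = song.xpath('.//h2[@class="chart-row__song"]/text()').extract_first()
            yield item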
It yields the following items:
{'title': u'Black Beatles'}
{'title': u'Closer'}
...
{'title': u'Hold Up'}
{'title': u'Gangsta'}