Scrapy Spider returning only last element when given a list of Selectors
I've been running into an issue with a spider I've put together. I am trying to scrape individual lines of text, along with their corresponding timestamps, from the transcript on this site, and have found what I believe are the appropriate selectors, but when run, the spider's output is just the last line and timestamp. I've seen a couple of others with similar issues, but haven't yet found an answer that solves my problem.
Here is the spider:
# -*- coding: utf-8 -*-
import scrapy
from this_american_life.items import TalTranscriptItem


class CrawlSpider(scrapy.Spider):
    name = "transcript2"
    allowed_domains = ["https://www.thisamericanlife.org/radio-archives/episode/1/transcript"]
    start_urls = (
        'https://www.thisamericanlife.org/radio-archives/episode/1/transcript',
    )

    def parse(self, response):
        item = TalTranscriptItem()
        for line in response.xpath('//p'):
            item['begin_timestamp'] = line.xpath('//@begin').extract()
            item['line_text'] = line.xpath('//text()').extract()
        yield item
And here is the code for TalTranscriptItem() in items.py:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class TalTranscriptItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    episode_id = scrapy.Field()
    episode_num_text = scrapy.Field()
    year = scrapy.Field()
    radio_date_text = scrapy.Field()
    radio_date_datetime = scrapy.Field()
    episode_title = scrapy.Field()
    episode_hosts = scrapy.Field()
    act_id = scrapy.Field()
    line_id = scrapy.Field()
    begin_timestamp = scrapy.Field()
    speaker_class = scrapy.Field()
    speaker_name = scrapy.Field()
    line_text = scrapy.Field()
    full_audio_link = scrapy.Field()
    transcript_url = scrapy.Field()
When run in the scrapy shell, it appears to work correctly (returning all of the lines of text), but for some reason I haven't been able to get it to work in the spider.
I'm happy to clarify any of these issues, and would greatly appreciate any help anyone can offer!
If you want each individual line yielded as an item, I think this is what you want (notice the indentation of the final yield line):
for line in response.css('p'):
    item = TalTranscriptItem()
    item['begin_timestamp'] = line.xpath('./@begin').extract_first()
    item['line_text'] = line.xpath('./text()').extract_first()
    yield item
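To see why the position of yield matters, here is a minimal sketch in plain Python, with no Scrapy involved. The function names and the sample rows are made up for illustration; they only stand in for a parse() method and the transcript's paragraphs. It assumes the original spider had its yield after the loop, which would explain the single-item output.

```python
def yield_outside_loop(rows):
    """Mimics a parse() whose yield sits after the for loop."""
    item = {}
    for begin, text in rows:
        # The same dict is overwritten on every pass.
        item["begin_timestamp"] = begin
        item["line_text"] = text
    yield item  # runs once, after the loop: only the last row survives


def yield_inside_loop(rows):
    """Mimics a parse() that builds and yields one item per row."""
    for begin, text in rows:
        # A fresh dict per iteration, yielded before moving on.
        item = {"begin_timestamp": begin, "line_text": text}
        yield item


# Hypothetical sample data standing in for the transcript's <p> elements.
rows = [("00:00:00.17", "first line"), ("00:00:05.40", "second line")]

outside = list(yield_outside_loop(rows))
inside = list(yield_inside_loop(rows))

print(len(outside), [i["line_text"] for i in outside])
print(len(inside), [i["line_text"] for i in inside])
```

Creating the item inside the loop also means each yielded object is independent, so later iterations cannot overwrite fields of an item that has already been handed off.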
I don't know what item is, but you can do:
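item = []
for line in response.xpath('//p'):
    # Relative paths ('./') keep each lookup inside the current <p>;
    # '//' would search the whole document on every iteration.
    dictItem = {
        'begin_timestamp': line.xpath('./@begin').extract(),
        'line_text': line.xpath('./text()').extract(),
    }
    item.append(dictItem)
print(item)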