Scrapy Spider returning only last element when given a list of Selectors

I've run into an issue with a spider I've put together. I am trying to scrape the individual lines of text, along with their corresponding timestamps, from the transcript on this site. I have found what I believe are the appropriate selectors, but when I run the spider, its output is just the last line and timestamp. I've seen a couple of other people with similar issues, but haven't yet found an answer that solves my problem.

Here is the spider:

# -*- coding: utf-8 -*-
import scrapy
from this_american_life.items import TalTranscriptItem

class CrawlSpider(scrapy.Spider):
    name = "transcript2"
    allowed_domains = ["https://www.thisamericanlife.org/radio-archives/episode/1/transcript"]
    start_urls = (
        'https://www.thisamericanlife.org/radio-archives/episode/1/transcript',
    )

    def parse(self, response):
        item = TalTranscriptItem()
        for line in response.xpath('//p'):
            item['begin_timestamp'] = line.xpath('//@begin').extract()
            item['line_text'] = line.xpath('//text()').extract()
        yield item

And here is the code for TalTranscriptItem() in items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TalTranscriptItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    episode_id = scrapy.Field()
    episode_num_text = scrapy.Field()
    year = scrapy.Field()
    radio_date_text = scrapy.Field()
    radio_date_datetime = scrapy.Field()
    episode_title = scrapy.Field()
    episode_hosts = scrapy.Field()
    act_id = scrapy.Field()
    line_id = scrapy.Field()
    begin_timestamp = scrapy.Field()
    speaker_class = scrapy.Field()
    speaker_name = scrapy.Field()
    line_text = scrapy.Field()
    full_audio_link = scrapy.Field()
    transcript_url = scrapy.Field()

When I run the selectors in the scrapy shell, they appear to work correctly (returning all of the lines of text), but for some reason I haven't been able to get the same result from the spider.

I'm happy to clarify any of these issues, and would greatly appreciate any help anyone can offer!

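If you want each individual line yielded as its own item, create the item and yield it inside the loop (notice the indentation of the yield line). The ./ prefix also keeps each XPath relative to the current p element, whereas a leading // searches the whole document no matter which selector it is called on: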

for line in response.css('p'):
    item = TalTranscriptItem()
    item['begin_timestamp'] = line.xpath('./@begin').extract_first()
    item['line_text'] = line.xpath('./text()').extract_first()
    yield item
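
The effect of the yield placement can be reproduced without Scrapy. The following is a minimal sketch in plain Python, assuming made-up sample data (SAMPLE_LINES) in place of the real p selectors and plain dicts in place of TalTranscriptItem; the function names parse_buggy and parse_fixed are hypothetical and only illustrate the two loop shapes.

```python
# Plain-Python sketch of the yield placement; no Scrapy required.
# SAMPLE_LINES is invented data standing in for the <p> selectors.
SAMPLE_LINES = [
    {"begin": "00:00:00.17", "text": "first line"},
    {"begin": "00:00:05.40", "text": "second line"},
    {"begin": "00:00:09.02", "text": "third line"},
]


def parse_buggy(lines):
    """Same shape as the question: one shared item, yield after the loop."""
    item = {}
    for line in lines:
        item["begin_timestamp"] = line["begin"]
        item["line_text"] = line["text"]
    # Runs once, after every iteration has overwritten the same item.
    yield item


def parse_fixed(lines):
    """Same shape as the answer: a fresh item per line, yielded in the loop."""
    for line in lines:
        item = {}
        item["begin_timestamp"] = line["begin"]
        item["line_text"] = line["text"]
        yield item


buggy_items = list(parse_buggy(SAMPLE_LINES))
fixed_items = list(parse_fixed(SAMPLE_LINES))
print(len(buggy_items), len(fixed_items))
```

With this sample data the buggy version produces a single item holding only the values from the final iteration, while the fixed version produces one item per line.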

I don't know what item is, but you could also collect one plain dict per line in a list:

item = []

for line in response.xpath('//p'):
   # './' keeps each lookup relative to the current <p>; '//' would search the whole document
   dictItem = {'begin_timestamp': line.xpath('./@begin').extract(), 'line_text': line.xpath('./text()').extract()}
   item.append(dictItem)

print(item)
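
The difference between a document-wide lookup and one relative to the current element can be shown with a small stand-alone analogy. This sketch uses the standard library's xml.etree.ElementTree rather than Scrapy selectors, and the SAMPLE markup is invented for illustration, so treat it as a picture of the idea rather than of Scrapy's own behavior.

```python
import xml.etree.ElementTree as ET

# Invented markup loosely shaped like a transcript; not the real page.
SAMPLE = """<div>
  <p begin="00:00:00.17">first line</p>
  <p begin="00:00:05.40">second line</p>
</div>"""

root = ET.fromstring(SAMPLE)

document_wide = []  # analogous to calling xpath('//@begin') in every iteration
relative = []       # analogous to calling xpath('./@begin') on the current <p>

for p in root.iter("p"):
    # Searching from the root ignores which <p> the loop is currently on.
    document_wide.append([q.get("begin") for q in root.iter("p")])
    # Reading from the current element returns only that line's timestamp.
    relative.append(p.get("begin"))

print(document_wide)
print(relative)
```

Each entry of document_wide repeats every timestamp in the sample, whereas relative holds exactly one timestamp per line.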
