Scrapy response.xpath only selects the first item

Question

I have the following HTML structure:

  <div class="column first">
    <div class="detail">
      <strong>Phone: </strong>
      <span class="value"> 012-345-6789</span>
    </div>
    <div class="detail">
      <span class="value">1 Street Address, Big Road, City, Country</span>
    </div>
    <div class="detail">
      <h3 class="inline">Area:</h3>
      <span class="value">Georgetown</span>
    </div>
    <div class="detail">
      <h3 class="inline">Nearest Train:</h3>
      <span class="value">Georgetown Station</span>
    </div>
    <div class="detail">
      <h3 class="inline">Website:</h3>
      <span class="value"><a href='http://www.website.com' target='_blank'>www.website.com</a></span>
    </div>
  </div>

When I run sel = response.xpath('//span[@class="value"]/text()') in the scrapy shell, I get the result I expect:

[<Selector xpath='//span[@class="value"]/text()' data=u' 012-345-6789'>, <Selector xpath='//span[@class="value"]/text()' data=u'1 Street Address, Big Road, City, Country'>, <Selector xpath='//span[@class="value"]/text()' data=u'Georgetown Station'>, <Selector xpath='//span[@class="value"]/text()' data=u' '>, <Selector xpath='//span[@class="value"]/text()' data=u'January, 2016'>]

However, in the parse block of my spider, it returns only the first item:

def parse(self, response):
    def extract_with_xpath(query):
        return response.xpath(query).extract_first().strip()

    yield {
        'details': extract_with_xpath('//span[@class="value"]/text()')
    }

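I know I am using extract_first(), and I know extract() is a valid function, but the code breaks if I use extract() instead.

What am I doing wrong? Do I need to loop over the extract_with_xpath('//span[@class="value"]/text()') part?

Thanks in advance!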

Answer 1

In items.py, define:

from scrapy.item import Item, Field

class yourProjectNameItem(Item):
    # define the fields for your item here like:
    name = Field()
    details = Field()

In your spider, add these imports:

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from yourProjectName.items import yourProjectNameItem
import re

The parse function looks like this:

def parse_item(self, response):
    hxs = HtmlXPathSelector(response)
    i = yourProjectNameItem()

    i['name'] = hxs.select('YourXPathHere').extract() 
    i['details'] = hxs.select('YourXPathHere').extract()

    return i

Hope this solves the problem. You can refer to my project on GitHub: https://github.com/omkar-dsd/SRMSE/tree/master/Scrapers/NasaScraper
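For the code in the question specifically, the breakage with extract() is most likely because extract() returns a list of strings, and a list has no .strip() method, whereas extract_first() returns a single string. A minimal sketch of one way around that is below. The helper names clean_values and extract_all_with_xpath are hypothetical, not part of Scrapy, and parse is written as a plain function here rather than a spider method so the snippet stands on its own:

```python
def clean_values(raw_values):
    """Strip each extracted string and drop whitespace-only entries."""
    stripped = (value.strip() for value in raw_values)
    return [value for value in stripped if value]


def extract_all_with_xpath(response, query):
    """Return every match for `query`, cleaned, instead of only the first."""
    # .extract() returns a list of strings, so .strip() has to be applied
    # to each item; calling .strip() on the list raises AttributeError.
    return clean_values(response.xpath(query).extract())


def parse(response):
    # In a real spider this would be the `parse(self, response)` method.
    yield {
        'details': extract_all_with_xpath(
            response, '//span[@class="value"]/text()'
        )
    }
```

With this, 'details' holds a list of all matching values rather than a single string. Dropping whitespace-only entries is a design choice for this sketch; remove that filter if the empty values matter.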

Solution 1
0 2016-10-01 05:30:25
