Passing an XPath translate() function in Scrapy does not work with special characters

Question

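I am building a Scrapy spider that takes an XPath query as an input argument.

The page I want to scrape has carriage returns, line feeds and other characters in the price text field, and I am using the translate() function to remove them.

The selector works fine with translate() when it is written explicitly in the code, but translate() does not work when the selector is passed in as an argument.

Here is my spider: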

import scrapy
from spotlite.items import SpotliteItem


class GenericSpider(scrapy.Spider):
    name = "generic"
    xpath_string = ""

    def __init__(self, start_url, allowed_domains, xpath_string, *args, **kwargs):
        super(GenericSpider, self).__init__(*args, **kwargs)
        # Values passed with -a on the command line arrive as plain strings.
        self.start_urls = [start_url]
        self.allowed_domains = [allowed_domains]
        self.xpath_string = xpath_string

    def parse(self, response):
        self.logger.info('URL is %s', response.url)
        self.logger.info('xPath is %s', self.xpath_string)
        item = SpotliteItem()
        item['url'] = response.url
        # Evaluate the XPath expression supplied on the command line.
        item['price'] = response.xpath(self.xpath_string).extract()
        return item

I call the spider with the following command:

scrapy crawl generic -a start_url=https://www.danmurphys.com.au/product/DM_4034/penfolds-kalimna-bin-28-shiraz -a allowed_domains=danmurphys.com.au -a "xpath_string=translate((//span[@class='price'])[1]/text(),',$\r\n\t','')"

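The problem seems to be the special characters passed in the argument, i.e. \r\n\t.

The '$' character is removed correctly, but the \r\n\t characters are not, as the output below shows.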

{'price': [u'\r\n\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t27.50\r\n\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t'],
 'url': 'https://www.danmurphys.com.au/product/DM_4034/penfolds-kalimna-bin-28-shiraz.jsp;jsessionid=B0211294F13A980CA41261379CD83541.ncdlmorasp1301?bmUID=loERXI6'}

Any help or advice would be greatly appreciated!

Thanks,

Michael

Answer 1

Try using the normalize-space() XPath function in your selector:

scrapy crawl generic -a start_url=<URL> -a \
    allowed_domains=danmurphys.com.au \
    -a "xpath_string=normalize-space(//span[@class='price'][1]/text())"
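
For reference, normalize-space() strips leading and trailing whitespace and collapses each internal run of whitespace into a single space. A rough pure-Python analogue is sketched below for illustration only: the helper name normalize_space and the sample value are made up for this example, and str.split() treats a few more characters as whitespace than XPath does.

```python
def normalize_space(text):
    """Rough Python analogue of XPath's normalize-space() (illustrative helper)."""
    # split() with no arguments drops leading/trailing whitespace and splits
    # on runs of whitespace; joining with ' ' collapses each run to one space.
    return " ".join(text.split())


# Sample value shaped like the price text in the question (assumed, not scraped).
raw = "\r\n\t\t\r\n\t\t\t$27.50\r\n\t\t\r\n\t"
print(normalize_space(raw))  # $27.50
```

Note that the '$' sign is still present after normalization, so it has to be removed separately.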

In your parse method, you can use extract_first() to get the price as a single string object rather than a list:

item['price'] = response.xpath(self.xpath_string).extract_first()

You can also use the re_first() method to strip the $ sign from the string:

item['price'] = response.xpath(self.xpath_string).re_first(r"\$(.+)")
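
As for why the original translate() call removed '$' but not the whitespace: inside double quotes the shell passes \r\n\t through as literal backslash-letter pairs, and XPath 1.0 string literals have no escape sequences, so translate() is asked to remove the characters '\', 'r', 'n' and 't' rather than carriage return, line feed and tab. When the same expression is written in Python source, Python converts the escapes to real control characters before XPath ever sees them. If you would rather keep translate(), one option is to decode the escapes yourself before using the argument. This is a minimal sketch, assuming the argument contains no backslashes other than intended escapes; decode_escapes is a hypothetical helper, not part of Scrapy.

```python
import codecs


def decode_escapes(xpath_arg):
    """Turn literal backslash escapes such as \\r, \\n, \\t into real characters."""
    # Hypothetical helper: the 'unicode_escape' codec interprets Python-style
    # escapes. Encoding with backslashreplace first keeps non-Latin-1 text intact.
    return codecs.decode(
        xpath_arg.encode("latin-1", "backslashreplace"), "unicode_escape"
    )


# What the spider receives from the command line: the backslashes are literal.
arg = r"translate((//span[@class='price'])[1]/text(),',$\r\n\t','')"
decoded = decode_escapes(arg)
print(repr(decoded))
```

In the spider this would amount to assigning self.xpath_string = decode_escapes(xpath_string) in __init__.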
