
Passing xPath translate function in Scrapy not working for special characters

I'm building a Scrapy spider which takes the xpath query as an input parameter.

The specific page I'm trying to scrape has line feeds, new lines and other characters within the price text field, and I'm using the translate() function to remove them.

The selector works fine with translate() when it is included explicitly in the code, but the translate does not work when passed as a parameter.

Here is my spider:

import scrapy
from spotlite.items import SpotliteItem


class GenericSpider(scrapy.Spider):
    name = "generic"
    xpath_string = ""

    def __init__(self, start_url, allowed_domains, xpath_string, *args, **kwargs):
        super(GenericSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['%s' % start_url]
        self.allowed_domains = ['%s' % allowed_domains]
        self.xpath_string = xpath_string

    def parse(self, response):
        self.logger.info('URL is %s', response.url)
        self.logger.info('xPath is %s', self.xpath_string)
        item = SpotliteItem()
        item['url'] = response.url
        item['price'] = response.xpath(self.xpath_string).extract()
        return item

And I use the following command to run the spider:

scrapy crawl generic -a start_url=https://www.danmurphys.com.au/product/DM_4034/penfolds-kalimna-bin-28-shiraz -a allowed_domains=danmurphys.com.au -a "xpath_string=translate((//span[@class='price'])[1]/text(),',$\r\n\t','')"

The issue seems to be passing special characters in the argument, i.e. \r\n\t.

The '$' character is correctly removed, but the \r\n\t characters are not, as per the output below.

{'price': [u'\r\n\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t27.50\r\n\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t'],
 'url': 'https://www.danmurphys.com.au/product/DM_4034/penfolds-kalimna-bin-28-shiraz.jsp;jsessionid=B0211294F13A980CA41261379CD83541.ncdlmorasp1301?bmUID=loERXI6'}
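
A plausible reading of this (my interpretation, not stated in the original post): inside the shell argument, \r\n\t is passed through as literal backslash-letter pairs, so translate() is asked to remove the characters '\', 'r', 'n' and 't' rather than actual carriage returns, newlines and tabs. The following standalone sketch, using parsel (the selector library Scrapy is built on) and a made-up price snippet, illustrates the difference:

from parsel import Selector

# Hypothetical markup mimicking the price field on the page being scraped.
html = "<span class='price'>\n\t\t$27.50\n\t</span>"
sel = Selector(text=html)

# With real control characters in the second argument, translate() strips them:
print(sel.xpath("translate((//span[@class='price'])[1]/text(), ',$\n\t', '')").extract_first())
# -> '27.50'

# With literal backslash sequences (what the shell actually delivers), only the
# ',' and '$' characters match; the real newlines and tabs survive:
print(sel.xpath(r"translate((//span[@class='price'])[1]/text(), ',$\r\n\t', '')").extract_first())
# -> '\n\t\t27.50\n\t'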

Any assistance or advice will be appreciated!

Thanks,

Michael

Try using the normalize-space() XPath function in your selector:

scrapy crawl generic -a start_url=<URL> -a \
    allowed_domains=danmurphys.com.au \
    -a "xpath_string=normalize-space(//span[@class='price'][1]/text())"

In your parse method, you can use extract_first() to get the price as a single string object instead of a list:

item['price'] = response.xpath(self.xpath_string).extract_first()

You could also use the re_first() method to remove the $ sign from the string:

item['price'] = response.xpath(self.xpath_string).re_first(r"\$(.+)")
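
For illustration, here is a small standalone sketch of how these two suggestions combine, again using parsel with a made-up snippet (the markup and values are assumptions, not taken from the real page):

from parsel import Selector

# Hypothetical price markup with surrounding whitespace.
html = "<span class='price'>\n\t\t\t$27.50\n\t\t</span>"
sel = Selector(text=html)

# normalize-space() collapses the surrounding whitespace into a clean string...
query = "normalize-space(//span[@class='price'][1]/text())"
print(sel.xpath(query).extract_first())      # '$27.50'

# ...and re_first() then drops the leading dollar sign.
print(sel.xpath(query).re_first(r"\$(.+)"))  # '27.50'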
