Strange XPath results in Scrapy shell

Question

I'm trying to select an item on this page:

http://www.betterware.co.uk/catalog/product/view/id/4530/category/342/

using variations of XPath such as:

sel.xpath('//div[@class="price-box"]/span[@class="regular-price"]/span[@class="price"]/text()').extract()

The HTML source I'm looking at is:

<div class="price-box">
    <span class="regular-price" id="product-price-4530">
        <span class="price">£12.99</span>
    </span>
</div>

Rather than getting the expected [u'£12.99'], I get a bunch of other numbers that don't even appear in the page source. Scrapy shell gives:

[u'\xa312.99',
 u'\xa38.99',
 u'\xa38.99',
 u'\xa34.49',
 u'\xa34.49',
 u'\xa329.99',
 u'\xa329.99']

I've had no trouble selecting other items in this manner, but this and all my other price fields show these mysterious results for the price text. Can someone please shed some light on this? My Python code for the item selection is:

def parse_again(self, response):
    sel = Selector(response)
    meta = sel.xpath('//div[@class="product-main-info"]')
    items = []
    for m in meta:
        item = BetterItem()
        item['link'] = response.url
        item['item_name'] = m.select('//div[@class="product-name"]/h1/text()').extract()
        item['sku'] = m.select('//p[@class="product-ids"]/text()').extract()
        item['price'] = m.select('//div[@class="price-box"]/span/span/text()').extract()
        items.append(item)
    return items

Answer 1

There is nothing wrong with the result returned by Scrapy. u'\xa3' is the pound sign:

In [99]: import unicodedata as UD

In [100]: UD.name(u'\xa3')
Out[100]: 'POUND SIGN'

In [101]: print(u'\xa3')
£

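u'\xa312.99' is the pound sign u'\xa3' followed by the unicode string u'12.99'. The shell displays the list using the repr of each string, which in Python 2 escapes non-ASCII characters; printing a string shows the £ itself.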
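
As a rough check of the point above, the sketch below compares the escaped and literal spellings of the same string. It assumes Python 3, where the plain repr no longer escapes £, so the built-in ascii() is used here to reproduce the escaped form that the Python 2 shell showed. The variable names are only illustrative.

```python
import unicodedata

price = u'\xa312.99'            # same value the Scrapy shell displayed

# The escape and the literal character are the same code point, U+00A3.
same_string = (price == u'£12.99')
first_char_name = unicodedata.name(price[0])

# ascii() escapes non-ASCII characters, much like Python 2's repr did.
escaped_form = ascii(price)

print(same_string)       # True
print(first_char_name)   # POUND SIGN
print(escaped_form)      # '\xa312.99'
```

So the extracted data already is £12.99; only the way the list is displayed differs.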

If you wish to strip the pound signs from the list, you could do this:

In [108]: data = [u'\xa312.99',
 u'\xa38.99',
 u'\xa38.99',
 u'\xa34.49',
 u'\xa34.49',
 u'\xa329.99',
 u'\xa329.99']

In [110]: [float(item.lstrip(u'\xa3')) for item in data]
Out[110]: [12.99, 8.99, 8.99, 4.49, 4.49, 29.99, 29.99]
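
If the prices will be used in arithmetic, a variation on the same idea is to convert to Decimal instead of float, which keeps the decimal digits exact. The following is only a sketch: parse_price is a hypothetical helper, and the third sample value (with surrounding whitespace and a thousands separator) is invented for illustration and does not come from the page.

```python
from decimal import Decimal


def parse_price(text, symbol=u'\xa3'):
    """Turn a scraped price string into a Decimal (hypothetical helper)."""
    # Drop surrounding whitespace, the leading currency symbol,
    # and any thousands separators before converting.
    cleaned = text.strip().lstrip(symbol).replace(u',', u'')
    return Decimal(cleaned)


# First two values are from the shell output; the third is made up.
data = [u'\xa312.99', u'\xa38.99', u' \xa31,299.00 ']
prices = [parse_price(item) for item in data]
print(prices)
```

Decimal raises decimal.InvalidOperation on text it cannot parse, so a field that is empty or not a number fails loudly rather than silently producing a wrong price.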

The following articles are "must-reads" for anyone dealing with Unicode:

The Absolute Minimum Every Software Developer Must Know About Unicode

and particularly for a Python-centric point of view:

Unicode HOWTO
Pragmatic Unicode

Answer 1
1 Accepted 2013-12-20 23:20:02