Strange XPath results in Scrapy shell

I'm trying to select an item on this page:

http://www.betterware.co.uk/catalog/product/view/id/4530/category/342/

using variations of XPath such as:

sel.xpath('//div[@class="price-box"]/span[@class="regular-price"]/span[@class="price"]/text()').extract()

The HTML source I'm looking at is:

<div class="price-box">
    <span class="regular-price" id="product-price-4530">
        <span class="price">£12.99</span>
    </span>
</div>

Rather than getting the expected [u'£12.99'], I get a list of other values that don't even appear in the page source. Scrapy shell gives:

[u'\xa312.99',
 u'\xa38.99',
 u'\xa38.99',
 u'\xa34.49',
 u'\xa34.49',
 u'\xa329.99',
 u'\xa329.99']

I've had no trouble selecting other items in this manner, but this and all my other price fields show these mysterious results for the price text. Can someone please shed some light on this? My Python code for the item selection is:

def parse_again(self, response):
    sel = Selector(response)
    meta = sel.xpath('//div[@class="product-main-info"]')
    items = []
    for m in meta:
        item = BetterItem()
        item['link'] = response.url
        item['item_name'] = m.select('//div[@class="product-name"]/h1/text()').extract()
        item['sku'] = m.select('//p[@class="product-ids"]/text()').extract()
        item['price'] = m.select('//div[@class="price-box"]/span/span/text()').extract()
        items.append(item)
    return items

There is nothing wrong with the result being returned by Scrapy. u'\xa3' is the pound sign:

In [99]: import unicodedata as UD

In [100]: UD.name(u'\xa3')
Out[100]: 'POUND SIGN'

In [101]: print(u'\xa3')
£

u'\xa312.99' is the pound sign u'\xa3' followed by the unicode string u'12.99'. The list display shows each string's repr, which escapes non-ASCII characters; printing the string shows £12.99.
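As a rough illustration of the point above, here is a minimal Python 3 sketch showing that the escaped form and the visible character are the same string. The session in this answer is Python 2, where repr produces the u'\xa3...' form; in Python 3 the built-in ascii() gives a comparable escaped view. The variable names are only illustrative.

```python
import unicodedata

# "\xa3" is an escape sequence for U+00A3, so this is the same
# six-character string as "£12.99".
price = "\xa312.99"

# The escape only affects how the string is displayed, not its content.
same_string = (price == "£12.99")
first_char_name = unicodedata.name(price[0])

# ascii() escapes non-ASCII characters, similar to Python 2's repr().
escaped_view = ascii(price)

print(same_string)
print(first_char_name)
print(escaped_view)
```

In other words, the values Scrapy returns do appear in the page source; they are just shown in escaped form when the list is displayed.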

If you wish to strip the pound signs from the list, you could do this:

In [108]: data = [u'\xa312.99',
 u'\xa38.99',
 u'\xa38.99',
 u'\xa34.49',
 u'\xa34.49',
 u'\xa329.99',
 u'\xa329.99']

In [110]: [float(item.lstrip(u'\xa3')) for item in data]
Out[110]: [12.99, 8.99, 8.99, 4.49, 4.49, 29.99, 29.99]
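Since these values are prices, one possible variation (not part of the answer above, just a sketch in Python 3) is to convert to decimal.Decimal instead of float, which avoids binary floating-point rounding when the amounts are later summed or compared. The helper name parse_prices and the POUND constant are hypothetical, and the input is assumed to contain only a leading pound sign plus digits.

```python
from decimal import Decimal

POUND = "\xa3"  # the pound sign, U+00A3


def parse_prices(raw_prices):
    """Strip surrounding whitespace and a leading pound sign, then convert to Decimal."""
    return [Decimal(text.strip().lstrip(POUND)) for text in raw_prices]


# Sample values taken from the Scrapy shell output in the question.
data = ["\xa312.99", "\xa38.99", "\xa34.49", "\xa329.99"]
prices = parse_prices(data)
total = sum(prices)

print(prices)
print(total)
```

Whether float or Decimal is the better fit depends on what happens to the values afterwards; for display only, the float version above is fine.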

The following articles are "must-reads" for anyone dealing with Unicode:

and particularly for a Python-centric point of view:
