Strange XPath results in Scrapy shell
I'm trying to select an item on this page:
http://www.betterware.co.uk/catalog/product/view/id/4530/category/342/
using variations of XPath such as:
sel.xpath('//div[@class="price-box"]/span[@class="regular-price"]/span[@class="price"]/text()').extract()
The HTML source I'm looking at is:
<div class="price-box">
<span class="regular-price" id="product-price-4530">
<span class="price">£12.99</span>
</span>
</div>
Rather than getting the expected [u'£12.99'], I get a bunch of other numbers that don't even appear in the page source. Scrapy shell gives:
[u'\xa312.99',
u'\xa38.99',
u'\xa38.99',
u'\xa34.49',
u'\xa34.49',
u'\xa329.99',
u'\xa329.99']
I've had no trouble selecting other items in this manner, but this and all my other price fields return these mysterious results for the price text. Can someone please shed some light on this for me?
My Python code for the item selection is:
def parse_again(self, response):
    sel = Selector(response)
    meta = sel.xpath('//div[@class="product-main-info"]')
    items = []
    for m in meta:
        item = BetterItem()
        item['link'] = response.url
        item['item_name'] = m.select('//div[@class="product-name"]/h1/text()').extract()
        item['sku'] = m.select('//p[@class="product-ids"]/text()').extract()
        item['price'] = m.select('//div[@class="price-box"]/span/span/text()').extract()
        items.append(item)
    return items
There is nothing wrong with the result being returned by Scrapy.
u'\xa3' is the pound sign:
In [99]: import unicodedata as UD
In [100]: UD.name(u'\xa3')
Out[100]: 'POUND SIGN'
In [101]: print(u'\xa3')
£
u'\xa312.99' is the pound sign u'\xa3' followed by the unicode string u'12.99'.
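To confirm that the escaped form and the rendered form are one and the same string, here is a minimal sketch. It assumes Python 3, where the `u` prefix is optional and `ascii()` produces an escaped form similar to what the Python 2 shell echoes. The variable names are only illustrative.

```python
import unicodedata

# The first value returned by the Scrapy shell in the question.
price = u'\xa312.99'

# The escape \xa3 and the literal character are the same code point,
# so both spellings compare equal.
assert price == u'£12.99'
assert unicodedata.name(price[0]) == 'POUND SIGN'

# ascii() returns the escaped representation of the string.
escaped = ascii(price)
print(escaped)

# The string is six characters long: the pound sign plus '12.99'.
print(len(price))
```

The escaped output is only how the interpreter displays the list; the data itself already contains the pound sign.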
If you wish to strip the pound signs from the list, you could do this:
In [108]: data = [u'\xa312.99',
u'\xa38.99',
u'\xa38.99',
u'\xa34.49',
u'\xa34.49',
u'\xa329.99',
u'\xa329.99']
In [110]: [float(item.lstrip(u'\xa3')) for item in data]
Out[110]: [12.99, 8.99, 8.99, 4.49, 4.49, 29.99, 29.99]
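If the prices are later summed or compared, `float` can introduce rounding noise. A possible variation on the snippet above, sketched with `decimal.Decimal` from the standard library: the helper name `parse_price`, the whitespace stripping, and the handling of thousands separators are assumptions added for illustration, not part of the original answer.

```python
from decimal import Decimal

POUND = u'\xa3'  # POUND SIGN


def parse_price(text, symbol=POUND):
    """Hypothetical helper: turn a string like u'\xa312.99' into a Decimal."""
    # Drop surrounding whitespace, then the leading currency symbol.
    cleaned = text.strip().lstrip(symbol)
    # Assumption: a comma is only ever used as a thousands separator.
    cleaned = cleaned.replace(',', '')
    return Decimal(cleaned)


data = [u'\xa312.99', u'\xa38.99', u'\xa38.99', u'\xa34.49',
        u'\xa34.49', u'\xa329.99', u'\xa329.99']

prices = [parse_price(item) for item in data]
total = sum(prices)
print(total)
```

`Decimal` keeps the amounts exact to the penny, which matters more for totals than for displaying a single scraped value.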
The following articles are "must-reads" for anyone dealing with unicode:
and particularly for a Python-centric point of view: