
How to extract number with xpath in python if there is text around the number?

I am trying to scrape prices from websites, and the problem is that some sites add extra text to the price field.

For example:

<span class="price--content content--default">
Ihr Preis:
13.815,00&nbsp;€
</span>

>>> response.xpath('//span[@class="price--content content--default"]/text()').extract()

['\n', '\n', '\nIhr Preis:\n13.815,00\xa0€\n']

Another example here:

<span class="price--content content--default">
Jetzt:
5.765,00&nbsp;€
</span>

How can I make sure xpath gets the number in all cases, even if there is no text but just the number?

As an alternative if not possible, how can I get the first number of the list with python?

You can do this even with an XPath 1.0 expression, provided that the element contains exactly one number and the Python module you are using can handle result types other than node-sets. Use:

translate(
   //span[@class="price--content content--default"],
   translate(//span[@class="price--content content--default"],'0123456789.,',''),
   '')
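To see why the nested translate() works, here is a minimal pure-Python sketch of the same idea. The helper `xpath_translate` is hypothetical: it only imitates the XPath 1.0 translate() semantics and is not part of any library. The sample text is assumed to be the string value of the span from the question.

```python
def xpath_translate(source: str, from_chars: str, to_chars: str) -> str:
    """Imitate XPath 1.0 translate(): map or delete characters one by one."""
    out = []
    for ch in source:
        idx = from_chars.find(ch)  # the first occurrence in from_chars wins
        if idx == -1:
            out.append(ch)  # not listed: keep unchanged
        elif idx < len(to_chars):
            out.append(to_chars[idx])  # listed with a counterpart: replace
        # listed without a counterpart: drop the character
    return "".join(out)


# Assumed string value of the span from the question
text = "\nIhr Preis:\n13.815,00\xa0€\n"

# Inner call: remove the wanted characters, leaving only the unwanted ones.
unwanted = xpath_translate(text, "0123456789.,", "")
# Outer call: remove every unwanted character from the original text.
price = xpath_translate(text, unwanted, "")
print(price)  # 13.815,00
```

The result is still a string in German number format, so it needs a further conversion step before it can be used as a number.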

You can find the numbers with a regular expression. For example:

import re

string1 = '\nIhr Preis:\n13.815,00\xa0€\n'
string2 = '\nJetzt:\n5.765,00\xa0€\n'
# Match a run of digits plus any '.' or ',' separator that is followed by more digits
my_num = re.findall(r'\d+(?:[.,]\d+)*', string1)   # or string2
print(my_num)   # ['13.815,00']
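The question also asks how to get the first number out of the list that extract() returns. A possible sketch, assuming the prices always use "." for thousands and "," for decimals as in the two examples; the names `PRICE_RE` and `first_price` are made up for illustration:

```python
import re
from decimal import Decimal

# Digits, optional "." thousands groups, and an optional "," decimal part.
PRICE_RE = re.compile(r"\d+(?:\.\d{3})*(?:,\d+)?")


def first_price(chunks):
    """Return the first German-formatted price in a list of strings, or None."""
    for chunk in chunks:
        match = PRICE_RE.search(chunk)
        if match:
            # "13.815,00" -> "13815.00"
            normalized = match.group().replace(".", "").replace(",", ".")
            return Decimal(normalized)
    return None


extracted = ["\n", "\n", "\nIhr Preis:\n13.815,00\xa0€\n"]
print(first_price(extracted))          # 13815.00
print(first_price(["5.765,00\xa0€"]))  # 5765.00
```

Decimal is used instead of float so that the cents are not subject to binary rounding. The function returns None when no chunk contains a number.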

Since you are looking for a price, the problem is that, presumably, you need to extract the whole price, including cents (or whatever the equivalent is for the given currency). So, modifying your second example slightly:

my_str = '<span class="price--content content--default">Jetzt:5.765,12&nbsp;€</span>'

The output should be the whole price, 5.765,12. So, without using regex, I would suggest:

for char in my_str:
    # Keep only digits and the two separator characters
    if char.isdigit() or char == '.' or char == ',':
        print(char, end="")

Output:

5.765,12
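Printing character by character does not leave you with a value to work with, and filtering the raw markup would also pick up any digit, "." or "," that happens to appear in a tag or attribute. A small variant, under the assumption that a crude regex is enough to strip the tags of this one-line example (a real page is better handled with an HTML parser or the XPath text() result):

```python
import re
from html import unescape

my_str = '<span class="price--content content--default">Jetzt:5.765,12&nbsp;€</span>'

# Crude tag removal for this one-line example only, then decode "&nbsp;" and similar entities.
text = unescape(re.sub(r"<[^>]+>", "", my_str))

# Collect the wanted characters into a string instead of printing them one by one.
price = "".join(ch for ch in text if ch.isdigit() or ch in ".,")
print(price)  # 5.765,12
```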
