While scraping a website, I ran into an issue with promotional prices, where the old price is crossed out and replaced with a new one (using the <del> and <ins> HTML tags).
Here is the HTML source of the fragment I am trying to extract:
<span class="price"><del> <span class="woocommerce-Price-amount amount"> <bdi>49,00 <span class="woocommerce-Price-currencySymbol">MAD</span> </bdi></span></del> <ins><span class="woocommerce-Price-amount amount"><bdi>35,00 <span class="woocommerce-Price-currencySymbol">MAD</span></bdi></span></ins></span>
I am trying to select only the part inside <ins>.
So far I have used this code to extract the price, but it makes no distinction between the crossed-out price and the current price.
sourceCode.xpath('//span[@class="price"]/descendant::node()/text()')
I cannot figure out how to select only the <ins> part.
How about we follow what <del> means conceptually and remove the del elements before extracting the price values:
In [1]: from lxml import html
In [2]: data = """<span class="price"><del>
...: <span class="woocommerce-Price-amount amount">
...: <bdi>49,00 <span class="woocommerce-Price-currencySymbol">MAD</span>
...: </bdi></span></del>
...: <ins><span class="woocommerce-Price-amount amount"><bdi>35,00 <span class="woocommerce-Price-currencySymbol">MAD</span></bdi></span></ins></span>"""
In [3]: root = html.fromstring(data)
In [4]: for del_element in root.xpath('//span[@class="price"]//del'):
...:     del_element.getparent().remove(del_element)
In [5]: root.xpath('//span[@class="price"]/descendant::node()/text()')
Out[5]: ['35,00\xa0', 'MAD']
I'd argue this is likely better than trying to write XPath expressions that handle both cases: a product with an old and a new price, and a product with just a single price.
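For comparison, here is a minimal sketch of the XPath-only route, which has to deal with both cases explicitly. The helper name current_price_text is hypothetical, and the sketch assumes that a non-promotional product has no <ins> element inside the price span, which the question does not confirm.

```python
from lxml import html


def current_price_text(fragment):
    """Return the non-struck price text from a price <span> (hypothetical helper)."""
    root = html.fromstring(fragment)
    # Prefer the text inside <ins>, i.e. the promotional price.
    parts = root.xpath('//span[@class="price"]//ins//text()')
    if not parts:
        # Assumption: without <ins> there is only a single, regular price.
        parts = root.xpath('//span[@class="price"]//text()')
    # Join the text nodes and collapse whitespace, including non-breaking spaces.
    return " ".join("".join(parts).split())
```

Called on the fragment from the question, this should return the price and currency from the <ins> part as one string. The fallback branch is exactly the extra case handling that removing the del elements up front avoids.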