
How to select a specific HTML element in Python with lxml's html.xpath

While scraping the content of a website, I ran into an issue with promotional prices, which are crossed out and replaced by another price (using the <del> and <ins> HTML tags).

Here is the HTML source code of the bit I am trying to take:

 <span class="price"><del> <span class="woocommerce-Price-amount amount"> <bdi>49,00&nbsp;<span class="woocommerce-Price-currencySymbol">MAD</span> </bdi></span></del> <ins><span class="woocommerce-Price-amount amount"><bdi>35,00&nbsp;<span class="woocommerce-Price-currencySymbol">MAD</span></bdi></span></ins></span>

I am trying to select only the part inside <ins>.

So far I have used this code to extract the price, but it makes no distinction between the crossed-out price and the actual price.

sourceCode.xpath('//span[@class="price"]/descendant::node()/text()')

I cannot figure out how to only select the <ins> part.

How about we actually follow what <del> means conceptually and remove del elements before extracting price values:

In [1]: from lxml import html

In [2]: data = """<span class="price"><del>
   ...: <span class="woocommerce-Price-amount amount">
   ...: <bdi>49,00&nbsp;<span class="woocommerce-Price-currencySymbol">MAD</span>
   ...: </bdi></span></del> 
   ...: <ins><span class="woocommerce-Price-amount amount"><bdi>35,00&nbsp;<span class="woocommerce-Price-currencySymbol">MAD</span></bdi></span></ins></span>"""

In [3]: root = html.fromstring(data)

In [4]: for del_element in root.xpath('//span[@class="price"]//del'):
   ...:     del_element.getparent().remove(del_element)

In [5]: root.xpath('//span[@class="price"]/descendant::node()/text()')
Out[5]: ['35,00\xa0', 'MAD']

I'd argue this is likely better than trying to write XPath expressions to handle the cases where they have both old and new prices or just a single price.
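To make that comparison concrete, here is a minimal sketch of both approaches side by side. The helper names `current_price` and `ins_only` are made up for illustration, and the `REGULAR` snippet is a hypothetical non-promotional price that I am assuming uses the same markup without <del>/<ins>; check it against the real pages.

```python
from lxml import html

# The promotional markup from the question.
PROMO = (
    '<span class="price"><del> <span class="woocommerce-Price-amount amount"> '
    '<bdi>49,00&nbsp;<span class="woocommerce-Price-currencySymbol">MAD</span> '
    '</bdi></span></del> <ins><span class="woocommerce-Price-amount amount">'
    '<bdi>35,00&nbsp;<span class="woocommerce-Price-currencySymbol">MAD</span>'
    '</bdi></span></ins></span>'
)

# Hypothetical regular price (assumption: same structure, no <del>/<ins>).
REGULAR = (
    '<span class="price"><span class="woocommerce-Price-amount amount">'
    '<bdi>20,00&nbsp;<span class="woocommerce-Price-currencySymbol">MAD</span>'
    '</bdi></span></span>'
)


def current_price(fragment):
    """Drop any <del> elements, then join the remaining price text."""
    root = html.fromstring(fragment)
    for del_element in root.xpath('//span[@class="price"]//del'):
        del_element.getparent().remove(del_element)
    parts = root.xpath('//span[@class="price"]//text()')
    # &nbsp; is parsed as '\xa0'; turn it into a plain space for readability.
    return "".join(parts).replace("\xa0", " ").strip()


def ins_only(fragment):
    """Direct XPath alternative: select only the text inside <ins>."""
    root = html.fromstring(fragment)
    return root.xpath('//span[@class="price"]/ins//text()')


print(current_price(PROMO))    # 35,00 MAD
print(current_price(REGULAR))  # 20,00 MAD
print(ins_only(PROMO))         # ['35,00\xa0', 'MAD']
print(ins_only(REGULAR))       # []
```

The direct `/ins//text()` expression does answer the question as asked, but it returns nothing when there is no promotion, so you would need a second expression as a fallback. Removing <del> first lets one expression cover both cases.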


 