
Using REGEX to match elements between lines in Python

I'm looking to use a regex to extract a quantity from the HTML of a shopping website. In the following example, I want to get "12.5 kilograms". However, the quantity in the first span is not always listed in kilograms; it could be in lbs., oz., etc.

        <td class="size-price last first" colspan="4">
                    <span>12.5 kilograms </span>
            <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
                    </span>
                </td>

The code above is only a small portion of what is actually extracted using BeautifulSoup. On every page, the quantity is always inside a span, on a new line after

<td class="size-price last first" colspan="4">  

I've used REGEX in the past but I am far from an expert. I'd like to know how to match elements between different lines. In this case between

<td class="size-price last first" colspan="4">

and

<span> <span class="strike">

Avoid parsing HTML with regex. Use the right tool for the job: an HTML parser such as BeautifulSoup. It is powerful, easy to use, and handles your case well:

from bs4 import BeautifulSoup


data = """
<td class="size-price last first" colspan="4">
                    <span>12.5 kilograms </span>
            <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
                    </span>
                </td>"""
# Name the parser explicitly; "html.parser" ships with Python
soup = BeautifulSoup(data, "html.parser")

print(soup.td.span.text)

prints:

12.5 kilograms 

Or, if the td is a part of a bigger structure, find it by class and get the first span's text out of it:

print(soup.find('td', {'class': 'size-price'}).span.text)
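
To make the "bigger structure" case concrete, here is a minimal sketch. The surrounding table and the "Sample product" cell are made up for illustration; only the size-price cell comes from the question. It also guards against find() returning None when no cell matches, and uses get_text(strip=True) to drop the trailing space:

```python
from bs4 import BeautifulSoup

# Hypothetical larger fragment; only the size-price td is from the question.
html = """
<table>
  <tr>
    <td class="name">Sample product</td>
    <td class="size-price last first" colspan="4">
      <span>12.5 kilograms </span>
      <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
      </span>
    </td>
  </tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

# find() returns None when nothing matches, so check before using the result.
cell = soup.find("td", {"class": "size-price"})

# .span is the first span inside the cell; strip=True trims whitespace.
quantity = cell.span.get_text(strip=True) if cell is not None else None
print(quantity)
```

Matching on 'size-price' works even though the attribute holds several classes, because BeautifulSoup compares the value against each class in the list.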

UPD (handling multiple results):

print([td.span.text for td in soup.find_all('td', {'class': 'size-price'})])
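
Since the unit varies (kilograms, lbs., oz., ...), it may help to split each extracted string into a number and a unit. This is a rough sketch, not part of the original answer: split_quantity is a hypothetical helper, the sample strings are invented, and it assumes the text is always a plain decimal number followed by the unit. Here a regex is fine, because it runs on the extracted text rather than on the HTML:

```python
import re

# Hypothetical sample values, shaped like the strings the list
# comprehension above would return for several matching cells.
texts = ["12.5 kilograms ", "5 lbs. ", "16 oz."]


def split_quantity(text):
    """Return (number, unit), or None if the text is not a number followed by a unit."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(.+)", text.strip())
    if match is None:
        return None
    return float(match.group(1)), match.group(2)


for text in texts:
    print(split_quantity(text))
```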

Hope that helps.
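
For reference, the literal question (matching across lines with a regex) comes down to the fact that \s matches newlines, and that re.DOTALL lets . match them too. The sketch below assumes the markup looks exactly like the sample; it will break on small changes in attribute order or nesting, which is why the parser approach above is the more robust one:

```python
import re

data = """
<td class="size-price last first" colspan="4">
                    <span>12.5 kilograms </span>
            <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
                    </span>
                </td>"""

# \s* matches the newline and indentation between the <td ...> tag and the
# first <span>. re.DOTALL makes "." match newlines as well, in case the
# span's text itself wraps onto another line.
pattern = re.compile(
    r'<td class="size-price[^"]*"[^>]*>\s*<span>(.*?)</span>',
    re.DOTALL,
)

match = pattern.search(data)
quantity = match.group(1).strip() if match else None
print(quantity)
```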
