How can I simplify this Python regex code?

I'm sure there is a better way to clean up the text from this section of my web scrape. Can someone walk me through it?

#Query:[<div class="price">
<span class="price-currency">$</span>
<label for="low-price" hidden="">Low Price</label>
<input class="price-filter" data-val="true" data-val-number="The field LowPrice must be a number." data-val-required="The LowPrice field is required." id="low-price" name="SearchCriteria.LowPrice" placeholder="Min" type="text" value="0.00">
<span class="price-currency">$</span>
<label for="high-price" hidden="">Low Price</label>
<input class="price-filter" data-val="true" data-val-number="The field HighPrice must be a number." data-val-required="The HighPrice field is required." id="high-price" name="SearchCriteria.HighPrice" placeholder="Max" type="text" value="999999.00">
</input></input></div>, <div class="price">
$1,001.00                                    </div>]

prices = soup.find_all("div", {"class": "price"})
pricevalues = []

for price in prices:
    cleanPrice = price.text
    finalPrice = re.sub(r"\s\s+", " ", cleanPrice)
    finalPrice2 = re.sub(r"Low Price", "", finalPrice)
    finalPrice3 = re.sub(r"\n", "", finalPrice2)
    finalPrice4 = re.sub(r" ", "", finalPrice3)
    finalPrice5 = re.sub(r"\s\w", "", finalPrice4)
    finalPrice6 = re.sub(r"\s*$", "", finalPrice5)
    finalPrice7 = re.sub(r"\$\$", "", finalPrice6)
    pricevalues.append(finalPrice7)

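You can pass a text argument to find_all so that only the div whose string matches the pattern is returned (Beautiful Soup 4.4.0 and later call this argument string; text is the older name):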

import re
from bs4 import BeautifulSoup

html_doc = """#Query:[<div class="price">
<span class="price-currency">$</span>
<label for="low-price" hidden="">Low Price</label>
<input class="price-filter" data-val="true" data-val-number="The field LowPrice must be a number." data-val-required="The LowPrice field is required." id="low-price" name="SearchCriteria.LowPrice" placeholder="Min" type="text" value="0.00">
<span class="price-currency">$</span>
<label for="high-price" hidden="">Low Price</label>
<input class="price-filter" data-val="true" data-val-number="The field HighPrice must be a number." data-val-required="The HighPrice field is required." id="high-price" name="SearchCriteria.HighPrice" placeholder="Max" type="text" value="999999.00">
</input></input></div>, <div class="price">
$1,001.00                                    </div>]"""

soup = BeautifulSoup(html_doc, 'html.parser')
prices = soup.find_all("div", {"class": "price"}, text=re.compile(r'1,001\.00'))

print(prices[0].text.strip())

Outputs:

$1,001.00
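
If the price is not known in advance, one alternative is to replace the chain of re.sub calls with a single search for the price itself. The following is only a minimal sketch: the pattern PRICE_RE and the helper extract_price are hypothetical names, the pattern assumes prices look like "$1,001.00", and the two sample strings only approximate what price.text returns for the divs in the question.

```python
import re

# Assumed price format: a dollar sign, digits with optional thousands
# separators, and an optional decimal part (e.g. "$1,001.00").
PRICE_RE = re.compile(r"\$\d[\d,]*(?:\.\d+)?")


def extract_price(text):
    """Return the first dollar amount found in text, or None if there is none."""
    match = PRICE_RE.search(text)
    return match.group() if match else None


# Approximations of price.text for the two divs in the question:
# the filter form (bare "$" signs and labels) and the actual listing price.
filter_text = "\n$\nLow Price\n\n$\nLow Price\n\n"
listing_text = "\n$1,001.00                                    "

# Keep only the texts that actually contain a price.
pricevalues = [p for p in map(extract_price, [filter_text, listing_text]) if p]
print(pricevalues)
```

In the original loop this would amount to calling extract_price(price.text) and appending the result only when it is not None, so the filter form div is skipped without any extra substitutions.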
