
Python: extract certain values by using bs4

HTML:

<div class="col-7"> 
    <dl class="row box">
        <h2>GENERAL</h2>
        <dt class="col-6">transmission:</dt>
        <dd class="col-6">sequential automatic</dd>
        <dt class="col-6 grey">number of seats:</dt>
        <dd class="col-6">5</dd>
        <dt class="col-6">first year of production:</dt>
        <dd class="col-6">2017</dd>
        <dt class="col-6 grey">last year of production:</dt>
        <dd class="col-6">available</dd>
    </dl>
    <dl class="row box">
        <h2>DRIVE</h2>
        <dt class="col-6">fuel:</dt>
        <dd class="col-6">petrol</dd>
        <dt class="col-6 grey">total maximum power:</dt>
        <dd class="col-6">147 kW (200 hp)</dd>
        <dt class="col-6">total maximum torque:</dt>
        <dd class="col-6">330 Nm</dd>
    </dl>
    <dl class="row box">
        <h2>TRANSMISSION</h2>
        <dt class="col-6">1st gear:</dt>
        <dd class="col-6">5,00:1</dd>
        <dt class="col-6 grey">2nd gear:</dt>
        <dd class="col-6">3,20:1</dd>
    </dl>
</div>

My code:

# soup2 is the BeautifulSoup object for the scraped page
for item2 in soup2.find_all(attrs={'class': 'col-7'}):
    jj = item2.text  # all text inside the col-7 container

jj contains all the text from the page I scraped, but I only need a few values from it. For example, I only need the values of "number of seats" and "last year of production" from GENERAL and the value of "1st gear" from TRANSMISSION.

The result should be:

5, available, 5,00:1

The information you need is simply the item that follows each of the titles "number of seats", "last year of production", and "1st gear", so you can loop over each item together with its next item by using zip:

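# soup is the parsed page (soup2 in the question); dt and dd elements
# both carry the col-6 class, so labels and values alternate in this list
all_items = soup.find_all(attrs={'class': 'col-6'})
titles = [
    "number of seats",
    "last year of production",
    "1st gear"
]
d = {title: [] for title in titles}

# pair each element with the one that follows it
for item, next_item in zip(all_items, all_items[1:]):
    for title in titles:
        if title in item.text:
            d[title].append(next_item.text)
            break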

Then d will contain all the information you need, with each title mapped to a list of the values found for it.
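The pairing idea itself does not depend on BeautifulSoup: zip(seq, seq[1:]) yields each element together with its successor. Here is a minimal sketch on plain strings standing in for the .text of the col-6 elements. The list texts is copied by hand from the question's HTML, and the names texts, found and result are only illustrative.

```python
# Plain strings standing in for [el.text for el in all_items];
# copied from the GENERAL and TRANSMISSION blocks of the question's HTML.
texts = [
    "transmission:", "sequential automatic",
    "number of seats:", "5",
    "first year of production:", "2017",
    "last year of production:", "available",
    "1st gear:", "5,00:1",
    "2nd gear:", "3,20:1",
]
titles = ["number of seats", "last year of production", "1st gear"]

found = {}
# zip(texts, texts[1:]) yields (element, following element) pairs.
for text, next_text in zip(texts, texts[1:]):
    for title in titles:
        if title in text:
            found[title] = next_text
            break

result = ", ".join(found[title] for title in titles)
print(result)  # 5, available, 5,00:1
```

This sketch keeps a single value per title. The dictionary of lists used in the answer is the safer choice if a label may appear more than once on the page.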

Change the find_values tuple to choose which values to extract from the HTML text:

from bs4 import BeautifulSoup

# html is assumed to hold the page source shown in the question
soup = BeautifulSoup(html, 'html.parser')
find_values = ('number of seats', 'last year of production', '1st gear')
for i in soup.find_all(attrs={'class': 'row box'}):
    for j in i.find_all('dt'):
        text = j.get_text().lower().strip()
        # str.startswith accepts a tuple and matches any of its prefixes
        if text.startswith(find_values):
            print(text, j.find_next_sibling('dd').get_text())
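If you want the values collected rather than printed, the same dt / find_next_sibling('dd') lookup can fill a dictionary. The following is only a sketch under a few assumptions: extract_values is a hypothetical helper name, the HTML is a shortened copy of the question's markup, and labels are matched exactly (after stripping the trailing colon) instead of with startswith.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's HTML, kept only for illustration.
html = """
<div class="col-7">
    <dl class="row box">
        <h2>GENERAL</h2>
        <dt class="col-6 grey">number of seats:</dt>
        <dd class="col-6">5</dd>
        <dt class="col-6 grey">last year of production:</dt>
        <dd class="col-6">available</dd>
    </dl>
    <dl class="row box">
        <h2>TRANSMISSION</h2>
        <dt class="col-6">1st gear:</dt>
        <dd class="col-6">5,00:1</dd>
        <dt class="col-6 grey">2nd gear:</dt>
        <dd class="col-6">3,20:1</dd>
    </dl>
</div>
"""


def extract_values(html, wanted):
    """Hypothetical helper: map each wanted <dt> label to its <dd> text."""
    soup = BeautifulSoup(html, "html.parser")
    values = {}
    for dt in soup.find_all("dt"):
        # Normalise "Number of seats:" to "number of seats".
        label = dt.get_text().strip().rstrip(":").lower()
        dd = dt.find_next_sibling("dd")
        if label in wanted and dd is not None:
            values[label] = dd.get_text().strip()
    return values


wanted = ("number of seats", "last year of production", "1st gear")
values = extract_values(html, wanted)
print(", ".join(values[w] for w in wanted if w in values))  # 5, available, 5,00:1
```

Returning a dictionary keeps the lookup separate from the output format, so the caller decides the order and separator of the final string.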
