
Scrape table element by row name using beautifulsoup

Here is the HTML that I want to scrape:

<dl class="some class">
    <dt> <strong>Text1</strong></dt>
    <dd> Result1</dd>
    <dt> <strong>Text2</strong></dt>
    <dd> Result2</dd>
    <dt> <strong>Text3</strong></dt>
    <dd> Result3</dd>
    <dt> <strong>Text4</strong></dt>
    <dd> Result4</dd>
    .  .  .
</dl>

I want to get the Result3 value that sits right after Text3. In Selenium, I would do this with:

parent=driver.find_element_by_css_selector("dl.BuyingOptions-labeledValues")
elem=parent.find_element_by_xpath("//dt[contains(.,'Text3')]/following::dd[1]")

I now want to do the same thing with BeautifulSoup. I first tried:

parent=soup.find("dl","BuyingOptions-labeledValues")

which works fine: print(parent.text) prints all of the table text. Then I tried:

elem = parent.find("dt",string='Country Of Origin')

This does not work. Can someone please help? I am new to BeautifulSoup.

You can use a CSS selector with bs4 4.7.1+: dt:contains("Text3") + dd. This selects the <dd> that is placed immediately after a <dt> containing the text "Text3":

data = '''
<dl class="some class">
    <dt> <strong>Text1</strong></dt>
    <dd> Result1</dd>
    <dt> <strong>Text2</strong></dt>
    <dd> Result2</dd>
    <dt> <strong>Text3</strong></dt>
    <dd> Result3</dd>
    <dt> <strong>Text4</strong></dt>
    <dd> Result4</dd>
</dl>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')

print( soup.select_one('dt:contains("Text3") + dd').get_text(strip=True) )

Prints:

Result3
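
A note on the selector above: newer releases of soupsieve (the selector engine behind bs4's select methods) deprecate :contains in favour of :-soup-contains. The sketch below is the same lookup written with the newer spelling. It assumes a soupsieve version recent enough to know :-soup-contains (older ones will raise a selector syntax error), and the variable names html, dd and result are mine, not from the question.

```python
from bs4 import BeautifulSoup

html = '''
<dl class="some class">
    <dt> <strong>Text1</strong></dt>
    <dd> Result1</dd>
    <dt> <strong>Text2</strong></dt>
    <dd> Result2</dd>
    <dt> <strong>Text3</strong></dt>
    <dd> Result3</dd>
    <dt> <strong>Text4</strong></dt>
    <dd> Result4</dd>
</dl>'''

soup = BeautifulSoup(html, 'html.parser')

# Same idea as dt:contains("Text3") + dd, using the non-deprecated pseudo-class.
# Assumption: the installed soupsieve supports :-soup-contains.
dd = soup.select_one('dt:-soup-contains("Text3") + dd')

# select_one returns None when nothing matches, so guard before reading the text.
result = dd.get_text(strip=True) if dd is not None else None
print(result)
```

Keep in mind that both spellings do a substring match, so "Text3" would also match a label such as "Text30".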

Further reading:

CSS Selectors Reference


Another method (using bs4 filtering):

print( soup.find(lambda t: t.name=='dt' and t.text.strip()=='Text3').find_next_sibling() )

Prints:

<dd> Result3</dd>
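
As for why the attempt in the question returns nothing: as far as I can tell, find("dt", string=...) compares against the tag's .string, and .string is None whenever a tag has more than one child. Here each <dt> holds a whitespace text node plus the <strong>, so the match never succeeds. Below is a minimal sketch that shows this and wraps the lookup in a small helper. The helper name dd_for_label is hypothetical (it is not part of bs4), and it assumes an exact match on the stripped label text.

```python
from bs4 import BeautifulSoup

html = '''
<dl class="some class">
    <dt> <strong>Text1</strong></dt>
    <dd> Result1</dd>
    <dt> <strong>Text3</strong></dt>
    <dd> Result3</dd>
</dl>'''

soup = BeautifulSoup(html, 'html.parser')

# Each <dt> has two children (a whitespace text node and <strong>),
# so .string is None and a string= filter cannot match it.
first_dt = soup.find('dt')
print(first_dt.string)
print(soup.find('dt', string='Text3'))


def dd_for_label(container, label):
    """Hypothetical helper: return the <dd> following the <dt> whose
    stripped text equals label, or None if there is no such <dt>."""
    for dt in container.find_all('dt'):
        if dt.get_text(strip=True) == label:
            return dt.find_next_sibling('dd')
    return None


match = dd_for_label(soup, 'Text3')
print(match.get_text(strip=True) if match is not None else None)
```

Comparing get_text(strip=True) instead of .string sidesteps the nested <strong> and the leading space, which is also what the lambda in the answer above relies on via t.text.strip().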
