
Extract data when there is no unique class in BeautifulSoup

This is what part of the HTML looks like.

<div class="field-wrapper field field-node--field-test-synonyms field-name-field-test-synonyms field-type-string field-label-inline clearfix">
    <div class="field-label">Also Known As</div>
    <div class="field-items">
          <div class="field-item">17-OHP</div>
          <div class="field-item">17-OH Progesterone </div>
      </div>
</div>

What I am trying to do is extract the two names, 17-OHP and 17-OH Progesterone.

My Code

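import requests
from bs4 import BeautifulSoup

sub_url = "https://labtestsonline.org/tests/17-hydroxyprogesterone"
response = requests.get(sub_url)
soup = BeautifulSoup(response.content, 'lxml')
other_names = []
# Matches every div with class "field-items" on the page, not only the synonyms block
table = soup.find_all('div', attrs={"class": "field-items"})
for x in table:
    print(x.text)
    other_names.append(x.text)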

The problem is that the class field-items is used in many places on the page, so I get a lot of unexpected text. How can I target only this block when it has no unique class of its own? The output I expect is other_names = ['17-OHP', '17-OH Progesterone']

Thank You.

You can search for the div with class field-label, and then call .next to step forward to the field-items div that follows it:

import requests
from bs4 import BeautifulSoup

sub_url = "https://labtestsonline.org/tests/17-hydroxyprogesterone"
response = requests.get(sub_url)

soup = BeautifulSoup(response.content, 'html.parser')

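# With the HTML shown in the question: soup.find() returns the first
# div.field-label, and iterating over that tag yields its children, here the
# label text. From the text node, .next twice skips the whitespace after the
# label and lands on the div.field-items.
other_names = [
    tag.next.next.get_text(strip=True, separator='|').split('|')
    for tag in soup.find('div', class_='field-label')
]
# other_names holds one inner list, so unpack it when printing
print(*other_names)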

Output:

['17-OHP', '17-OH Progesterone']
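The .next chain above depends on the exact whitespace between the tags and on the first field-label being the one you want. As a hedged alternative, here is a minimal sketch of two ways to anchor on something more specific. It runs against the HTML snippet from the question instead of the live page, so it assumes the real page has the same structure and that the label text is exactly "Also Known As". The function names names_by_label and names_by_wrapper are made up for this example.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question, so the sketch runs without a network call.
HTML = """
<div class="field-wrapper field field-node--field-test-synonyms field-name-field-test-synonyms field-type-string field-label-inline clearfix">
    <div class="field-label">Also Known As</div>
    <div class="field-items">
          <div class="field-item">17-OHP</div>
          <div class="field-item">17-OH Progesterone </div>
      </div>
</div>
"""


def names_by_label(soup, label_text="Also Known As"):
    """Find the label div by its text, then read the field-items div beside it."""
    label = soup.find("div", class_="field-label", string=label_text)
    if label is None:
        return []
    # find_next_sibling skips the whitespace text node between the two divs
    items = label.find_next_sibling("div", class_="field-items")
    if items is None:
        return []
    return [item.get_text(strip=True) for item in items.find_all("div", class_="field-item")]


def names_by_wrapper(soup):
    """Select the items under the wrapper whose class names the synonyms field."""
    # Assumption: field-name-field-test-synonyms appears only on this wrapper
    selector = "div.field-name-field-test-synonyms div.field-item"
    return [item.get_text(strip=True) for item in soup.select(selector)]


soup = BeautifulSoup(HTML, "html.parser")
other_names = names_by_label(soup)
print(other_names)  # ['17-OHP', '17-OH Progesterone']
```

Either function can be called on the soup built from response.content in the answer above. The label-based version is the safer choice if other fields on the page reuse the same wrapper classes; the selector version is shorter if the wrapper class really is unique.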


 