
Get specific text between tags - Python BeautifulSoup

I'm trying to get one specific piece of text, "D1. AGE". I'm using print(soup.find('tr',{'class':'subjectHeadRow'}).text), but it returns all of the text in the row: D1. AGE Universe: Total population Reference tables: B01001 B16001 B09020. What is the best way to get only the text "D1. AGE"?

<tr class='subjectHeadRow'><th colspan='7'>D1. AGE<a href='./charts.php?p=37&g=05000US36003|04000US36|01000US&c=1' target='_blank' title='Chart data'><img src='/apps/elements/images/chart.png' class='iconButton noPrint' alt=''/></a>
<p class='subjectMeta'>Universe: Total population</p>
<p class='subjectMeta'>Reference tables: <a href='http://factfinder2.census.gov/bkmk/table/1.0/en/ACS/18_5YR/B01001/0500000US36003|0400000US36|0100000US' target='_blank'>B01001</a> <a href='http://factfinder2.census.gov/bkmk/table/1.0/en/ACS/18_5YR/B16001/0500000US36003|0400000US36|0100000US' target='_blank'>B16001</a> <a href='http://factfinder2.census.gov/bkmk/table/1.0/en/ACS/18_5YR/B09020/0500000US36003|0400000US36|0100000US' target='_blank'>B09020</a> </p></th></tr>

Another question: I want to search an entire page and collect the class attribute of every td tag that has one. What would be the best way to achieve this? For instance, if my page has the tags below, I want to return the values ['indent0', 'value moeLow', 'value moeHigh'].

<td class='indent0' title='TotPop'>Total population</td>
<td></td>
<td class='value moeLow' title='+/- 0.00% (47025, 47025)'>47,025</td>
<td class='value moeHigh' title='+/- 0.00% (19618452, 19618452)'>19,618,452</td>
<td></td> 

To get the value D1. AGE, find the tr element, call find_next('th') on it, and then take contents[0], which is the first child node of the th.

html='''<tr class='subjectHeadRow'><th colspan='7'>D1. AGE<a href='./charts.php?p=37&g=05000US36003|04000US36|01000US&c=1' target='_blank' title='Chart data'><img src='/apps/elements/images/chart.png' class='iconButton noPrint' alt=''/></a>
<p class='subjectMeta'>Universe: Total population</p>
<p class='subjectMeta'>Reference tables: <a href='http://factfinder2.census.gov/bkmk/table/1.0/en/ACS/18_5YR/B01001/0500000US36003|0400000US36|0100000US' target='_blank'>B01001</a> <a href='http://factfinder2.census.gov/bkmk/table/1.0/en/ACS/18_5YR/B16001/0500000US36003|0400000US36|0100000US' target='_blank'>B16001</a> <a href='http://factfinder2.census.gov/bkmk/table/1.0/en/ACS/18_5YR/B09020/0500000US36003|0400000US36|0100000US' target='_blank'>B09020</a> </p></th></tr>'''

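from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
# contents[0] is the first child node of the <th>, i.e. the heading text
print(soup.find('tr', {'class': 'subjectHeadRow'}).find_next('th').contents[0])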
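
As a rough illustration of why .text returns too much and why the first child node is enough, here is a minimal sketch. It uses a shortened copy of the markup (the href and src values are placeholders, not the real ones), and it swaps contents[0] for find(string=True, recursive=False), which is an alternative way of asking for the first direct text node rather than the method used above.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the row from the question.
html = """<table><tr class='subjectHeadRow'><th colspan='7'>D1. AGE<a href='#'><img src='x.png' alt=''/></a>
<p class='subjectMeta'>Universe: Total population</p></th></tr></table>"""

soup = BeautifulSoup(html, "html.parser")
th = soup.find("tr", class_="subjectHeadRow").find("th")

# get_text() (like .text) concatenates every descendant string,
# so the <p class='subjectMeta'> text is included as well.
all_text = th.get_text(" ", strip=True)

# recursive=False limits the search to direct children of the <th>;
# string=True matches text nodes only, so this is the heading itself.
heading = th.find(string=True, recursive=False).strip()

print(all_text)
print(heading)
```

This variant should keep working if a tag is ever placed before the heading text inside the th, a case where contents[0] would return that tag instead.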

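For the second example, use class_=True (or the CSS selector td[class]) to match only the td tags that have a class attribute, then join each tag's class list into a single string. BeautifulSoup returns class as a list because it is a multi-valued attribute.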

html='''<td class='indent0' title='TotPop'>Total population</td>
<td></td>
<td class='value moeLow' title='+/- 0.00% (47025, 47025)'>47,025</td>
<td class='value moeHigh' title='+/- 0.00% (19618452, 19618452)'>19,618,452</td>
<td></td> '''
soup = BeautifulSoup(html, "html.parser")

# class_=True matches every <td> that has a class attribute
tds = [' '.join(td['class']) for td in soup.find_all('td', class_=True)]
print(tds)

# Or use a CSS attribute selector
tds = [' '.join(td['class']) for td in soup.select('td[class]')]
print(tds)

Output:

['indent0', 'value moeLow', 'value moeHigh']
['indent0', 'value moeLow', 'value moeHigh']
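
The join step exists because BeautifulSoup hands back the class attribute as a list of tokens. The sketch below, on a trimmed copy of the cells (title attributes mostly dropped for brevity), shows the raw values and, as an optional extra not asked for in the question, the individual class names de-duplicated in first-seen order.

```python
from bs4 import BeautifulSoup

# Trimmed, illustrative copy of the cells from the question.
html = """<table><tr>
<td class='indent0' title='TotPop'>Total population</td>
<td></td>
<td class='value moeLow'>47,025</td>
<td class='value moeHigh'>19,618,452</td>
</tr></table>"""

soup = BeautifulSoup(html, "html.parser")

# td.get("class") is a list of tokens, or None when the attribute is absent.
raw = [td.get("class") for td in soup.find_all("td")]

# One joined string per cell that has a class.
joined = [" ".join(c) for c in raw if c]

# Individual class names, de-duplicated while keeping first-seen order.
tokens = list(dict.fromkeys(name for c in raw if c for name in c))

print(joined)
print(tokens)
```

Using td.get("class") instead of td["class"] avoids a KeyError on cells without the attribute when the search is not already filtered with class_=True.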

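The th is a child of the row with class subjectHeadRow, so select it with a CSS selector. The code below uses the descendant combinator (.subjectHeadRow th); the child combinator (.subjectHeadRow > th) would match the same element here. Then use stripped_strings and take the string at index 0.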

from bs4 import BeautifulSoup as bs

html = '''<table>
<tr class='subjectHeadRow'><th colspan='7'>D1. AGE<a href='./charts.php?p=37&g=05000US36003|04000US36|01000US&c=1' target='_blank' title='Chart data'><img src='/apps/elements/images/chart.png' class='iconButton noPrint' alt=''/></a>
<p class='subjectMeta'>Universe: Total population</p>
<p class='subjectMeta'>Reference tables: <a href='http://factfinder2.census.gov/bkmk/table/1.0/en/ACS/18_5YR/B01001/0500000US36003|0400000US36|0100000US' target='_blank'>B01001</a> <a href='http://factfinder2.census.gov/bkmk/table/1.0/en/ACS/18_5YR/B16001/0500000US36003|0400000US36|0100000US' target='_blank'>B16001</a> <a href='http://factfinder2.census.gov/bkmk/table/1.0/en/ACS/18_5YR/B09020/0500000US36003|0400000US36|0100000US' target='_blank'>B09020</a> </p></th></tr></table>
'''

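soup = bs(html, 'lxml')  # needs the lxml package installed
# stripped_strings yields each text fragment under the <th>, whitespace-trimmed
print(list(soup.select_one('.subjectHeadRow th').stripped_strings)[0])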

Or, because stripped_strings is a generator, call next() on it once to get just the first string:

gen = soup.select_one('.subjectHeadRow th').stripped_strings
print(next(gen))
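
Both versions raise an error if the selector matches nothing or the element has no text. A slightly more defensive sketch follows; first_text is a hypothetical helper name introduced only for this illustration, the markup is a shortened placeholder copy of the row, and html.parser is used instead of lxml only so that no extra package is assumed.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the row from the question.
html = """<table>
<tr class='subjectHeadRow'><th colspan='7'>D1. AGE<a href='#'><img src='x.png' alt=''/></a>
<p class='subjectMeta'>Universe: Total population</p></th></tr></table>"""

soup = BeautifulSoup(html, "html.parser")


def first_text(soup, selector):
    """Return the first non-empty stripped string under the first match, or None."""
    element = soup.select_one(selector)  # None when nothing matches
    if element is None:
        return None
    # next() with a default avoids StopIteration on an element with no text.
    return next(element.stripped_strings, None)


found = first_text(soup, ".subjectHeadRow > th")
missing = first_text(soup, ".noSuchRow > th")
print(found)
print(missing)
```

Passing a default to next() means only the first string is ever generated, so nothing beyond the heading is read.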
