
Extracting certain text in python BS4?

I am trying to extract certain text with BS4. Sample HTML is below.

 </tr><tr id="_Gonzaga" class="seedrow"> <td title="Click to show/hide ranks" class='lowrowclick' style="text-align:center;font-size:8px">2</td> <td id='Gonzaga' class="teamname"><a href="team.php?team=Gonzaga&year=2019" style="text-decoration: none;">Gonzaga<span class="lowrow" style="font-size:10px"><br/>&nbsp;&nbsp;&nbsp;1 seed, <span style='background-color:#BAE2C6'>Elite Eight</span></span></a></td>

Current code is:

data = soup.findAll('tr', attrs={"class": "seedrow"})
for item in data:  # iterate over the matched rows so that `item` is defined
    team_name = item.find('td', class_='teamname')
    team_id = team_name.find('a').contents[0]
    seed = team_name.find('span').text
    print(team_id, seed)

This returns:

Gonzaga, '\xa0\xa0\xa01 seed, Elite Eight'

What I want:

Gonzaga, 1 seed, Elite Eight

If I understand you correctly, you want to extract three separate strings. You can call .get_text() with strip=True and a custom separator= character, then split the result on that character:

from bs4 import BeautifulSoup


txt = '''
<tr id="_Gonzaga" class="seedrow">
<td title="Click to show/hide ranks" class='lowrowclick' style="text-align:center;font-size:8px">2</td>
<td  id='Gonzaga' class="teamname"><a href="team.php?team=Gonzaga&year=2019" style="text-decoration: none;">Gonzaga<span class="lowrow" style="font-size:10px"><br/>&nbsp;&nbsp;&nbsp;1 seed, <span style='background-color:#BAE2C6'>Elite Eight</span></span></a></td>
</tr>'''

soup = BeautifulSoup(txt, 'html.parser')
data = soup.findAll('tr', attrs={"class": "seedrow"})

for item in data:
    team_name = item.find('td', class_ = 'teamname')

    a, b, c = team_name.get_text(strip=True, separator='|').split('|')

    print(a)
    print(b.strip(','))
    print(c)

Prints:

Gonzaga
1 seed
Elite Eight
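
If you would rather get the single line shown under "What I want", a minimal sketch along the same lines is below. It uses .stripped_strings instead of get_text() plus split(), so no separator character has to be chosen. The helper name team_summary and the trimmed sample HTML (styles and the query string left out) are only for illustration and are not part of the original page.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample cell from the question
# (inline styles and the href query string are omitted for brevity).
html = (
    '<table><tr class="seedrow"><td class="teamname"><a href="team.php">Gonzaga'
    '<span class="lowrow"><br/>&nbsp;&nbsp;&nbsp;1 seed, '
    '<span>Elite Eight</span></span></a></td></tr></table>'
)

soup = BeautifulSoup(html, 'html.parser')


def team_summary(cell):
    # Hypothetical helper: stripped_strings yields each text node with
    # leading/trailing whitespace (including the \xa0 from &nbsp;) removed.
    parts = [s.rstrip(',') for s in cell.stripped_strings]
    return ', '.join(parts)


rows = [team_summary(td) for td in soup.find_all('td', class_='teamname')]
print(rows[0])  # Gonzaga, 1 seed, Elite Eight
```

Joining the parts afterwards keeps the pieces available separately (team, seed, round) in case you need them as individual values later.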
