How can I use Beautiful Soup to iterate over a page and extract all the values?

Question

Here's part of an HTML page that I parse into a variable using Beautiful Soup. I need to extract some of the text values and insert them into a table later on. Specifically, I need each player's name, team and points.

I can get the first player name, and the second one using next_sibling, but I couldn't iterate through the whole page.

<h3>NBA Player Points</h3>
<br>

0089, Thu Jan 16 03:00:00 CET 2020, DEN/CHA-Murray J. (DEN)
<ul>
<li>Player Points  [Under : 1.0, Over : 1.0, OU : 0.0]</li>
<li>Player Points [Under : 1.0, Over : 1.0, OU : 0.0]</li>
<li>Player Points [Under : 1.85, Over : 1.85, OU : 18.5]</li>
<li>Player Points [Under : 1.0, Over : 1.0, OU : 0.0]</li>
<li>Index Rating [Under : 1.0, Over : 1.0, OU : 0.0]</li>
<li>Player Assists [Under : 1.0, Over : 1.0, OU : 0.0]</li>
<li>Player Rebounds [Under : 1.0, Over : 1.0, OU : 0.0]</li>
</ul>

0761, Thu Jan 16 03:00:00 CET 2020, DEN/CHA-Rozier T. (CHA)
<ul>
<li>Player Points  [Under : 1.0, Over : 1.0, OU : 0.0]</li>
<li>Player Points [Under : 1.0, Over : 1.0, OU : 0.0]</li>
<li>Player Points [Under : 1.75, Over : 1.95, OU : 18.5]</li>
<li>Player Points [Under : 1.0, Over : 1.0, OU : 0.0]</li>
<li>Index Rating [Under : 1.0, Over : 1.0, OU : 0.0]</li>
<li>Player Assists [Under : 1.0, Over : 1.0, OU : 0.0]</li>
<li>Player Rebounds [Under : 1.0, Over : 1.0, OU : 0.0]</li>
</ul>

1491, Thu Jan 16 03:00:00 CET 2020, DEN/CHA-Grant J. (DEN)
<ul>
<li>Player Points  [Under : 1.0, Over : 1.0, OU : 0.0]</li>
<li>Player Points [Under : 1.0, Over : 1.0, OU : 0.0]</li>
<li>Player Points [Under : 1.85, Over : 1.85, OU : 13.5]</li>
<li>Player Points [Under : 1.0, Over : 1.0, OU : 0.0]</li>
<li>Index Rating [Under : 1.0, Over : 1.0, OU : 0.0]</li>
<li>Player Assists [Under : 1.0, Over : 1.0, OU : 0.0]</li>
<li>Player Rebounds [Under : 1.0, Over : 1.0, OU : 0.0]</li>
</ul>

Here's what I'd like to get from this HTML:

Player: Murray J.

Team: DEN

Player Points: 18.5

Player: Rozier T.

Team: CHA

Player Points: 18.5

Player: Grant J.

Team: DEN

Player Points: 13.5

Any ideas?

Answer 1

Not the most elegant code, but it should get you there. The main string manipulation tool used here is the partition() method, which splits a string into three parts around the first occurrence of a separator: the text before it, the separator itself, and the text after it. Unnecessary characters are then removed from these parts using the strip() and replace() methods.
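To make the partition() step concrete, here is a small stand-alone sketch that applies it to one header line and one list item copied from the question. The variable names are only illustrative, and rstrip() and strip("()") are used as a shorter stand-in for the chained replace() calls.

```python
# One header line and one <li> text, copied from the sample HTML in the question.
header = "0089, Thu Jan 16 03:00:00 CET 2020, DEN/CHA-Murray J. (DEN)"
item = "Player Points [Under : 1.85, Over : 1.85, OU : 18.5]"

# partition() returns (before, separator, after) for the first match only.
_, _, player_and_team = header.partition("-")    # "Murray J. (DEN)"

# Split name from team at the ". " that ends the initial.
name, _, team = player_and_team.partition(". ")  # "Murray J", "(DEN)"
name += "."                                      # restore the dot partition() consumed
team = team.strip("()")                          # drop the surrounding parentheses

# Everything after ", OU : " is the value plus a closing bracket.
points = item.partition(", OU : ")[2].rstrip("]")

print(name, team, points)  # Murray J. DEN 18.5
```

The answer's code below applies the same calls to every <ul> on the page.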

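from bs4 import BeautifulSoup as bs

players = """[your html above]"""

soup = bs(players, 'lxml')
# Each player block is a <ul>; its header line is the text node just before it.
for ul in soup.select('ul'):
    # Keep what follows the first '-', e.g. "Murray J. (DEN)".
    dat = ul.previous_element.strip().partition('-')[2]
    print('Name:', dat.partition('. ')[0] + '.')
    print('Team:', dat.partition('. ')[2].replace('(', '').replace(')', ''))
    # In this sample the third <li> holds the real points line; take the value after "OU : ".
    print('Player Points:', ul.select('li')[2].text.partition(', OU : ')[2].replace(']', ''))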

Output:

Name: Murray J.
Team: DEN
Player Points: 18.5
Name: Rozier T.
Team: CHA
Player Points: 18.5
Name: Grant J.
Team: DEN
Player Points: 13.5
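
Since the question mentions inserting the values into a table later, it may be more convenient to collect each player as a dict instead of printing. The following is only a sketch under a few assumptions. It expects the header text and the <li> texts to have been pulled out with Beautiful Soup already. The function name parse_player and both regular expressions are hypothetical, written against the sample lines in the question. It also picks the first "Player Points" entry with a non-zero OU value rather than the fixed index 2 used in the answer, treating 0.0 as "no line offered".

```python
import re

# Hypothetical patterns, written against the sample lines in the question.
HEADER_RE = re.compile(r"-(?P<player>.+?)\s*\((?P<team>[A-Z]+)\)\s*$")
ITEM_RE = re.compile(r"^(?P<label>[^\[]+?)\s*\[.*OU\s*:\s*(?P<ou>[\d.]+)\]\s*$")


def parse_player(header, items):
    """Return a dict for one player, or None if the header does not match."""
    match = HEADER_RE.search(header.strip())
    if match is None:
        return None
    points = None
    for item in items:
        item_match = ITEM_RE.match(item.strip())
        if item_match and item_match["label"] == "Player Points":
            value = float(item_match["ou"])
            if value > 0:  # assumption: 0.0 means no line is offered
                points = value
                break
    return {"player": match["player"], "team": match["team"], "points": points}


# Usage with strings copied from the first block of the sample HTML.
sample_items = [
    "Player Points  [Under : 1.0, Over : 1.0, OU : 0.0]",
    "Player Points [Under : 1.0, Over : 1.0, OU : 0.0]",
    "Player Points [Under : 1.85, Over : 1.85, OU : 18.5]",
    "Index Rating [Under : 1.0, Over : 1.0, OU : 0.0]",
]
row = parse_player(
    "0089, Thu Jan 16 03:00:00 CET 2020, DEN/CHA-Murray J. (DEN)", sample_items
)
print(row)  # {'player': 'Murray J.', 'team': 'DEN', 'points': 18.5}
```

In the answer's loop, the header would come from the text node before each <ul> and the items from the text of its <li> children, so the resulting dicts can be appended to a list and written to a table in one go.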
