BeautifulSoup: How do I get the non-comment content of a class that contains comments?

Question

I'm trying to get data from a webpage using BeautifulSoup. It works fine for most of the data, but one class seems to behave differently and I can't figure out what to do. Are the comments maybe affecting soup.find_all?

So I have a webpage with several elements that share the same class name, and I'm finding their contents with soup.find_all. This works for the class "points column", which always looks like this:

<div class="points column">Punkte</div>
<div class="points column">45.677</div>
<div class="points column">43.445</div>
...

It doesn't work for the class "teamValue column", which looks like this:

<div class="teamValue column">Teamwert</div>
<div class="teamValue column">
<!-- react-text: 690 -->
554,4
<!-- /react-text -->
<!-- react-text: 691 -->
 €
<!-- /react-text -->
</div>
<div class="teamValue column">
<!-- react-text: 705 -->
449,7
<!-- /react-text -->
<!-- react-text: 706 -->
 €
<!-- /react-text -->
</div>
...

This is my code:

def getplayerdata(self):
    bot = self.bot
    soup = BeautifulSoup(bot.page_source, 'html.parser')

    playervalue = soup.find_all("div",class_="teamValue column",text=True)
    playerpoints = soup.find_all("div",class_="points column",text=True)

    print(playervalue)
    print(playerpoints)

The output for playerpoints works as expected: I get all the data and can extract just the text with .string.

But for playervalue I only get one element in my list, which is:

[<div class="teamValue column">Teamwert</div>]

Answer 1

I can get this text if I use find_all() without text=True, and then .get_text() or .text instead of .string.

from bs4 import BeautifulSoup as BS

text = '''<div class="teamValue column">Teamwert</div>
<div class="teamValue column">
<!-- react-text: 690 -->
554,4
<!-- /react-text -->
<!-- react-text: 691 -->
 €
<!-- /react-text -->
</div>
<div class="teamValue column">
<!-- react-text: 705 -->
449,7
<!-- /react-text -->
<!-- react-text: 706 -->
 €
<!-- /react-text -->
</div>'''

soup = BS(text, 'html.parser')

all_items = soup.find_all('div', class_="teamValue column")  # note: no text=True


for item in all_items:
    print('1>', item.text)

for item in all_items:
    print('2>', item.get_text(strip=True, separator=' '))

for item in all_items:
    print('3>', item.string)

Result:

1> Teamwert
1> 

554,4


 €


1> 

449,7


 €


2> Teamwert
2> 554,4 €
2> 449,7 €
3> Teamwert
3> None
3> None
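
The None values in the third loop follow from how .string is documented to work: it only returns a value when a tag has exactly one child, and each of these divs has several (text nodes plus the react-text comments). As far as I can tell, that is also why text=True drops those divs. The sketch below is only an illustration of that reasoning, not part of the original answer; the variable names are made up, and it filters out Comment nodes by hand to show what get_text() does for you.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

html = '''<div class="teamValue column">
<!-- react-text: 690 -->
554,4
<!-- /react-text -->
<!-- react-text: 691 -->
 €
<!-- /react-text -->
</div>'''

soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', class_='teamValue column')

# The div has several children (text nodes and comments), so .string is None
print(div.string)

# Keep only real text nodes: skip Comment objects and whitespace-only strings
parts = [
    node.strip()
    for node in div.descendants
    if isinstance(node, NavigableString)
    and not isinstance(node, Comment)
    and node.strip()
]
value = ' '.join(parts)
print(value)
```

In practice item.get_text(strip=True, separator=' ') gives the same string with less code; the manual filter is mainly useful if you need to treat individual text nodes differently.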

Answer 2

Just change it to text=False :)

playervalue = soup.find_all("div",class_="teamValue column",text=False)
print(len(playervalue))

Out:

Answer 3

You could use soup.select and re.sub to get rid of the newlines.

from bs4 import BeautifulSoup
import re

html = '''
<div class="teamValue column">Teamwert</div>
<div class="teamValue column">
<!-- react-text: 690 -->
554,4
<!-- /react-text -->
<!-- react-text: 691 -->
 €
<!-- /react-text -->
</div>
<div class="teamValue column">
<!-- react-text: 705 -->
449,7
<!-- /react-text -->
<!-- react-text: 706 -->
 €
<!-- /react-text -->
</div>'''

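# Use the imported name; the 'lxml' parser requires the lxml package
soup = BeautifulSoup(html, 'lxml')
# Strip the newlines left between the comment nodes
team_values = [re.sub(r'\n+', '', item.text) for item in soup.select('.teamValue.column')]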
print(team_values)
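
A regex-free variant of the same idea, offered as a sketch rather than as a replacement: Tag.stripped_strings yields each text node with surrounding whitespace removed and skips whitespace-only nodes and comments, so joining its items gives the cleaned value directly. The choice of 'html.parser' here is my own assumption, made only to avoid the extra lxml dependency.

```python
from bs4 import BeautifulSoup

html = '''
<div class="teamValue column">Teamwert</div>
<div class="teamValue column">
<!-- react-text: 690 -->
554,4
<!-- /react-text -->
<!-- react-text: 691 -->
 €
<!-- /react-text -->
</div>
<div class="teamValue column">
<!-- react-text: 705 -->
449,7
<!-- /react-text -->
<!-- react-text: 706 -->
 €
<!-- /react-text -->
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# Join the whitespace-stripped text nodes of every matching div
team_values = [
    ' '.join(item.stripped_strings)
    for item in soup.select('div.teamValue.column')
]
print(team_values)
```

The first entry is the column header ("Teamwert"), so slice it off with team_values[1:] if only the data rows are wanted.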
