BeautifulSoup: How to get the non-comment content of a class with comments?
I'm trying to get data from a webpage using BeautifulSoup. It works fine for most of the data, but one class seems to behave differently and I can't figure out what to do. Are comments maybe affecting soup.find_all?
So I have a webpage with several elements that share the same class name, and I'm finding their contents with soup.find_all. This works for the class "points column", which always looks like this:
<div class="points column">Punkte</div>
<div class="points column">45.677</div>
<div class="points column">43.445</div>
...
It doesn't work for the class "teamValue column", which looks like this:
<div class="teamValue column">Teamwert</div>
<div class="teamValue column">
<!-- react-text: 690 -->
554,4
<!-- /react-text -->
<!-- react-text: 691 -->
€
<!-- /react-text -->
</div>
<div class="teamValue column">
<!-- react-text: 705 -->
449,7
<!-- /react-text -->
<!-- react-text: 706 -->
€
<!-- /react-text -->
</div>
...
This is my code:
def getplayerdata(self):
    bot = self.bot
    soup = BeautifulSoup(bot.page_source, 'html.parser')
    playervalue = soup.find_all("div", class_="teamValue column", text=True)
    playerpoints = soup.find_all("div", class_="points column", text=True)
    print(playervalue)
    print(playerpoints)
The output for playerpoints works as expected: I get all the data and can extract just the text with the .string attribute.
But for playervalue I only get one element in my list, which is:
[<div class="teamValue column">Teamwert</div>]
I can get this text if I use find_all() without text=True, and .get_text() or .text instead of .string. A div that has several children (here, comments plus text nodes) has no single .string, so .string is None for it and text=True filters that div out.
from bs4 import BeautifulSoup as BS
text = '''<div class="teamValue column">Teamwert</div>
<div class="teamValue column">
<!-- react-text: 690 -->
554,4
<!-- /react-text -->
<!-- react-text: 691 -->
€
<!-- /react-text -->
</div>
<div class="teamValue column">
<!-- react-text: 705 -->
449,7
<!-- /react-text -->
<!-- react-text: 706 -->
€
<!-- /react-text -->
</div>'''
soup = BS(text, 'html.parser')
all_items = soup.find_all('div', class_="teamValue column")  # note: no text=True
for item in all_items:
    print('1>', item.text)
for item in all_items:
    print('2>', item.get_text(strip=True, separator=' '))
for item in all_items:
    print('3>', item.string)
Result:
1> Teamwert
1>
554,4
€
1>
449,7
€
2> Teamwert
2> 554,4 €
2> 449,7 €
3> Teamwert
3> None
3> None
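The None values in the third loop can be traced back to the comment nodes. The sketch below is only an illustration under assumptions: it uses a shortened copy of the sample HTML, and non_comment_text is a hypothetical helper name, not part of BeautifulSoup. It walks the descendants of each div and keeps every text node that is not a Comment.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

html = '''<div class="teamValue column">Teamwert</div>
<div class="teamValue column">
<!-- react-text: 690 -->
554,4
<!-- /react-text -->
<!-- react-text: 691 -->
€
<!-- /react-text -->
</div>'''

soup = BeautifulSoup(html, 'html.parser')
divs = soup.find_all('div', class_='teamValue column')


def non_comment_text(tag):
    """Hypothetical helper: join the text nodes under `tag`, skipping comments."""
    parts = []
    for node in tag.descendants:
        # Comment is a subclass of NavigableString, so exclude it explicitly.
        if isinstance(node, NavigableString) and not isinstance(node, Comment):
            stripped = node.strip()
            if stripped:
                parts.append(stripped)
    return ' '.join(parts)


for div in divs:
    # .string is only set when the tag has exactly one child.
    print(repr(div.string), '->', non_comment_text(div))
```

For the second div, .string is None because the div holds several children, while the helper still returns the visible text.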
Just change it to text=False :)
playervalue = soup.find_all("div",class_="teamValue column",text=False)
print(len(playervalue))
Out:
3
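To see why the filter makes a difference, here is a minimal sketch on a shortened copy of the sample HTML (so it finds two divs rather than three). It assumes a BeautifulSoup version that accepts string=, the newer name for the text= argument, and the variable names are made up for illustration. As far as I understand it, when a tag name is given together with string=True, only tags whose .string is not None are kept.

```python
from bs4 import BeautifulSoup

html = '''<div class="teamValue column">Teamwert</div>
<div class="teamValue column">
<!-- react-text: 690 -->
554,4
<!-- /react-text -->
<!-- react-text: 691 -->
€
<!-- /react-text -->
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# With the filter: only tags that have a single string child survive.
with_filter = soup.find_all('div', class_='teamValue column', string=True)

# Without the filter: every div with that class attribute is returned.
without_filter = soup.find_all('div', class_='teamValue column')

# The same selection done by hand, to make the rule visible.
by_hand = [tag for tag in without_filter if tag.string is not None]

print(len(with_filter), len(without_filter), len(by_hand))
```

Dropping the filter, or passing False, returns all the divs; the text then has to be read with .get_text() rather than .string.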
You could use soup.select and re.sub to get rid of the newlines:
from bs4 import BeautifulSoup
import re
html = '''
<div class="teamValue column">Teamwert</div>
<div class="teamValue column">
<!-- react-text: 690 -->
554,4
<!-- /react-text -->
<!-- react-text: 691 -->
€
<!-- /react-text -->
</div>
<div class="teamValue column">
<!-- react-text: 705 -->
449,7
<!-- /react-text -->
<!-- react-text: 706 -->
€
<!-- /react-text -->
</div>'''
soup = BeautifulSoup(html, 'lxml')  # use the imported name; requires lxml to be installed
team_values = [re.sub('\n+', '',item.text) for item in soup.select('.teamValue.column')]
print(team_values)
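A variant of the same idea, sketched under a few assumptions: it uses the built-in html.parser so lxml is not needed, lets get_text(strip=True) remove the whitespace in place of a regex, and then converts the values to numbers. The helper parse_value is hypothetical, and it assumes the numbers use a comma as the decimal separator, as in the sample.

```python
from bs4 import BeautifulSoup

html = '''<div class="teamValue column">Teamwert</div>
<div class="teamValue column">
<!-- react-text: 690 -->
554,4
<!-- /react-text -->
<!-- react-text: 691 -->
€
<!-- /react-text -->
</div>
<div class="teamValue column">
<!-- react-text: 705 -->
449,7
<!-- /react-text -->
<!-- react-text: 706 -->
€
<!-- /react-text -->
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# get_text() skips comment nodes; strip=True trims each text piece.
team_values = [
    item.get_text(strip=True, separator=' ')
    for item in soup.select('.teamValue.column')
]
print(team_values)  # ['Teamwert', '554,4 €', '449,7 €']


def parse_value(text):
    """Hypothetical helper: '554,4 €' -> 554.4 (assumes a comma decimal separator)."""
    number = text.split()[0]
    return float(number.replace(',', '.'))


# The first entry is the column header, so it is skipped here.
amounts = [parse_value(text) for text in team_values[1:]]
print(amounts)  # [554.4, 449.7]
```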