How to use Beautiful Soup 4 to find attribute
I'm trying to parse HTML like the following:
<tbody>
<tr class data-row="0">
<td align="right"></td>
</tr>
<tr class data-row="1">
<td align="right"></td>
</tr>
<tr class="thead over_theader" data-row="2">
<td align="right"></td>
</tr>
<tr class="thead" data-row="3">
<td align="right"></td>
</tr>
<tr class data-row="4">
<td align="right"></td>
</tr>
<tr class data-row="5">
<td align="right"></td>
</tr>
</tbody>
I want to obtain all tr tags (and their children) where class is not specified. For the example above, that means I want the tr tags where data-row is not 2 or 3.
How do I do this using Beautiful Soup 4?
I tried:
tableBody = soup.findAll('tbody')
rows = tableBody[0].findAll(attrs={"class":""})
but this returned a bs4.element.ResultSet of length 8 (i.e. it also included the td children of the tr tags) when I wanted a bs4.element.ResultSet of length 4 (one for each tr tag with class = "").
Your method actually works for me when I specify the tr tag name:
>>> from bs4 import BeautifulSoup
>>> data = """
... <tbody>
... <tr class data-row="0">
... <td align="right"></td>
... </tr>
... <tr class data-row="1">
... <td align="right"></td>
... </tr>
... <tr class="thead over_theader" data-row="2">
... <td align="right"></td>
... </tr>
... <tr class="thead" data-row="3">
... <td align="right"></td>
... </tr>
... <tr class data-row="4">
... <td align="right"></td>
... </tr>
... <tr class data-row="5">
... <td align="right"></td>
... </tr>
... </tbody>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> len(soup.find_all("tr", class_=""))
4
Alternatively, you can use a tr[class=""] CSS selector:
>>> len(soup.select('tr[class=""]'))
4
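If you would rather spell the condition out than rely on how an empty string is matched, a function filter is another option. This is only a minimal sketch on a shortened copy of the markup above; the helper name tr_without_class is made up for illustration and is not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question.
data = """
<tbody>
<tr class data-row="0"><td align="right"></td></tr>
<tr class="thead over_theader" data-row="2"><td align="right"></td></tr>
<tr class="thead" data-row="3"><td align="right"></td></tr>
<tr class data-row="4"><td align="right"></td></tr>
</tbody>
"""


def tr_without_class(tag):
    """Hypothetical helper: True for <tr> tags whose class is missing or empty."""
    if tag.name != "tr":
        return False
    classes = tag.get("class") or []
    if isinstance(classes, str):  # some parsers keep class as a single string
        classes = classes.split()
    # An empty list, or a list holding only empty strings, counts as "no class".
    return not any(classes)


soup = BeautifulSoup(data, "html.parser")
rows = soup.find_all(tr_without_class)
row_ids = [row["data-row"] for row in rows]
print(row_ids)
```

Because the function checks the tag name itself, the nested td tags are never returned, and each matched row still carries its children.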
find_all will, by default, search recursively. So the td tags are valid matches.
If you call mytag.find_all(), Beautiful Soup will examine all the descendants of mytag: its children, its children's children, and so on. If you only want Beautiful Soup to consider direct children, you can pass in recursive=False.
So you might write, for example:
tableBody = soup.find_all('tbody')
# recursive=False limits the search to direct children of <tbody>
rows = tableBody[0].find_all(attrs={"class":""}, recursive=False)
print(len(rows))
for r in rows:
    print('---')
    print(r)
Output:
4
---
<tr class="" data-row="0">
<td align="right"></td>
</tr>
---
<tr class="" data-row="1">
<td align="right"></td>
</tr>
---
<tr class="" data-row="4">
<td align="right"></td>
</tr>
---
<tr class="" data-row="5">
<td align="right"></td>
</tr>
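
To see the effect of recursive=False on its own, here is a small sketch on a cut-down version of the table (two rows only, chosen just for illustration). It searches by tag name, so it does not depend on how the class attribute is matched.

```python
from bs4 import BeautifulSoup

# Cut-down markup: two rows, each with one cell.
html = """
<tbody>
<tr class data-row="0"><td align="right"></td></tr>
<tr class="thead" data-row="3"><td align="right"></td></tr>
</tbody>
"""

soup = BeautifulSoup(html, "html.parser")
tbody = soup.find("tbody")

# Default: every descendant of <tbody> is examined, so nested <td> tags are found.
all_cells = tbody.find_all("td")

# recursive=False: only direct children of <tbody> are examined.
direct_cells = tbody.find_all("td", recursive=False)
direct_rows = tbody.find_all("tr", recursive=False)

print(len(all_cells), len(direct_cells), len(direct_rows))  # 2 0 2
```

The td tags sit one level below the tr tags, so they only show up in the default, recursive search.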