How to use Beautiful Soup 4 to find attributes

Question

I'm trying to parse HTML like the following:

<tbody>
    <tr class data-row="0">
        <td align="right"></td>
    </tr>
    <tr class data-row="1">
        <td align="right"></td>
    </tr>
    <tr class="thead over_theader" data-row="2">
        <td align="right"></td>
    </tr>
    <tr class="thead" data-row="3">
        <td align="right"></td>
    </tr>
    <tr class data-row="4">
        <td align="right"></td>
    </tr>
    <tr class data-row="5">
        <td align="right"></td>
    </tr>
</tbody>

I want to obtain all tr tags (and their children) where class is not specified. For the example above, that means I want the tr tags where data-row is not 2 or 3.

How do I do this using Beautiful Soup 4?

I tried:

tableBody = soup.findAll('tbody')
rows = tableBody[0].findAll(attrs={"class":""})

but this returned a bs4.element.ResultSet of length 8 (i.e. it also included the td children of the tr tags) when I wanted a bs4.element.ResultSet of length 4 (one for each tr tag with class="").

Answer 1

Your method actually works for me when I specify the tr tag name:

>>> from bs4 import BeautifulSoup
>>> data = """
... <tbody>
...     <tr class data-row="0">
...         <td align="right"></td>
...     </tr>
...     <tr class data-row="1">
...         <td align="right"></td>
...     </tr>
...     <tr class="thead over_theader" data-row="2">
...         <td align="right"></td>
...     </tr>
...     <tr class="thead" data-row="3">
...         <td align="right"></td>
...     </tr>
...     <tr class data-row="4">
...         <td align="right"></td>
...     </tr>
...     <tr class data-row="5">
...         <td align="right"></td>
...     </tr>
... </tbody>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> len(soup.find_all("tr", class_=""))
4

Alternatively, you can use a tr[class=""] CSS selector:

>>> len(soup.select('tr[class=""]'))
4
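
If the exact stored form of an empty class matters, a function filter makes the test explicit. The following is a minimal sketch, not part of the original answer: the helper name has_empty_class and the shortened markup are made up for illustration, and the helper assumes the parser stores a valueless class attribute as an empty string or a list of empty strings. Note that it also accepts tr tags that have no class attribute at all.

```python
from bs4 import BeautifulSoup

html = """
<tbody>
    <tr class data-row="0"><td align="right"></td></tr>
    <tr class="thead over_theader" data-row="2"><td align="right"></td></tr>
    <tr class="thead" data-row="3"><td align="right"></td></tr>
    <tr class data-row="4"><td align="right"></td></tr>
</tbody>
"""


def has_empty_class(tag):
    """Hypothetical helper: True for <tr> tags without any non-empty class name."""
    if tag.name != "tr":
        return False
    classes = tag.get("class") or []
    # Depending on the bs4 version, an empty class may be "", [] or [""].
    return not any(classes)


soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all(has_empty_class)
data_rows = [row["data-row"] for row in rows]
print(data_rows)  # ['0', '4']
```

Because the filter checks tag.name itself, the td children are never returned, whatever their attributes are.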

Answer 2

find_all will, by default, search recursively. So the td tags, which have no class at all, are valid matches for your filter too.

Docs:

If you call mytag.find_all(), Beautiful Soup will examine all the descendants of mytag: its children, its children's children, and so on. If you only want Beautiful Soup to consider direct children, you can pass in recursive=False.

So you might write, for example:

# Only the direct children of the first <tbody> are examined.
tableBody = soup.find_all('tbody')
rows = tableBody[0].find_all(attrs={"class": ""}, recursive=False)

print(len(rows))
for r in rows:
    print('---')
    print(r)

Output:

4
---
<tr class="" data-row="0">
<td align="right"></td>
</tr>
---
<tr class="" data-row="1">
<td align="right"></td>
</tr>
---
<tr class="" data-row="4">
<td align="right"></td>
</tr>
---
<tr class="" data-row="5">
<td align="right"></td>
</tr>
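
To see the effect of recursive on its own, independent of the class filter, the sketch below lists every tag found under tbody with and without it. This is only an illustration: the shortened markup and the variable names are assumptions made for the example, and find_all(True) is used as a filter that matches any tag.

```python
from bs4 import BeautifulSoup

html = """
<tbody>
    <tr class data-row="0"><td align="right"></td></tr>
    <tr class="thead" data-row="3"><td align="right"></td></tr>
</tbody>
"""

soup = BeautifulSoup(html, "html.parser")
tbody = soup.find("tbody")

# Default (recursive=True): every descendant tag is considered.
all_descendants = [tag.name for tag in tbody.find_all(True)]

# recursive=False: only the direct children of <tbody> are considered.
direct_children = [tag.name for tag in tbody.find_all(True, recursive=False)]

print(all_descendants)  # ['tr', 'td', 'tr', 'td']
print(direct_children)  # ['tr', 'tr']
```

The recursive search returns twice as many tags here, which mirrors the 8 versus 4 results in the question.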
