
How to use Beautiful Soup 4 to find attribute

I'm trying to parse HTML like the following:

<tbody>
    <tr class data-row="0">
        <td align="right"></td>
    </tr>
    <tr class data-row="1">
        <td align="right"></td>
    </tr>
    <tr class="thead over_theader" data-row="2">
        <td align="right"></td>
    </tr>
    <tr class="thead" data-row="3">
        <td align="right"></td>
    </tr>
    <tr class data-row="4">
        <td align="right"></td>
    </tr>
    <tr class data-row="5">
        <td align="right"></td>
    </tr>
</tbody>

I want to obtain all tr tags (and their children) where class is not specified. For the example above, that means I want the tr tags where data-row is not 2 or 3.

How do I do this using Beautiful Soup 4?

I tried

tableBody = soup.findAll('tbody')
rows = tableBody[0].findAll(attrs={"class":""})

but this returned a bs4.element.ResultSet of length 8 (i.e. it also included the td children of the tr tags), when I wanted a bs4.element.ResultSet of length 4 (one for each tr tag with class="").

Your method actually works for me when I specify the tr tag name:

>>> from bs4 import BeautifulSoup
>>> data = """
... <tbody>
...     <tr class data-row="0">
...         <td align="right"></td>
...     </tr>
...     <tr class data-row="1">
...         <td align="right"></td>
...     </tr>
...     <tr class="thead over_theader" data-row="2">
...         <td align="right"></td>
...     </tr>
...     <tr class="thead" data-row="3">
...         <td align="right"></td>
...     </tr>
...     <tr class data-row="4">
...         <td align="right"></td>
...     </tr>
...     <tr class data-row="5">
...         <td align="right"></td>
...     </tr>
... </tbody>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> len(soup.find_all("tr", class_=""))
4

Alternatively, you can use a tr[class=""] CSS selector:

>>> len(soup.select('tr[class=""]'))
4
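If "class is not specified" should also cover rows that have no class attribute at all, a function filter makes the condition explicit. The following is only a sketch: the helper name is_unclassed_row and the shortened sample markup (including a tr with no class attribute) are assumptions made for illustration, not part of the question.

```python
from bs4 import BeautifulSoup

# Shortened sample: one empty class, two real classes, one row with no class at all.
html = """
<tbody>
    <tr class data-row="0"><td align="right"></td></tr>
    <tr class="thead over_theader" data-row="2"><td align="right"></td></tr>
    <tr class="thead" data-row="3"><td align="right"></td></tr>
    <tr data-row="4"><td align="right"></td></tr>
</tbody>
"""


def is_unclassed_row(tag):
    """Hypothetical helper: True for <tr> tags whose class is missing or empty."""
    if tag.name != "tr":
        return False
    classes = tag.get("class") or []  # get() returns None when the attribute is absent
    if isinstance(classes, str):
        classes = classes.split()
    # An empty list (or a list of empty strings) means no usable class name.
    return not any(classes)


soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all(is_unclassed_row)
data_rows = [row["data-row"] for row in rows]
print(data_rows)  # ['0', '4']
```

Because the function checks the tag name itself, the td tags are never returned, regardless of how the class filter treats missing attributes.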

By default, find_all searches recursively, so the td tags nested inside each tr are examined too and count as valid matches.

From the docs:

If you call mytag.find_all(), Beautiful Soup will examine all the descendants of mytag: its children, its children's children, and so on. If you only want Beautiful Soup to consider direct children, you can pass in recursive=False.

So you might write, for example:

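# soup is the BeautifulSoup object built from the HTML in the question
table_body = soup.find('tbody')
# recursive=False limits the search to direct children of tbody (the tr tags)
rows = table_body.find_all(attrs={"class": ""}, recursive=False)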

print(len(rows))
for r in rows:
    print('---')
    print(r)

Output:

4
---
<tr class="" data-row="0">
<td align="right"></td>
</tr>
---
<tr class="" data-row="1">
<td align="right"></td>
</tr>
---
<tr class="" data-row="4">
<td align="right"></td>
</tr>
---
<tr class="" data-row="5">
<td align="right"></td>
</tr>
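To see the effect of recursive=False in isolation, it may help to drop the class filter and compare the two search depths directly. This is a minimal sketch on a shortened version of the sample markup; the variable names are chosen here for illustration only.

```python
from bs4 import BeautifulSoup

html = """
<tbody>
    <tr class data-row="0"><td align="right"></td></tr>
    <tr class="thead" data-row="3"><td align="right"></td></tr>
    <tr class data-row="4"><td align="right"></td></tr>
</tbody>
"""

soup = BeautifulSoup(html, "html.parser")
tbody = soup.find("tbody")

# Passing True matches every tag, so the two calls differ only in search depth.
all_descendants = tbody.find_all(True)                   # tr tags and their td children
direct_children = tbody.find_all(True, recursive=False)  # only the tr tags

descendant_names = [tag.name for tag in all_descendants]
child_names = [tag.name for tag in direct_children]
print(descendant_names)  # ['tr', 'td', 'tr', 'td', 'tr', 'td']
print(child_names)       # ['tr', 'tr', 'tr']
```

The recursive search returns each td alongside its parent tr, which is the doubling seen in the question; restricting the search to direct children leaves only the rows.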
