Checking the tags of children in Beautiful Soup 4 using Python

Question

I'm using BeautifulSoup 4 with Python to parse some HTML. Here is the code:

from bs4 import BeautifulSoup as bs
html_doc = '<p class="line-spacing-double" align="center">IN <i>THE </i><b>DISTRICT</b> COURT OF {county} COUNTY\nSTATE OF OKLAHOMA</p>'

soup = bs(html_doc, 'html.parser')
para = soup.p

for child in soup.p.children:
    print(child)

The result is:

IN
<i>THE </i>
<b>DISTRICT</b>
 COURT OF {county} COUNTY
STATE OF OKLAHOMA

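That all makes sense. What I want to do is iterate over the results and, if I find an <i> or <b>, do something different with it. When I try the following, it doesn't work: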

for child in soup.p.children:
    if child.findChildren('i'):
        print('italics found')

The error occurs because the first child returned is a string, and I'm trying to search it for child tags when BS4 already knows it has none.
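The distinction described here can be made explicit. As a minimal sketch (the shortened HTML string and the `kinds` list are illustrative and not part of the original code), each child of a tag is either a `Tag` or a `NavigableString`, and `isinstance` tells them apart:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Shortened, illustrative version of the paragraph from the question.
html_doc = '<p>IN <i>THE </i><b>DISTRICT</b> COURT</p>'
soup = BeautifulSoup(html_doc, 'html.parser')

kinds = []
for child in soup.p.children:
    if isinstance(child, Tag):
        # Real elements such as <i> or <b>; these can be searched.
        kinds.append(('tag', child.name))
    elif isinstance(child, NavigableString):
        # Plain text between elements; it has no child tags.
        kinds.append(('string', str(child)))

print(kinds)
```

`NavigableString` is a subclass of `str`, which is why a check such as `isinstance(child, str)` also identifies the text nodes.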

So I modified the code to check whether the child is a string and, if so, just print it without trying to do anything else with it.

for child in soup.p.children:
    if isinstance(child, str):
        print(child)
    elif child.findAll('i'):
        for tag in child.findAll('i'):
            print(tag)

The result of the latest code:

IN
 COURT OF {county} COUNTY
STATE OF OKLAHOMA

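As I iterate over the results I need to be able to check for tags among them, but I can't seem to figure out how. I thought this would be simple, but I'm stumped.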

Edit:

In response to jacalvo:

If I run

for child in soup.p.children:
    if child.find('i'):
        print(child)

it still doesn't print lines 2 and 3 from the HTML code (the <i> and <b> elements).

Edit:

for child in soup.p.children:
    if isinstance(child, str):
        print(child)
    else:
        print(child.findChildren('i', recursive=False))

This results in:

IN
[]
[]
 COURT OF {county} COUNTY
STATE OF OKLAHOMA
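The empty lists are consistent with how searching works in Beautiful Soup: `find_all` (and its older alias `findChildren`) looks below the tag it is called on and never matches that tag itself. A minimal sketch, using a shortened, illustrative version of the paragraph:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>IN <i>THE </i><b>DISTRICT</b> COURT</p>', 'html.parser')
i_tag = soup.p.i

# Searching from <i> looks only at what is inside <i>, which is just text.
inside_i = i_tag.find_all('i')
inside_i_direct = i_tag.find_all('i', recursive=False)

# Searching from <p> does find the <i>, because it is a child of <p>.
inside_p = soup.p.find_all('i', recursive=False)

print(len(inside_i), len(inside_i_direct), len(inside_p))  # 0 0 1

# To ask "is this child itself an <i>?", compare its name instead.
print(i_tag.name == 'i')  # True
```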

Answer 1

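Is this an example of the "something different" you're trying to do with the tags? Including a full sample of the desired output in the question would help: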

from bs4 import BeautifulSoup as bs

html_doc = '<p class="line-spacing-double" align="center">IN <i>THE</i> <b>DISTRICT</b> COURT OF {county} COUNTY\nSTATE OF OKLAHOMA</p>'
soup = bs(html_doc, 'html.parser')
para = soup.p

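for child in para.children:
    if child.name == 'i':
        # italic tag: wrap its text in single asterisks
        print(f'*{child.text}*', end='')
    elif child.name == 'b':
        # bold tag: wrap its text in double asterisks
        print(f'**{child.text}**', end='')
    else:
        # anything else (plain strings have name None) is printed unchanged
        print(child, end='')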

Output:

IN *THE* **DISTRICT** COURT OF {county} COUNTY
STATE OF OKLAHOMA
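If more tags need special handling, the `if`/`elif` chain could be replaced by a lookup table. The following is only a sketch of that variation: the `WRAPPERS` mapping and the `render` helper are hypothetical names introduced for illustration, and the HTML is shortened.

```python
from bs4 import BeautifulSoup

# Hypothetical mapping from tag name to a Markdown-style wrapper.
WRAPPERS = {'i': '*', 'b': '**'}

def render(paragraph):
    """Return the paragraph's direct children as one string, wrapping known tags."""
    parts = []
    for child in paragraph.children:
        # child.name is None for plain strings, so .get() falls back to ''.
        marker = WRAPPERS.get(child.name, '')
        # Known tags contribute their text; everything else is kept as-is.
        text = child.get_text() if marker else str(child)
        parts.append(f'{marker}{text}{marker}')
    return ''.join(parts)

html_doc = '<p>IN <i>THE</i> <b>DISTRICT</b> COURT</p>'
soup = BeautifulSoup(html_doc, 'html.parser')
result = render(soup.p)
print(result)  # IN *THE* **DISTRICT** COURT
```

Supporting another tag then means adding one entry to the mapping rather than another branch.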

Answer 2

    from bs4 import BeautifulSoup as bs

    html_doc = '<p class="line-spacing-double" align="center">IN <i>THE </i><b>DISTRICT</b> COURT OF {county} ' \
               'COUNTY\nSTATE OF OKLAHOMA</p> '

    soup = bs(html_doc, 'html.parser')
    paragraph = soup.p

    # collect the name of every tag in the document
    tags = [tag.name for tag in soup.find_all()]

    for child in paragraph.children:
        if child.name in tags:
            print('{0}'.format(child))  # or child.text
        else:
            print(child)

Output:

    IN 
    <i>THE </i>
    <b>DISTRICT</b>
     COURT OF {county} COUNTY
    STATE OF OKLAHOMA
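Because the `tags` list holds the name of every tag in the document, `child.name in tags` is effectively a test for "this child is a tag". A shorter sketch of the same test, relying on `name` being `None` for text nodes (the shortened HTML and the variable names are illustrative):

```python
from bs4 import BeautifulSoup

html_doc = '<p>IN <i>THE </i><b>DISTRICT</b> COURT</p>'
soup = BeautifulSoup(html_doc, 'html.parser')

# Tags have a name; plain strings report None.
tag_children = [child for child in soup.p.children if child.name is not None]
text_children = [str(child) for child in soup.p.children if child.name is None]

print([tag.name for tag in tag_children])  # ['i', 'b']
print(text_children)                       # ['IN ', ' COURT']
```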

Answer 3

Use findChildren() and then check each child's name with an if condition.

from bs4 import BeautifulSoup as bs
html_doc = '<p class="line-spacing-double" align="center">IN <i>THE </i><b>DISTRICT</b> COURT OF {county} COUNTY\nSTATE OF OKLAHOMA</p>'

soup = bs(html_doc, 'html.parser')

for child in soup.find('p').findChildren(recursive=False):
    if child.name == 'i':
        print(child)
    if child.name == 'b':
        print(child)

Output:

<i>THE </i>
<b>DISTRICT</b>
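The name check can also be folded into the search itself, since `find_all` accepts a list of tag names. A sketch under that assumption, with an extra nested `<i>` added to the illustrative HTML to show what `recursive=False` leaves out:

```python
from bs4 import BeautifulSoup

# Illustrative HTML: the <i> inside <span> is not a direct child of <p>.
html_doc = '<p>IN <i>THE </i><b>DISTRICT</b> COURT <span><i>nested</i></span></p>'
soup = BeautifulSoup(html_doc, 'html.parser')

# Match <i> or <b>, but only among the direct children of <p>.
direct = soup.p.find_all(['i', 'b'], recursive=False)
direct_markup = [str(tag) for tag in direct]

print(direct_markup)  # ['<i>THE </i>', '<b>DISTRICT</b>']
```

Like the loop in this answer, this skips the text nodes entirely, so it fits when only the tags matter.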
