BeautifulSoup + Python (extracting specific HTML tags from page source)

Question

I have the following HTML:

<h3>Some Heading Text Here 1</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<ul>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
    <li>Item 4</li>
</ul>
<h3>Some Heading Text Here 2</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<h4>Q1</h4>
<p>A1</p>
<h4>Q2</h4>
<p>A2</p>
<h4>Q3</h4>
<p>A3</p>

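I want to extract the HTML starting at the first <h3>, together with the elements that follow it, up to (but not including) the first <h4> tag.

Expected output: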

<h3>Some Heading Text Here 1</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<ul>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
    <li>Item 4</li>
</ul>
<h3>Some Heading Text Here 2</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>

Answer 1

I tried the following approach, with the result shown below:

from bs4 import BeautifulSoup

data = """<h3>Some Heading Text Here 1</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<ul>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
    <li>Item 4</li>
</ul>
<h3>Some Heading Text Here 2</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<h4>Q1</h4>
<p>A1</p>
<h4>Q2</h4>
<p>A2</p>
<h4>Q3</h4>
<p>A3</p>"""

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(data, "html.parser")

# Start from the first <h3> only. Looping over every <h3> would emit the
# second heading and its paragraphs twice.
first_h3 = soup.find("h3")

parts = [str(first_h3)]
for sibling in first_h3.next_siblings:
    # Stop at the first <h4>. Text nodes have name None, so they pass through.
    if sibling.name == "h4":
        break
    parts.append(str(sibling))

text = "".join(parts)
print(text)

Output:

<h3>Some Heading Text Here 1</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<ul>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
    <li>Item 4</li>
</ul>
<h3>Some Heading Text Here 2</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
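
The sibling walk can also be wrapped in a small reusable helper. This is only a sketch and not part of the original answer: the name `extract_until` and its parameters are hypothetical, and it assumes the start and stop tags are siblings under the same parent, as in the sample HTML.

```python
from itertools import takewhile

from bs4 import BeautifulSoup


def extract_until(html, start_tag, stop_tag):
    """Return the first <start_tag> plus its following siblings, up to <stop_tag>.

    Hypothetical helper for illustration; assumes both tags share a parent.
    """
    soup = BeautifulSoup(html, "html.parser")
    start = soup.find(start_tag)
    if start is None:
        return ""
    # takewhile stops at the first sibling whose tag name equals stop_tag.
    # Text nodes (whitespace between tags) have name None, so they are kept.
    kept = takewhile(lambda node: node.name != stop_tag, start.next_siblings)
    return str(start) + "".join(str(node) for node in kept)


sample = "<h3>Title</h3><p>Body</p><h3>Other</h3><p>More</p><h4>Q1</h4><p>A1</p>"
result = extract_until(sample, "h3", "h4")
print(result)  # <h3>Title</h3><p>Body</p><h3>Other</h3><p>More</p>
```

If the stop tag can be nested more deeply than the start tag, a sibling walk will not see it, and a different traversal (for example over `next_elements`) would be needed.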

Solution 1
0 2021-04-13 04:12:20

BeautifulSoup + Python（从页面源代码中提取特定的 HTML 标签）

问题描述

1 个解决方案

解决方案1 0 2021-04-13 04:12:20

解决方案1
0 2021-04-13 04:12:20