BeautifulSoup + Python (extracting specific HTML tags from page source)

Question

I have the following HTML code:

<h3>Some Heading Text Here 1</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<ul>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
    <li>Item 4</li>
</ul>
<h3>Some Heading Text Here 2</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<h4>Q1</h4>
<p>A1</p>
<h4>Q2</h4>
<p>A2</p>
<h4>Q3</h4>
<p>A3</p>

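I want to extract the HTML starting at the first <h3>, together with the elements that follow it, up to (but not including) the first occurrence of an <h4> tag.

Expected output: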

<h3>Some Heading Text Here 1</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<ul>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
    <li>Item 4</li>
</ul>
<h3>Some Heading Text Here 2</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>

Answer 1

I tried the following approach, and it produced the result below:

from bs4 import BeautifulSoup

data = """<h3>Some Heading Text Here 1</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<ul>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
    <li>Item 4</li>
</ul>
<h3>Some Heading Text Here 2</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<h4>Q1</h4>
<p>A1</p>
<h4>Q2</h4>
<p>A2</p>
<h4>Q3</h4>
<p>A3</p>"""

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(data, "html.parser")

# Start from the first <h3> only. Looping over every <h3> would emit
# the second heading and its paragraphs twice.
first_h3 = soup.find("h3")

text = str(first_h3)
for sibling in first_h3.next_siblings:
    # Whitespace between tags is a text node with no tag name, so only a real <h4> stops the loop.
    if getattr(sibling, "name", None) == "h4":
        break
    text += str(sibling)
print(text)

Output:

<h3>Some Heading Text Here 1</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<ul>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
    <li>Item 4</li>
</ul>
<h3>Some Heading Text Here 2</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
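
The same sibling walk can also be wrapped in a small reusable function. The following is only a sketch, not part of the original answer: the function name extract_until, its parameters, and the shortened sample string are all made up for illustration, and it assumes the start and stop tags are siblings under the same parent, as they are in the question's HTML.

```python
from itertools import takewhile

from bs4 import BeautifulSoup


def extract_until(html, start_tag, stop_tag):
    """Return markup from the first start_tag up to, but not including,
    the first following sibling whose tag name is stop_tag.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    start = soup.find(start_tag)
    if start is None:
        # No starting tag in the document, so there is nothing to extract.
        return ""
    # getattr falls back to None for nodes without a tag name (plain text),
    # so text nodes are always kept and only a real stop_tag ends the walk.
    kept = takewhile(
        lambda node: getattr(node, "name", None) != stop_tag,
        start.next_siblings,
    )
    return str(start) + "".join(str(node) for node in kept)


# Shortened sample input (an assumption, not the question's full HTML).
sample = "<h3>A</h3><p>one</p><h3>B</h3><p>two</p><h4>Q1</h4><p>A1</p>"
print(extract_until(sample, "h3", "h4"))
```

Because takewhile stops at the first node that fails the test, nothing after the first <h4> is visited, and the second <h3> is picked up once as an ordinary sibling of the first.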

Solution 1
0 2021-04-13 04:12:20
