
BeautifulSoup + Python (Extract Specific HTML Tags from Page Source Code)

I have the following HTML code:

<h3>Some Heading Text Here 1</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<ul>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
    <li>Item 4</li>
</ul>
<h3>Some Heading Text Here 2</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<h4>Q1</h4>
<p>A1</p>
<h4>Q2</h4>
<p>A2</p>
<h4>Q3</h4>
<p>A3</p>

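I want to extract the HTML starting at the first <h3>, together with the elements that follow it, up to (but not including) the first occurrence of an <h4> tag.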

Expected output:

<h3>Some Heading Text Here 1</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<ul>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
    <li>Item 4</li>
</ul>
<h3>Some Heading Text Here 2</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>

I tried the following approach, which produced the result shown after the code:

from bs4 import BeautifulSoup

data = """<h3>Some Heading Text Here 1</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<ul>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
    <li>Item 4</li>
</ul>
<h3>Some Heading Text Here 2</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<h4>Q1</h4>
<p>A1</p>
<h4>Q2</h4>
<p>A2</p>
<h4>Q3</h4>
<p>A3</p>"""

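# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(data, "html.parser")

tags = soup.find_all('h3')

text = ""
for i in tags:
    text = text + str(i)
    # Append every following sibling until the first <h4>
    for x in i.next_siblings:
        if x.name == 'h4':
            break
        text = text + str(x)
print(text)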

Output:

<h3>Some Heading Text Here 1</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
<li>Item 4</li>
</ul>
<h3>Some Heading Text Here 2</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<h3>Some Heading Text Here 2</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
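
A likely explanation for the repeated block: find_all('h3') returns both headings, and the walk that starts at the first <h3> already continues past the second <h3> until it reaches the first <h4>. The outer loop then starts again from the second <h3> and appends the same elements a second time. One possible fix, sketched below, is to start only from the first <h3> using find(). The helper name extract_until and the shortened sample string are illustrative assumptions, not part of the original post.

```python
from bs4 import BeautifulSoup


def extract_until(html, start="h3", stop="h4"):
    """Return markup from the first <start> tag up to, but excluding, the first <stop> sibling.

    extract_until is a hypothetical helper written for this example.
    """
    soup = BeautifulSoup(html, "html.parser")
    first = soup.find(start)  # only the first match, unlike find_all()
    if first is None:
        return ""
    parts = [str(first)]
    for sibling in first.next_siblings:
        # Text nodes (e.g. the newlines between tags) have no tag name
        if getattr(sibling, "name", None) == stop:
            break
        parts.append(str(sibling))
    return "".join(parts)


# Shortened stand-in for the HTML in the question
sample = """<h3>Heading 1</h3>
<p>Paragraph</p>
<ul>
    <li>Item 1</li>
</ul>
<h3>Heading 2</h3>
<p>Paragraph</p>
<h4>Q1</h4>
<p>A1</p>"""

result = extract_until(sample)
print(result)
```

This sketch assumes the <h3> and <h4> tags are siblings under the same parent, as in the sample; if the <h4> is nested inside another element, next_siblings will not reach it and a different traversal would be needed.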

