
BeautifulSoup + Python (Extract Specific HTML Tags from Page Source Code)

I have the following HTML code:

<h3>Some Heading Text Here 1</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<ul>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
    <li>Item 4</li>
</ul>
<h3>Some Heading Text Here 2</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<h4>Q1</h4>
<p>A1</p>
<h4>Q2</h4>
<p>A2</p>
<h4>Q3</h4>
<p>A3</p>

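I want to extract the HTML starting at the first <h3> and including everything that follows it, up to (but not including) the first <h4> tag.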

Expected output:

<h3>Some Heading Text Here 1</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<ul>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
    <li>Item 4</li>
</ul>
<h3>Some Heading Text Here 2</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>

I tried the following, and got the result shown below:

from bs4 import BeautifulSoup

data = """<h3>Some Heading Text Here 1</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<ul>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
    <li>Item 4</li>
</ul>
<h3>Some Heading Text Here 2</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<h4>Q1</h4>
<p>A1</p>
<h4>Q2</h4>
<p>A2</p>
<h4>Q3</h4>
<p>A3</p>"""

soup = BeautifulSoup(data)

tags = soup.find_all('h3')

text = ""
for i in tags:
    # print(i)
    text = text+str(i)
    for x in i.next_siblings:
        
        if x.name == 'h4':
            break
        else:
            text = text+str(x)
print(text)

Output:

<h3>Some Heading Text Here 1</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
<li>Item 4</li>
</ul>
<h3>Some Heading Text Here 2</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<h3>Some Heading Text Here 2</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
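
The second heading and its paragraphs are repeated at the end of this output, most likely because find_all('h3') returns both <h3> tags: the loop walks the siblings of the first heading (which already include the second heading and its paragraphs), then walks the siblings of the second heading again. One possible fix, sketched below, is to start from the first <h3> only by using find() and stop at the first <h4>. The explicit "html.parser" argument and the variable names first_h3 and parts are choices made for this sketch and are not part of the original attempt.

```python
from bs4 import BeautifulSoup

data = """<h3>Some Heading Text Here 1</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<ul>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
    <li>Item 4</li>
</ul>
<h3>Some Heading Text Here 2</h3>
<p>Some paragraph text here</p>
<p>Some paragraph text here</p>
<h4>Q1</h4>
<p>A1</p>
<h4>Q2</h4>
<p>A2</p>
<h4>Q3</h4>
<p>A3</p>"""

# Naming the parser explicitly avoids the "no parser specified" warning.
soup = BeautifulSoup(data, "html.parser")

# find() returns only the first <h3>, so its siblings are walked exactly once.
first_h3 = soup.find("h3")

parts = [str(first_h3)]
for sibling in first_h3.next_siblings:
    # Siblings include text nodes (the newlines between tags), which have
    # no tag name; getattr keeps the comparison safe for them.
    if getattr(sibling, "name", None) == "h4":
        break
    parts.append(str(sibling))

text = "".join(parts)
print(text)
```

With this change the second <h3> section should appear only once, because it is reached as a sibling of the first heading rather than through a second pass of the outer loop.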

