How do I extract the HTML between two tags in Python?

Question

I have the following HTML:

<h2>blah</h2>
html content to extract 
(here can come tags, nested structures too, but no top-level h2)
<h2>other blah</h2>

Is it possible to extract the content without using string.split("<h2>") in Python?
(For example, with BeautifulSoup or some other library?)

Answer 1

With BeautifulSoup, use the .next_siblings iterable to get the text that follows the tag:

>>> from bs4 import BeautifulSoup, NavigableString
>>> from itertools import takewhile
>>> sample = '<h2>blah</h2>\nhtml content to extract\n<h2>other blah</h2>'
>>> soup = BeautifulSoup(sample, 'html.parser')
>>> print(''.join(takewhile(lambda e: isinstance(e, NavigableString), soup.h2.next_siblings)))

html content to extract

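This collects the text nodes that directly follow the soup.h2 element, stopping at the first tag it meets, and joins them into a single string.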
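
Because the isinstance check stops at the first tag of any kind, nested markup between the two headings would be cut off. Here is a minimal sketch of a variant that keeps going until the next h2 instead, assuming Python 3 and the built-in html.parser backend. The helper name html_until_next_h2 and the sample string are made up for illustration and are not part of BeautifulSoup:

```python
from itertools import takewhile

from bs4 import BeautifulSoup, Tag

sample = '<h2>blah</h2>\ntext <div>nested <b>tags</b></div>\n<h2>other blah</h2>'


def html_until_next_h2(first_h2):
    # Hypothetical helper: keep every sibling (text node or tag)
    # until the next sibling <h2> tag is reached.
    siblings = takewhile(
        lambda e: not (isinstance(e, Tag) and e.name == 'h2'),
        first_h2.next_siblings,
    )
    # str() on a Tag serializes it back to markup, children included.
    return ''.join(str(e) for e in siblings)


soup = BeautifulSoup(sample, 'html.parser')
extracted = html_until_next_h2(soup.h2)
print(extracted)
```

This only looks at siblings of the first h2, so an h2 buried inside a nested element does not end the extraction.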

Answer 2

Here is some test code using HTQL, from http://htql.net:

sample="""<h2>blah</h2>
        html content to extract 
        <div>test</div>
        <h2>other blah<h2>
    """

import htql
htql.query(sample, "<h2 sep excl>2")
# [('\n        html content to extract \n        <div>test</div>\n        ',)]

htql.query(sample, "<h2 sep> {a=<h2>:tx; b=<h2 sep excl>2 | a='blah'} ")
# [('blah', '\n        html content to extract \n        <div>test</div>\n        ')]

Answer 3

Let me share a more robust solution:

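import bs4  # needed for the bs4.Tag check below

def get_chunk_after_tag(tag):
    """ Return the markup that follows tag, up to the next sibling tag
    with the same name. tag is a tag element in a bs4 soup.
    """
    result = ''
    for elem in tag.next_siblings:
        # Stop at the next sibling tag with the same name (e.g. the next <h2>).
        if isinstance(elem, bs4.Tag) and elem.name == tag.name:
            break
        result += str(elem)
    return result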

It extracts everything from one <hX> up to the next <hX>. It is easy to modify so that it extracts from one tag up to a different tag.
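
One way that modification might look, as a sketch rather than the answerer's own code: the stop condition takes a tag name as a parameter instead of reusing the start tag's name. The function name get_chunk_between, its stop_name parameter, and the sample HTML are assumptions made for illustration; Python 3 and the html.parser backend are assumed as well.

```python
import bs4


def get_chunk_between(start_tag, stop_name):
    """Hypothetical variant: collect the markup after start_tag until a
    sibling tag named stop_name is reached (or the siblings run out)."""
    parts = []
    for elem in start_tag.next_siblings:
        if isinstance(elem, bs4.Tag) and elem.name == stop_name:
            break
        parts.append(str(elem))
    return ''.join(parts)


html = '<h2>blah</h2>\ncontent <p>to extract</p>\n<h3>other blah</h3>trailing'
soup = bs4.BeautifulSoup(html, 'html.parser')
chunk = get_chunk_between(soup.h2, 'h3')
print(chunk)
```

As with the original function, only siblings are inspected, so the stop tag has to sit at the same nesting level as the start tag.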

3 answers

Answer 1
1 2013-11-12 16:30:43

Answer 2
1 accepted 2013-11-14 15:19:29

Answer 3
0 2013-11-12 18:23:24
