Using BeautifulSoup to extract text between line breaks (e.g. <br /> tags)

Question

I have the following HTML, which is part of a larger document:

<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />

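I am currently using BeautifulSoup to get other elements within the HTML, but I have not found a way to get the important lines of text between the <br /> tags. I can isolate and navigate to each of the <br /> elements, but I can't find a way to get the text in between them. Any help would be greatly appreciated. Thanks.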

Answer 1

If you just want any text that sits between two <br /> tags, you could do something like the following:

from BeautifulSoup import BeautifulSoup, NavigableString, Tag

input = '''<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />'''

soup = BeautifulSoup(input)

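for br in soup.findAll('br'):
    # The node right after the <br /> must be a text node.
    next_s = br.nextSibling
    if not (next_s and isinstance(next_s,NavigableString)):
        continue
    # The node after that text must be another <br /> tag.
    next2_s = next_s.nextSibling
    if next2_s and isinstance(next2_s,Tag) and next2_s.name == 'br':
        # Skip text nodes that contain only whitespace.
        text = str(next_s).strip()
        if text:
            print "Found:", next_s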

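But perhaps I misunderstood your question? Your description of the problem doesn't seem to match the "important" / "non important" labels in your example data, so I went with the description ;)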
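
The code above targets BeautifulSoup 3 and Python 2. Below is a minimal sketch of the same sibling check for Python 3, assuming the bs4 package (BeautifulSoup 4) and the built-in html.parser. The function name text_between_brs is hypothetical and only there for illustration.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

html = '''<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />'''


def text_between_brs(markup):
    """Return the non-empty text nodes that sit directly between two <br> tags."""
    soup = BeautifulSoup(markup, "html.parser")
    found = []
    for br in soup.find_all("br"):
        # The node right after the <br> must be a text node.
        next_s = br.next_sibling
        if not isinstance(next_s, NavigableString):
            continue
        # The node after that text must be another <br> tag.
        next2_s = next_s.next_sibling
        if isinstance(next2_s, Tag) and next2_s.name == "br":
            text = str(next_s).strip()
            if text:  # skip whitespace-only text nodes
                found.append(text)
    return found


for line in text_between_brs(html):
    print("Found:", line)
```

Like the original, this returns every line of text enclosed by two <br /> tags, including the "Not Important" ones. Telling the two kinds apart would need a rule that the question does not state.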

Answer 2

So, for testing purposes, let's assume that this block of HTML is inside a span tag:

x = """<span><br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br /></span>"""

Now I'm going to parse it and find my span tag:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(x)
y = soup.find('span')

If you iterate over the generator returned by y.childGenerator(), you will get both the br tags and the text:

In [4]: for a in y.childGenerator(): print type(a), str(a)
   ....: 
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Important Text 1

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Not Important Text

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Important Text 2

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Important Text 3

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Non Important Text

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Important Text 4

<type 'instance'> <br />
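
The walk over the span's children can be turned into a list of text lines. This is a sketch rather than part of the original answer, assuming BeautifulSoup 4 (the bs4 package) on Python 3, where .children plays the role of childGenerator():

```python
from bs4 import BeautifulSoup, NavigableString

x = """<span><br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br /></span>"""

soup = BeautifulSoup(x, "html.parser")
y = soup.find("span")

# Keep only the text children of the span and drop whitespace-only nodes.
lines = [
    str(child).strip()
    for child in y.children
    if isinstance(child, NavigableString) and str(child).strip()
]
print(lines)
```

Filtering on NavigableString skips the br tags themselves, and the strip() check drops the blank text node that sits between two consecutive <br /> tags.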

Answer 3

The following worked for me:

for br in soup.findAll('br'):
    if str(type(br.contents[0])) == '<class \'BeautifulSoup.NavigableString\'>':
        print br.contents[0]
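
Whether br.contents holds any text depends on the parser. As far as I can tell, BeautifulSoup 4 with html.parser treats <br /> as an empty element, so its contents list is empty and indexing it would fail. The sketch below, which assumes the bs4 package on Python 3 and a shortened sample string, checks that and uses stripped_strings as an alternative way to collect the text:

```python
from bs4 import BeautifulSoup

html = "<br />\nImportant Text 1\n<br />\n<br />\nNot Important Text\n<br />"
soup = BeautifulSoup(html, "html.parser")

# Each <br /> is an empty element here, so it has no children of its own.
br_contents = [br.contents for br in soup.find_all("br")]
print(br_contents)

# stripped_strings yields every non-blank text node with whitespace trimmed.
texts = list(soup.stripped_strings)
print(texts)
```

Note that stripped_strings returns all text in the parsed document, not only the text enclosed by two <br /> tags, so in a larger document it should be called on the element that contains the lines of interest.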

Answer 1
score 22, accepted, 2011-03-11 17:00:28

Answer 2
score 6, 2011-03-11 17:01:44

Answer 3
score 0, 2016-02-02 16:59:20
