Using beautifulsoup to extract text between line breaks (e.g. <br /> tags)
I have the following HTML that is part of a larger document:
<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />
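I'm currently using BeautifulSoup to get other elements out of the HTML, but I have not found a way to get the important lines of text between the <br /> tags. I can isolate and navigate to each of the <br /> elements, but I can't find a way to get the text in between them. Any help would be greatly appreciated. Thank you.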
If you just want any text that sits between two <br /> tags, you could do something like the following:
from BeautifulSoup import BeautifulSoup, NavigableString, Tag
input = '''<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />'''
soup = BeautifulSoup(input)
for br in soup.findAll('br'):
    # The node right after this <br /> must be a text node
    next_s = br.nextSibling
    if not (next_s and isinstance(next_s,NavigableString)):
        continue
    # ...and that text node must be followed directly by another <br />
    next2_s = next_s.nextSibling
    if next2_s and isinstance(next2_s,Tag) and next2_s.name == 'br':
        text = str(next_s).strip()
        if text:
            print "Found:", next_s
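But perhaps I misunderstood your question? Your description of the problem doesn't seem to match the "important"/"non important" labels in your example data, so I have gone with the description ;)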
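For readers on Python 3, the same sibling check can be sketched with the newer bs4 package. This is only a rough port under a few assumptions: bs4 is installed, the built-in "html.parser" backend is used, and the helper name text_between_breaks is made up for illustration. It collects the matches in a list instead of printing them.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

HTML = """<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />"""


def text_between_breaks(html):
    """Return each non-empty string that sits directly between two <br> tags.

    Hypothetical helper: a bs4 port of the sibling walk shown above.
    """
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for br in soup.find_all("br"):
        nxt = br.next_sibling
        # Skip when the next node is missing or is not a text node.
        if not isinstance(nxt, NavigableString):
            continue
        after = nxt.next_sibling
        # Keep the text only if another <br> follows it immediately.
        if isinstance(after, Tag) and after.name == "br":
            text = str(nxt).strip()
            if text:
                found.append(text)
    return found


print(text_between_breaks(HTML))
```

Like the original, this keeps every non-blank line that is enclosed by two <br /> tags, so it does not try to tell "important" lines from the others.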
So, for testing purposes, let's assume that this chunk of HTML is inside a span tag:
x = """<span><br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br /></span>"""
Now I'm going to parse it and find my span tag:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(x)  # parse the test string defined above
y = soup.find('span')
If you iterate over the generator returned by y.childGenerator(), you will get both the br tags and the text:
In [4]: for a in y.childGenerator(): print type(a), str(a)
....:
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Important Text 1
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Not Important Text
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Important Text 2
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Important Text 3
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Non Important Text
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Important Text 4
<type 'instance'> <br />
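One possible way to use that stream of children is to collect the text lines and start a new group whenever two <br /> tags appear back to back. The sketch below is a guess at a useful grouping, not something the question asks for. It assumes Python 3 with bs4 and the "html.parser" backend, uses span.children as the bs4 counterpart of childGenerator(), and the function name group_lines_by_blank_break is hypothetical.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

X = """<span><br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br /></span>"""


def group_lines_by_blank_break(html):
    """Split the span's text lines into groups separated by doubled <br> tags."""
    span = BeautifulSoup(html, "html.parser").find("span")
    groups, current = [], []
    previous_was_br = False
    for child in span.children:
        if isinstance(child, Tag) and child.name == "br":
            # Two <br> tags with only whitespace between them close a group.
            if previous_was_br and current:
                groups.append(current)
                current = []
            previous_was_br = True
        elif isinstance(child, NavigableString):
            text = child.strip()
            if text:
                current.append(text)
                previous_was_br = False
    # Flush whatever is left after the last <br>.
    if current:
        groups.append(current)
    return groups


for group in group_lines_by_blank_break(X):
    print(group)
```

Whitespace-only strings between tags are ignored, which is why a doubled <br /> is detected even though a newline sits between the two tags.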
The following worked for me:
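for br in soup.findAll('br'):
    # br.contents is empty when the parser treats <br /> as self-closing, so check it first
    if br.contents and str(type(br.contents[0])) == '<class \'BeautifulSoup.NavigableString\'>':
        print br.contents[0]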
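With bs4, <br> is normally treated as an empty element, so br.contents tends to be empty and the check above may find nothing. If all that is needed is the list of non-blank text lines in the fragment, a shorter route is the stripped_strings generator. This is a minimal sketch assuming Python 3, bs4, and the "html.parser" backend; lines_between_breaks is an illustrative name.

```python
from bs4 import BeautifulSoup


def lines_between_breaks(html):
    """Return every non-blank text string in the fragment, stripped.

    Note: this collects all text in the fragment, not only text that is
    enclosed by <br> tags, so it suits fragments made of text and breaks.
    """
    soup = BeautifulSoup(html, "html.parser")
    return list(soup.stripped_strings)


sample = "<br />\nImportant Text 1\n<br />\n<br />\nNot Important Text\n<br />"
print(lines_between_breaks(sample))
```

Because it ignores the tags entirely, this variant cannot tell a single break from a doubled one; use a sibling or child walk when that distinction matters.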