BeautifulSoup - how to extract text that has no opening tag and comes before a <br> tag?

Question

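I'm new to Python and BeautifulSoup and have spent quite a while trying to figure this out.
I want to extract three specific pieces of text from a <div> that has no class.
The first piece of text is inside an <a> tag, which is itself inside an <h4> tag. I managed to extract this one.
The second piece of text comes immediately after the closing </h4> tag and is followed by a <br> tag.
The third piece of text comes immediately after the <br> tag that follows the second piece, and is itself followed by a <br> tag.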

Here is the HTML extract I'm working with:

<div>
    <h4 class="actorboxLink">
    <a href="/a-decheterie-de-bagnols-2689">Decheterie de Bagnols</a>
    </h4>
    Route des 4 Vents<br>
    63810 Bagnols<br>
</div>

I want to extract:

Decheterie de Bagnols < works

Route des 4 Vents < does not work

63810 Bagnols < does not work

Here is my code so far:

import urllib
from bs4 import BeautifulSoup    
data = urllib.urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
name = soup.findAll("h4", class_="actorboxLink")

for a_tag in name:
    print a_tag.text.strip()

I need something like "soup.findAll(all text after </h4>)".

I played around with .next_sibling but couldn't get it to work.

Any ideas? Thanks

Update:
I tried this:

for a_tag in classActorboxLink:
    print a_tag.find_all_next(string=True, limit=5)

This gives me:
[u'\n', u'\r\n\t\t\t\t\t\tDecheterie\xa0de\xa0Bagnols\t\t\t\t\t', u'\n', u'\r\n\t\t\t\tRoute\xa0des\xa04\xa0Vents', u'\r\n\t\t\t\t63810 Bagnols']

It's a start, but I need to remove all the whitespace and unnecessary characters. I tried using .strip(), .strings and .stripped_strings, but none of them work. Examples:

for a_tag in classActorboxLink.strings

for a_tag in classActorboxLink.stripped_strings

print a_tag.find_all_next(string=True, limit=5).strip()

For all three I get:

AttributeError: 'ResultSet' object has no attribute 'strings/stripped_strings/strip'

Answer 1

Find the h4 elements and use find_next_siblings():

h4s = soup.find_all("h4", class_="actorboxLink")
for h4 in h4s:
    for text in h4.find_next_siblings(text=True):
        print(text.strip())
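
To make the approach above easier to try out, here is a minimal self-contained sketch that runs it against the HTML from the question. The helper name extract_entries is hypothetical, and the filter that drops whitespace-only strings is an addition (the sibling text node after the last <br> is just a newline, so it would otherwise print as an empty line). It uses string=True, the same keyword the question's update uses.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """
<div>
    <h4 class="actorboxLink">
    <a href="/a-decheterie-de-bagnols-2689">Decheterie de Bagnols</a>
    </h4>
    Route des 4 Vents<br>
    63810 Bagnols<br>
</div>
"""


def extract_entries(markup):
    """Return one [name, line, line, ...] list per <h4 class="actorboxLink">.

    Hypothetical helper, written only for this illustration.
    """
    soup = BeautifulSoup(markup, "html.parser")
    entries = []
    for h4 in soup.find_all("h4", class_="actorboxLink"):
        # Text of the <a> inside the <h4>, with surrounding whitespace removed.
        name = h4.get_text(strip=True)
        # string=True keeps only the text nodes among the following siblings
        # (the <br> tags are skipped); the "if" drops whitespace-only nodes.
        lines = [s.strip() for s in h4.find_next_siblings(string=True) if s.strip()]
        entries.append([name] + lines)
    return entries


print(extract_entries(html))
```

For the sample markup this should give a single entry holding the name followed by the two address lines, each in its own list slot, so they can be assigned to separate variables.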

Answer 2

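If you don't need each of the three pieces in its own variable, you can call get_text() on the <div> to get them all in a single string. If there are other div tags but they all have a class, you can find every <div> without one by passing class_=False. If you cannot isolate the <div> you are interested in, this solution won't work for you.

from urllib.request import urlopen
from bs4 import BeautifulSoup

# url is the address of the page to scrape, as in the question
data = urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")

# class_=False matches only <div> tags that have no class attribute
for div in soup.find_all("div", class_=False):
    print(div.get_text().strip())

By the way, this is Python 3 and bs4.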
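
As a rough illustration of the get_text() idea, the sketch below works on the question's HTML directly instead of fetching a page. It also shows .stripped_strings, which is available on an individual Tag (not on the ResultSet that find_all() returns, which is likely the cause of the AttributeError in the question). The function name div_text_parts and the ", " separator are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """
<div>
    <h4 class="actorboxLink">
    <a href="/a-decheterie-de-bagnols-2689">Decheterie de Bagnols</a>
    </h4>
    Route des 4 Vents<br>
    63810 Bagnols<br>
</div>
"""


def div_text_parts(markup):
    """Return (list_of_parts, joined_string) for the first class-less <div>.

    Hypothetical helper, written only for this illustration.
    """
    soup = BeautifulSoup(markup, "html.parser")
    # find() returns a single Tag; class_=False matches a <div> with no class.
    div = soup.find("div", class_=False)
    # stripped_strings yields each text node with whitespace removed,
    # skipping nodes that are empty after stripping.
    parts = list(div.stripped_strings)
    # get_text() with a separator joins the same pieces into one string.
    joined = div.get_text(separator=", ", strip=True)
    return parts, joined


parts, joined = div_text_parts(html)
print(parts)
print(joined)
```

When looping over the result of find_all(), the same attributes can be used on each element inside the loop rather than on the result set itself.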

Answer 1
2, accepted, 2015-09-22 01:47:06

Answer 2
0, 2015-09-22 03:07:17
