BeautifulSoup - how to extract text without an opening tag and before a <br> tag?

Question

I'm new to Python and BeautifulSoup and have spent quite a few hours trying to figure this one out.
I want to extract three particular pieces of text within a <div> that has no class.
The first piece of text is within an <a> tag, which is within an <h4> tag. I managed to extract this one.
The second piece of text immediately follows the closing </h4> tag and is followed by a <br> tag.
The third piece of text immediately follows the <br> tag after the second piece of text and is also followed by a <br> tag.

Here is the HTML extract I am working with:

<div>
    <h4 class="actorboxLink">
    <a href="/a-decheterie-de-bagnols-2689">Decheterie de Bagnols</a>
    </h4>
    Route des 4 Vents<br>
    63810 Bagnols<br>
</div>

I want to extract:

Decheterie de Bagnols < That works

Route des 4 Vents < Doesn't work

63810 Bagnols < Doesn't work

Here is the code I have so far:

import urllib
from bs4 import BeautifulSoup    
data = urllib.urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
name = soup.findAll("h4", class_="actorboxLink")

for a_tag in name:
    print a_tag.text.strip()

I need something like "soup.findAll(all text after </h4>)".

I played with .next_sibling but I can't get it to work.

Any ideas? Thanks

UPDATE:
I tried this:

for a_tag in classActorboxLink:
    print a_tag.find_all_next(string=True, limit=5)

which gives me:
[u'\n', u'\r\n\t\t\t\t\t\tDecheterie\xa0de\xa0Bagnols\t\t\t\t\t', u'\n', u'\r\n\t\t\t\tRoute\xa0des\xa04\xa0Vents', u'\r\n\t\t\t\t63810 Bagnols']

It's a start, but I need to remove all the whitespace and unnecessary characters. I tried using .strip(), .strings and .stripped_strings, but none of them work. Examples:

for a_tag in classActorboxLink.strings

for a_tag in classActorboxLink.stripped_strings

print a_tag.find_all_next(string=True, limit=5).strip()

For all three I get:

AttributeError: 'ResultSet' object has no attribute 'strings/stripped_strings/strip'

Answer 1

Locate the h4 element and use find_next_siblings():

h4s = soup.find_all("h4", class_="actorboxLink")
for h4 in h4s:
    for text in h4.find_next_siblings(text=True):
        print(text.strip())
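For reference, here is a self-contained sketch of the same approach run against the HTML from the question instead of a live page. The `clean` and `lines_after_h4` helpers are hypothetical names, not part of BeautifulSoup; `clean` replaces the non-breaking spaces (`\xa0`) that show up in the question's output and collapses runs of whitespace. The sketch uses `string=True`, which is the newer spelling of `text=True` in Beautiful Soup 4.4 and later, and it assumes Python 3.

```python
from bs4 import BeautifulSoup

# HTML copied from the question; a real page would be fetched instead.
html = """
<div>
    <h4 class="actorboxLink">
    <a href="/a-decheterie-de-bagnols-2689">Decheterie de Bagnols</a>
    </h4>
    Route des 4 Vents<br>
    63810 Bagnols<br>
</div>
"""


def clean(text):
    # Hypothetical helper: turn non-breaking spaces into plain spaces,
    # then collapse tabs, newlines and repeated spaces into single spaces.
    return " ".join(text.replace("\xa0", " ").split())


def lines_after_h4(soup):
    # Collect the non-empty text nodes that are siblings following each <h4>.
    results = []
    for h4 in soup.find_all("h4", class_="actorboxLink"):
        for text in h4.find_next_siblings(string=True):
            cleaned = clean(text)
            if cleaned:  # skip whitespace-only nodes such as the trailing newline
                results.append(cleaned)
    return results


soup = BeautifulSoup(html, "html.parser")
for line in lines_after_h4(soup):
    print(line)
```

Filtering out empty strings matters here because the whitespace after the last <br> is itself a text node and would otherwise print as a blank line.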

Answer 2

If you don't need each of the three elements in separate variables, you could just call get_text() on the <div> to get them all in one string. If there are other <div> tags but they all have classes, you can find all the <div> tags without a class by passing class_=False. If you can't isolate the <div> you are interested in, this solution won't work for you.

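from urllib.request import urlopen
from bs4 import BeautifulSoup

# url is assumed to hold the address of the page being scraped
data = urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")

# class_=False matches only <div> tags that have no class attribute
for div in soup.find_all("div", class_=False):
    print(div.get_text().strip())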

BTW this is Python 3 and bs4.
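If the three values are needed separately rather than as one string, one option is `.stripped_strings`. It exists on an individual tag, not on the `ResultSet` that `find_all()` returns, which is why the attempts in the question raise `AttributeError`. Below is a minimal sketch against the question's HTML, assuming the target <div> is the only one without a class and that it holds exactly three text fragments; `fields_per_div` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# HTML copied from the question; a real page would be fetched instead.
html = """
<div>
    <h4 class="actorboxLink">
    <a href="/a-decheterie-de-bagnols-2689">Decheterie de Bagnols</a>
    </h4>
    Route des 4 Vents<br>
    63810 Bagnols<br>
</div>
"""


def fields_per_div(soup):
    # For every class-less <div>, gather its text fragments with
    # surrounding whitespace removed and whitespace-only nodes skipped.
    rows = []
    for div in soup.find_all("div", class_=False):
        rows.append(list(div.stripped_strings))
    return rows


soup = BeautifulSoup(html, "html.parser")
# Assumption: each matching <div> yields exactly three strings.
for name, street, town in fields_per_div(soup):
    print(name, street, town, sep=" | ")
```

Iterating over the `ResultSet` and reading `.stripped_strings` from each tag avoids the error, and the name and both address lines stay in separate variables.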
