Extract text between <br> tags with BeautifulSoup when there is no enclosing tag

Question

I'm using Python and BeautifulSoup to try to get the text between the <br> tags. The closest I got was by using next_sibling in the following manner:

<html>
<body>
</a><span class="strong">Title1</span>
<p>Text1</p>
<br>The Text I want to get<br>
<p>Text I dont want</p>
</body>
</html>

for span in soup.findAll("span", {"class" : "strong"}):
    print(span.next_sibling.next_sibling.text)

But this prints:

The Text I want to getText I dont want

So what I want is the text after the first <p> but before the second. I can't figure out how to extract it when there are no real enclosing tags, only the <br> tags as references.

I need it to print:

The Text I want to get

Answer 1

Since the HTML you've provided is broken, the behavior differs depending on which parser BeautifulSoup uses.

With the lxml parser, BeautifulSoup converts the br tag into a self-closing one:

>>> soup = BeautifulSoup(data, 'lxml')
>>> print(soup)
<html>
<body>
<span class="strong">Title1</span>
<p>Text1</p>
<br/>The Text I want to get<br/>
<p>Text I dont want</p>
</body>
</html>

Note that you need lxml to be installed. If that is okay for you, find the br and get its next sibling:

from bs4 import BeautifulSoup

data = """your HTML"""
soup = BeautifulSoup(data, 'lxml')

print(soup.br.next_sibling)  # prints "The Text I want to get"
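If the text after the br might be split across several text nodes, or if lxml is not available, a slightly more defensive variant can walk the siblings and stop at the next tag. This is only a sketch: the helper name `text_after_first_br` is hypothetical, and it assumes the chosen parser treats br as an empty element (recent BeautifulSoup versions do this with the built-in `html.parser`; older versions may behave differently).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

HTML = """<html>
<body>
</a><span class="strong">Title1</span>
<p>Text1</p>
<br>The Text I want to get<br>
<p>Text I dont want</p>
</body>
</html>"""


def text_after_first_br(html, parser="html.parser"):
    """Return the text between the first <br> and the next tag, or None.

    Hypothetical helper; assumes the parser yields <br> as an empty
    element so that the wanted text is a sibling of it, not a child.
    """
    soup = BeautifulSoup(html, parser)
    br = soup.find("br")
    if br is None:
        return None
    parts = []
    for node in br.next_siblings:
        if isinstance(node, Tag):
            # Stop at the first following tag (here, the second <br>).
            break
        parts.append(str(node))
    return "".join(parts).strip()


if __name__ == "__main__":
    print(text_after_first_br(HTML))
```

Passing `parser="lxml"` should give the same result when lxml is installed.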


Answer 2

Using Python Scrapy:

In [4]: hxs.select('//body/text()').extract()
Out[4]: [u'\n', u'\n', u'\n', u'The Text I want to get', u'\n', u'\n']
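For readers who want to stay with BeautifulSoup, roughly the same idea as the `//body/text()` XPath (take only the text nodes that are direct children of body) can be sketched as follows. The helper name `direct_body_text` is hypothetical, the whitespace-only nodes are filtered out for convenience, and the sketch assumes a BeautifulSoup version that accepts the `string` argument to `find_all` and treats br as an empty element.

```python
from bs4 import BeautifulSoup

HTML = """<html>
<body>
</a><span class="strong">Title1</span>
<p>Text1</p>
<br>The Text I want to get<br>
<p>Text I dont want</p>
</body>
</html>"""


def direct_body_text(html, parser="html.parser"):
    """Return the non-blank text nodes that are direct children of <body>.

    Hypothetical helper mirroring the XPath //body/text().
    """
    soup = BeautifulSoup(html, parser)
    if soup.body is None:
        return []
    # recursive=False restricts the search to direct children of <body>,
    # so text inside <span> or <p> is not included.
    nodes = soup.body.find_all(string=True, recursive=False)
    return [s.strip() for s in nodes if s.strip()]


if __name__ == "__main__":
    print(direct_body_text(HTML))
```

Unlike the Scrapy output above, this drops the bare newline entries, which leaves only the text of interest for this particular document.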

Answer 1: score 3, accepted, 2015-01-07 07:49:06

Answer 2: score 0, 2015-01-07 07:46:39
