
Extracting text between <br> with beautifulsoup, but without next tag

I'm using Python and BeautifulSoup to try to get the text between the <br> tags. The closest I have come is using next_sibling in the following manner:

<html>
<body>
</a><span class="strong">Title1</span>
<p>Text1</p>
<br>The Text I want to get<br>
<p>Text I dont want</p>
</body>
</html>

for span in soup.findAll("span", {"class" : "strong"}):
    print(span.next_sibling.next_sibling.text)

But this prints:

The Text I want to getText I dont want

So the text I want comes after the first <p> but before the second. I can't figure out how to extract it when it is not wrapped in a real tag and the only reference points are the <br> tags.

I need it to print:

The Text I want to get

Since the HTML you've provided is broken, the behavior will differ depending on which parser BeautifulSoup uses.

With the lxml parser, BeautifulSoup converts each br tag into a self-closing one:

>>> soup = BeautifulSoup(data, 'lxml')
>>> print(soup)
<html>
<body>
<span class="strong">Title1</span>
<p>Text1</p>
<br/>The Text I want to get<br/>
<p>Text I dont want</p>
</body>
</html>

Note that this requires lxml to be installed. If that is acceptable, find the first br and take its next sibling:

from bs4 import BeautifulSoup

data = """your HTML"""
soup = BeautifulSoup(data, 'lxml')

print(soup.br.next_sibling)  # prints "The Text I want to get"
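
If installing lxml is not an option, the same idea can be sketched with the built-in html.parser. This is only an illustration, assuming a reasonably recent bs4 release in which html.parser also treats br as an empty element; the helper name text_after_first_br is made up for this example. It collects the bare text nodes that follow the first br and stops at the next tag:

```python
from bs4 import BeautifulSoup, NavigableString

# The HTML from the question, including the stray </a>.
html = """<html>
<body>
</a><span class="strong">Title1</span>
<p>Text1</p>
<br>The Text I want to get<br>
<p>Text I dont want</p>
</body>
</html>"""


def text_after_first_br(markup, parser="html.parser"):
    """Hypothetical helper: return the bare text that follows the first <br>.

    Assumes the parser closes <br> immediately, so the wanted text is a
    sibling of the tag rather than a child of it.
    """
    soup = BeautifulSoup(markup, parser)
    br = soup.find("br")
    if br is None:
        return None

    pieces = []
    for node in br.next_siblings:
        if isinstance(node, NavigableString):
            # Plain text node: keep it.
            pieces.append(str(node))
        else:
            # Any tag (here the second <br>) ends the run of text.
            break
    return "".join(pieces).strip()


print(text_after_first_br(html))
```

Stopping at the first tag, instead of taking a single next_sibling, also covers the case where the text is split across several adjacent text nodes.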

Using Python Scrapy, you can select the text nodes directly under body with XPath in the Scrapy shell:

In [4]: hxs.select('//body/text()').extract()
Out[4]: [u'\n', u'\n', u'\n', u'The Text I want to get', u'\n', u'\n']
