How can I extract all text between tags?

Question

I would like to extract a random poem from this book.

Using BeautifulSoup, I have been able to find the title and prose.

print(soup.find('div', class_="pre_poem").text)
print(soup.find('table', class_="poem").text)

But I would like to find all the poems and pick one. Should I use a regex and match everything between <h3> and </span></p>?

Answer 1

Assuming you already have a suitable soup object to work with, the following might help you get started:

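import random

# Collect the href of every link in the table of contents.
poem_ids = []

for section in soup.find_all('ol', class_="TOC"):
    poem_ids.extend(li.find('a').get('href') for li in section.find_all('li'))

# Strip the leading '#' from each href; the last TOC entry is skipped.
poem_ids = [href[1:] for href in poem_ids[:-1] if href]
poem_id = random.choice(poem_ids)

# Walk forward from the poem's anchor, collecting text nodes until the next <h3>.
poem_start = soup.find('a', id=poem_id)
poem = poem_start.find_next()
poem_text = []

while True:
    poem = poem.next_element

    # Stop at the next heading, or at the end of the document.
    if poem is None or poem.name == 'h3':
        break

    if poem.name is None:
        poem_text.append(poem.string)

print('\n'.join(poem_text).replace('\n\n\n', '\n'))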

This first extracts the list of poems from the table of contents at the top of the page. Each entry links to a unique ID for its poem. Next, a random ID is chosen and the matching poem is extracted based on that ID.

For example, if the first poem was selected, you would see the following output:

"The Arrow and the Song," by Longfellow (1807-82), is placed first in
this volume out of respect to a little girl of six years who used to
love to recite it to me. She knew many poems, but this was her
favourite.


I shot an arrow into the air,
It fell to earth, I knew not where;
For, so swiftly it flew, the sight
Could not follow it in its flight.


I breathed a song into the air,
It fell to earth, I knew not where;
For who has sight so keen and strong
That it can follow the flight of song?


Long, long afterward, in an oak
I found the arrow, still unbroke;
And the song, from beginning to end,
I found again in the heart of a friend.


Henry W. Longfellow.

This is done by using BeautifulSoup to extract all of the text from each element until the next <h3> tag is found, and then removing any extra line breaks.
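
The line-break cleanup could be made a little more general. The following is only a sketch and not part of the original answer: it assumes the collected text is a plain string, and collapse_blank_lines is a hypothetical helper name. It reduces any run of three or more newlines to a single blank line, rather than replacing exact triples only:

```python
import re


def collapse_blank_lines(text):
    """Reduce any run of three or more newlines to exactly two (one blank line)."""
    return re.sub(r'\n{3,}', '\n\n', text)


# Hypothetical input: two lines separated by too many line breaks.
stanzas = "I shot an arrow into the air,\n\n\n\nI breathed a song into the air,"
print(collapse_blank_lines(stanzas))
```

Text that contains only single or double newlines passes through unchanged.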

Answer 2

Use an HTML document parser instead. It is safer in terms of unintended consequences.

The reason programmers generally discourage parsing HTML with regex is that the markup of a page is not static, especially if your source HTML is a live web page, so a pattern that matches today can silently break when the markup changes. Regex is better suited to plain strings with a predictable format.

Use regex at your own risk.
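
As a rough illustration of the parser-based approach, here is a minimal sketch that uses Python's standard-library html.parser module rather than BeautifulSoup. The class name SectionCollector and the sample markup are hypothetical, and it assumes each poem begins at an <h3> heading as described in the question; the real page may be structured differently.

```python
from html.parser import HTMLParser
import random


class SectionCollector(HTMLParser):
    """Group text into sections, starting a new section at every <h3>."""

    def __init__(self):
        super().__init__()
        self.sections = []  # one list of text fragments per <h3> section

    def handle_starttag(self, tag, attrs):
        if tag == 'h3':
            self.sections.append([])

    def handle_data(self, data):
        text = data.strip()
        # Text that appears before the first <h3> is ignored.
        if text and self.sections:
            self.sections[-1].append(text)


# Hypothetical markup standing in for the book page.
sample_html = """
<p>Preface</p>
<h3>First Poem</h3><p>Line one.<br>Line two.</p>
<h3>Second Poem</h3><p>Another line.</p>
"""

collector = SectionCollector()
collector.feed(sample_html)
collector.close()

# Join each section's fragments and pick one section at random.
poems = ['\n'.join(section) for section in collector.sections]
print(random.choice(poems))
```

Because the parser reacts to tags rather than to character patterns, extra attributes or whitespace inside the tags do not affect which text ends up in each section.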

2 answers

solution1
0 2015-12-16 13:09:29

solution2
0 2015-12-16 13:10:58
