How can I extract all text between tags?

I would like to extract a random poem from this book.

Using BeautifulSoup, I have been able to find the title and prose.

print(soup.find('div', class_="pre_poem").text)
print(soup.find('table', class_="poem").text)

But I would like to find all the poems and pick one. Should I use a regex and match everything between <h3> and </span></p>?

Assuming you already have a suitable soup object to work with, the following might help you get started:

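import random

# Collect the href of every link in the table(s) of contents.
poem_ids = []

for section in soup.find_all('ol', class_="TOC"):
    poem_ids.extend(li.find('a').get('href') for li in section.find_all('li'))

# Strip the leading '#' from each href; the final TOC entry is skipped.
poem_ids = [href[1:] for href in poem_ids[:-1] if href]
poem_id = random.choice(poem_ids)

# Start at the first tag after the poem's anchor.
poem_start = soup.find('a', id=poem_id)
poem = poem_start.find_next()
poem_text = []

while True:
    poem = poem.next_element

    # Stop at the next heading, or at the end of the document.
    if poem is None or poem.name == 'h3':
        break

    # Text nodes have no tag name; keep only those.
    if poem.name is None:
        poem_text.append(poem.string)

print('\n'.join(poem_text).replace('\n\n\n', '\n'))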

This first extracts a list of the poems from the table of contents at the top of the page. Each entry links to its poem by a unique ID. A random ID is then chosen, and the matching poem is extracted based on that ID.

For example, if the first poem was selected, you would see the following output:

"The Arrow and the Song," by Longfellow (1807-82), is placed first in
this volume out of respect to a little girl of six years who used to
love to recite it to me. She knew many poems, but this was her
favourite.


I shot an arrow into the air,
It fell to earth, I knew not where;
For, so swiftly it flew, the sight
Could not follow it in its flight.


I breathed a song into the air,
It fell to earth, I knew not where;
For who has sight so keen and strong
That it can follow the flight of song?


Long, long afterward, in an oak
I found the arrow, still unbroke;
And the song, from beginning to end,
I found again in the heart of a friend.


Henry W. Longfellow.

This is done by using BeautifulSoup to extract all of the text from each element until the next <h3> tag is found, and then removing any extra line breaks.
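To make the "collect text until the next <h3>" step easier to try in isolation, here is a minimal sketch that runs against a tiny inline page instead of the real book. The markup in SAMPLE_HTML, the ids poem1 and poem2, and the helper name text_until_next_heading are all hypothetical; the real page's structure may differ, so treat this as an illustration of the traversal rather than a drop-in solution.

```python
import random

from bs4 import BeautifulSoup, NavigableString

# Hypothetical miniature page: a TOC that links to headings by id.
SAMPLE_HTML = """
<ol class="TOC">
  <li><a href="#poem1">First</a></li>
  <li><a href="#poem2">Second</a></li>
</ol>
<h3 id="poem1">First</h3>
<p>First poem, line one.</p>
<h3 id="poem2">Second</h3>
<p>Second poem, line one.</p>
<p>Second poem, line two.</p>
"""


def text_until_next_heading(start, stop_tag="h3"):
    """Collect non-empty text nodes after `start` until the next `stop_tag`."""
    pieces = []
    node = start.next_element
    while node is not None:  # None means the end of the document
        if getattr(node, "name", None) == stop_tag:
            break
        if isinstance(node, NavigableString):
            text = node.strip()
            if text:
                pieces.append(text)
        node = node.next_element
    return pieces


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Read the ids from the TOC links, dropping the leading '#'.
poem_ids = [
    a["href"][1:]
    for toc in soup.find_all("ol", class_="TOC")
    for a in toc.find_all("a", href=True)
]

chosen = random.choice(poem_ids)
print("\n".join(text_until_next_heading(soup.find("h3", id=chosen))))
```

Because the walk starts at the heading itself in this sketch, the heading text is included as the first line. Stripping each text node also sidesteps the need to collapse repeated newlines afterwards.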

Use an HTML document parser instead. It is safer in terms of unintended consequences.

The reason programmers discourage parsing HTML with regex is that a page's HTML markup is not static, especially when the source HTML is a live webpage. A pattern written against today's markup can silently stop matching when that markup changes. Regex is better suited to plain strings with a predictable format.

Use regex at your own risk.
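As a rough illustration of the parser-based approach, the sketch below groups a document's text into sections that each begin at an <h3>. It uses Python's standard-library html.parser rather than BeautifulSoup, purely so it runs without extra dependencies; the class name HeadingSectionParser and the sample markup are made up for this example.

```python
from html.parser import HTMLParser


class HeadingSectionParser(HTMLParser):
    """Group a document's text into sections that each start at an <h3>."""

    def __init__(self):
        super().__init__()
        self.sections = []  # one list of text fragments per <h3> section

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.sections.append([])  # a new heading opens a new section

    def handle_data(self, data):
        text = data.strip()
        if text and self.sections:  # ignore text before the first <h3>
            self.sections[-1].append(text)


parser = HeadingSectionParser()
parser.feed("<p>Intro</p><h3>One</h3><p>a<br>b</p><h3>Two</h3><p>c</p>")
parser.close()
print(parser.sections)  # [['One', 'a', 'b'], ['Two', 'c']]
```

The parser reacts to tags as events, so extra attributes, nested inline tags, or reordered whitespace in the markup do not change the result, which is where a hand-written pattern tends to break.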
