BeautifulSoup: extract between href and class?

I want to store the dates from the following chunk of text:

newsoup = '''<html><body><a href="/president/washington/speeches/speech-3460">Proclamation 
of Pardons in Western Pennsylvania (July 10, 1795)</a>, <a class="transcript" href="/president/washington/speeches/speech-3460">Transcript</a>, 
<a href="/president/washington/speeches/speech-3939">Seventh Annual Message to Congress (December 8, 1795)</a></body></html>'''

However, I'm having trouble getting at the text between > and </a>. Once I can get Proclamation of Pardons in Western Pennsylvania (July 10, 1795), I'll be set. I tried adapting another approach to my data, but I end up with an empty object.

I'm trying something like the following, but having little luck:

a = newsoup.findAll('a',attrs={'href'})
print a

I should have noted that newsoup was already a soup object.

Assuming newsoup is a soup object, this should work (if it is still a plain string, run newsoup = BeautifulSoup(newsoup) first):

a = newsoup.findAll('a')
for x in a:
    print(x.text)  # text inside each <a> tag
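
To go from the link text to the dates themselves, one option is to skip the "Transcript" links and take the parenthesized date at the end of each remaining title. The following is a minimal sketch, assuming BeautifulSoup 4 (the bs4 package) with Python's built-in html.parser; the regular expression and the names html, soup, and dates are illustrative choices, not part of the original post:

```python
import re
from bs4 import BeautifulSoup

html = '''<html><body><a href="/president/washington/speeches/speech-3460">Proclamation 
of Pardons in Western Pennsylvania (July 10, 1795)</a>, <a class="transcript" href="/president/washington/speeches/speech-3460">Transcript</a>, 
<a href="/president/washington/speeches/speech-3939">Seventh Annual Message to Congress (December 8, 1795)</a></body></html>'''

soup = BeautifulSoup(html, "html.parser")

dates = []
for link in soup.find_all("a", href=True):
    # Skip the "Transcript" links, which carry class="transcript".
    if "transcript" in link.get("class", []):
        continue
    # Collapse the line break inside the first title into single spaces.
    title = " ".join(link.get_text().split())
    # Assumed pattern: the date is the last parenthesized group in the title.
    match = re.search(r"\(([^()]*)\)\s*$", title)
    if match:
        dates.append(match.group(1))

print(dates)
```

If the titles do not always end with a parenthesized date, the pattern would need adjusting; links that do not match are simply skipped here.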

This returns the text of the first link:

a = newsoup.findAll('a')[0].contents[0]

Here newsoup is a BeautifulSoup object.

If it is still a plain string, parse it first:

newsoup = BeautifulSoup(newsoup)

You can put that in a loop:

a = newsoup.findAll('a')
for x in a:
    print(x.contents[0])  # first child of each <a> tag
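
Note that .contents[0] is only the first child of the tag. For the links in the question that is the whole title, but if a link contained nested tags it would return only the text before the first one, whereas get_text() returns everything. A small sketch of the difference, assuming BeautifulSoup 4 with html.parser; the snippet markup is hypothetical and not taken from the question:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the link text is split by a nested <em> tag.
snippet = '<a href="/x">Seventh Annual Message <em>to Congress</em> (December 8, 1795)</a>'
link = BeautifulSoup(snippet, "html.parser").find("a")

first_child = link.contents[0]  # only the text node before <em>
full_text = link.get_text()     # all text inside the link, nested tags included

print(repr(str(first_child)))
print(repr(full_text))
```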
