I want to store the dates from the following chunk of text:
newsoup = '''<html><body><a href="/president/washington/speeches/speech-3460">Proclamation
of Pardons in Western Pennsylvania (July 10, 1795)</a>, <a class="transcript" href="/president/washington/speeches/speech-3460">Transcript</a>,
<a href="/president/washington/speeches/speech-3939">Seventh Annual Message to Congress (December 8, 1795)</a></body></html>'''
But I'm having trouble getting at the text between > and </a>. Once I get Proclamation of Pardons in Western Pennsylvania (July 10, 1795), I'll be set. I've tried adapting another approach to my specific data, but I end up with an empty object.
I'm trying something like the following, but having little luck:
a = newsoup.findAll('a',attrs={'href'})
print a
I should have noted that newsoup was already a soup object.
Assuming newsoup is a soup object, I think this should work (if it is not, you can first run newsoup = BeautifulSoup(newsoup)):
a = newsoup.findAll('a')
for x in a:
    print x.text
This will work for you:
a = newsoup.findAll('a')[0].contents[0]
where newsoup is a BeautifulSoup object.
Or else first do:
newsoup = BeautifulSoup(newsoup)
You can put that in a loop:
a = newsoup.findAll('a')
for x in a:
    print x.contents[0]
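Since the end goal is to store the dates, here is a minimal sketch that goes one step past printing the link text. It assumes Python 3 with BeautifulSoup 4 (the bs4 package, where find_all is the newer spelling of findAll), and it assumes every date sits in parentheses at the end of the link text, as in the sample. The names html, DATE_RE and dates are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

html = '''<html><body><a href="/president/washington/speeches/speech-3460">Proclamation
of Pardons in Western Pennsylvania (July 10, 1795)</a>, <a class="transcript" href="/president/washington/speeches/speech-3460">Transcript</a>,
<a href="/president/washington/speeches/speech-3939">Seventh Annual Message to Congress (December 8, 1795)</a></body></html>'''

newsoup = BeautifulSoup(html, "html.parser")

# Assumed pattern: a trailing "(Month D, YYYY)" at the end of the link text.
DATE_RE = re.compile(r"\(([A-Z][a-z]+ \d{1,2}, \d{4})\)\s*$")

dates = []
for link in newsoup.find_all("a"):
    # Collapse the newline embedded in the first link's text.
    text = " ".join(link.get_text().split())
    match = DATE_RE.search(text)
    if match:  # links such as "Transcript" carry no date and are skipped
        dates.append(match.group(1))

print(dates)
```

Using get_text() rather than contents[0] also keeps working if a link ever wraps its text in a nested tag, since contents[0] would then return that tag instead of a string.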