
BeautifulSoup - How do I extract a substring of a string between tags?

I would like to search the HTML for "Website:" and then return "http://www.aa.com".

<br>Website:  <a href="http://www.aa.com">http://www.aa.com</a><br>

I'm not sure what to do here since there is a tag in between the two strings.

You can search for the text; the result is a NavigableString object, which retains information about where it lives in the tree, so you can ask it for its next sibling:

>>> from bs4 import BeautifulSoup
>>> import re
>>> sample = '''\
... <br>Website:  <a href="http://www.aa.com">http://www.aa.com</a><br>
... '''
>>> soup = BeautifulSoup(sample)
>>> soup.find(text=re.compile('Website:'))
u'Website:  '
>>> soup.find(text=re.compile('Website:')).next_sibling
<a href="http://www.aa.com">http://www.aa.com</a>

Once you have the <a> element, getting either the href attribute or the contained text is trivial:

>>> soup.find(text=re.compile('Website:')).next_sibling['href']
'http://www.aa.com'
>>> soup.find(text=re.compile('Website:')).next_sibling.string
u'http://www.aa.com'
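
For use outside the interactive prompt, the same lookup can be wrapped in a small function. This is only a sketch under a few assumptions: the helper name link_after_label is made up for illustration, the parser is named explicitly as "html.parser", and it uses string= (the newer spelling of the text= argument, available in bs4 4.4.0 and later) together with find_next_sibling("a"), so a missing label or missing link yields None rather than an exception.

```python
import re
from bs4 import BeautifulSoup


def link_after_label(html, label):
    """Return the href of the first <a> sibling following `label`, or None.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Find the first text node that contains the label.
    node = soup.find(string=re.compile(re.escape(label)))
    if node is None:
        return None
    # Look along the following siblings for an <a> tag, skipping anything else.
    link = node.find_next_sibling("a")
    if link is None:
        return None
    return link.get("href")


sample = '<br>Website:  <a href="http://www.aa.com">http://www.aa.com</a><br>'
print(link_after_label(sample, "Website:"))
```

Using find_next_sibling("a") instead of .next_sibling is a small design choice: it still works if stray whitespace or another node sits between the label and the link.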

Think of your content as a tree rather than a string.
BeautifulSoup gives you access to the parse tree: call find_all('a'), then navigate the tree with the .parent and .contents attributes. You can navigate to siblings too.
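
A minimal sketch of that tree-walking approach, assuming the same sample markup as in the question. The function name website_links is hypothetical, and the check on previous_sibling is just one way to tie a link back to its "Website:" label.

```python
from bs4 import BeautifulSoup

sample = '<br>Website:  <a href="http://www.aa.com">http://www.aa.com</a><br>'
soup = BeautifulSoup(sample, "html.parser")


def website_links(soup, label="Website:"):
    """Collect hrefs of <a> tags whose preceding sibling text contains `label`."""
    urls = []
    for a in soup.find_all("a"):
        before = a.previous_sibling  # node just before the <a>; may be None
        # NavigableString subclasses str, so a plain substring test works.
        if isinstance(before, str) and label in before:
            urls.append(a.get("href"))
    return urls


urls = website_links(soup)
print(urls)

# .contents lists a tag's children; here it holds the link text.
first_link = soup.find("a")
print(first_link.contents)
```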

