
What's wrong with my soup?

I am using Python with BeautifulSoup 4 to find links in an HTML page that match a particular regular expression. I can find links, and I can find text that matches the regex, but combining the two does not work. Here's my code:

import re
import bs4

s = '<a href="javascript://">Sign in&nbsp;<br /></a>'

soup = bs4.BeautifulSoup(s)

match = re.compile(r'sign\s?in', re.IGNORECASE)

print soup.find_all(text=match)  # [u'Sign in\xa0']
print soup.find_all(name='a')[0].text  # Sign in 

print soup.find_all('a', text=match) # []

The comments show the output of each call. As you can see, the combined search returns no results, which is strange.

It seems to have something to do with the "br" tag (or any tag) contained inside the link text. If you delete it, everything works as expected.
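A minimal sketch of this observation, assuming Python 3 and a bs4 release that accepts the `string=` keyword (the newer name for `text=`). The variable names are illustrative. A likely explanation is that a combined name and text search compares the pattern against the tag's `.string`, which is `None` as soon as the tag has more than one child:

```python
import re
from bs4 import BeautifulSoup

pattern = re.compile(r"sign\s?in", re.IGNORECASE)

# Same link as in the question, with and without the nested <br /> tag.
with_br = BeautifulSoup('<a href="javascript://">Sign in&nbsp;<br /></a>', "html.parser")
without_br = BeautifulSoup('<a href="javascript://">Sign in&nbsp;</a>', "html.parser")

# .string is only set when the tag has exactly one child.
string_with_br = with_br.a.string        # two children: text node and <br />
string_without_br = without_br.a.string  # one child: the text node

print(repr(string_with_br))
print(repr(string_without_br))

# Combined search: tag name plus text pattern.
matches_with_br = with_br.find_all("a", string=pattern)
matches_without_br = without_br.find_all("a", string=pattern)
print(len(matches_with_br), len(matches_without_br))
```

If this holds on your installed version, only the link without the nested tag is returned by the combined search.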

You can search either for the tag or for its text content, but not for both together.

Given that:

self.name = u'a'
self.text = SRE_Pattern: <_sre.SRE_Pattern object at 0xd52a58>

From the source:

# If it's text, make sure the text matches.
elif isinstance(markup, NavigableString) or \
         isinstance(markup, basestring):
    if not self.name and not self.attrs and self._matches(markup, self.text):
        found = markup

So, by design, @Totem's remark is the way to go.
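One possible workaround, sketched under the assumption of Python 3 with bs4 installed: search by tag name only, then apply the regex to each tag's combined text yourself. The names `html`, `pattern`, `links` and `links_via_function` are illustrative and not taken from the original post.

```python
import re
from bs4 import BeautifulSoup

html = '<a href="javascript://">Sign in&nbsp;<br /></a>'
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"sign\s?in", re.IGNORECASE)

# Step 1: find candidate tags by name only.
# Step 2: test the pattern against get_text(), which joins all nested
# text nodes, so a nested <br /> does not get in the way.
links = [a for a in soup.find_all("a") if pattern.search(a.get_text())]

# The same idea expressed as a single filter function passed to find_all.
links_via_function = soup.find_all(
    lambda tag: tag.name == "a" and pattern.search(tag.get_text()) is not None
)

print(links)
print(links_via_function)
```

Both variants do the text matching outside the built-in text filter, so they do not depend on how the library combines the two criteria.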

