What's wrong with my soup?

Question

I am using Python with BeautifulSoup 4 to find links in an HTML page that match a particular regular expression. I can find links, and I can find text matching the regex, but combining the two in one search does not work. Here's my code:

import re
import bs4

s = '<a href="javascript://">Sign in&nbsp;<br /></a>'

soup = bs4.BeautifulSoup(s)

match = re.compile(r'sign\s?in', re.IGNORECASE)

print soup.find_all(text=match)  # [u'Sign in\xa0']
print soup.find_all(name='a')[0].text  # Sign in 

print soup.find_all('a', text=match) # []

The comments show the output. As you can see, the combined search returns no results, which is strange.

It seems to have something to do with the "br" tag (or any tag) contained inside the link text. If you delete it, everything works as expected.

Answer 1

You can either look for the tag or look for its text content, but not both together.

Given that, for your call:

self.name = u'a'
self.text = SRE_Pattern: <_sre.SRE_Pattern object at 0xd52a58>

and this excerpt from the source:

# If it's text, make sure the text matches.
elif isinstance(markup, NavigableString) or \
         isinstance(markup, basestring):
    if not self.name and not self.attrs and self._matches(markup, self.text):
        found = markup

this makes @Totem's remark the way to go, by design.
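
A possible workaround, sketched under some assumptions: find_all also accepts a function that receives each tag and returns True or False, so the tag name and the text can be checked in one pass. The helper name is_matching_link is made up for this example, and the 'html.parser' argument is my choice of parser, not something from the question. The sketch also prints the link's .string, which is None when a tag has more than one child; that may be why removing the "br" changes the result, but I have not checked it against every bs4 version.

```python
import re
from bs4 import BeautifulSoup

html = '<a href="javascript://">Sign in&nbsp;<br /></a>'
soup = BeautifulSoup(html, 'html.parser')  # parser chosen for this sketch

pattern = re.compile(r'sign\s?in', re.IGNORECASE)


def is_matching_link(tag):
    """Hypothetical helper: True for <a> tags whose full text matches."""
    # get_text() joins all descendant strings, so the <br /> does not matter.
    return tag.name == 'a' and pattern.search(tag.get_text()) is not None


links = soup.find_all(is_matching_link)
print(len(links))

# The link has two children (a string and the <br /> tag), so .string is None.
print(soup.a.string)
```

Passing a function keeps the name test and the text test in one place, at the cost of calling get_text() on every "a" tag in the document.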

1 answer

solution1
0 2014-02-20 02:11:05