Python BeautifulSoup find_all with regex doesn't match text

Question

I have the following HTML code:

<a class="nav-link" href="https://cbd420.ch/fr/tous-les-produits/">
<span class="cbp-tab-title">
                                Shop <i class="fa fa-angle-down cbp-submenu-aindicator"></i></span>
</a>

I would like to get the anchor tag that has Shop as its text, disregarding the whitespace before and after. I have tried the following code, but I keep getting an empty list:

import re
from bs4 import BeautifulSoup
html  = """<a class="nav-link" href="https://cbd420.ch/fr/tous-les-produits/">
<span class="cbp-tab-title">
                                Shop <i class="fa fa-angle-down cbp-submenu-aindicator"></i></span>
</a>"""
soup = BeautifulSoup(html, 'html.parser')
prog = re.compile(r'\s*Shop\s*')
print(soup.find_all("a", string=prog))
# Output: []

I also tried retrieving the text using get_text():

text = soup.find_all("a")[0].get_text()
print(repr(text))
# Output: '\n\n\t\t\t\t\t\t\t\tShop \n'

I then ran the following code to check that my regex was right, which seems to be the case:

result = prog.match(text)
print(repr(result.group()))
# Output: '\n\n\t\t\t\t\t\t\t\tShop \n'

I also tried selecting span instead of a, but I get the same issue. I'm guessing it has something to do with find_all. I have read the BeautifulSoup documentation but I still can't find the issue. Any help would be appreciated. Thanks!

Answer 1

The problem is that the text you are looking for sits in a tag that also contains child tags. When a tag has more than one child, its .string property is None, so the string argument of find_all has nothing to match against.
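
To make this concrete, here is a small sketch that inspects .string on the tags from the question. The variable names (a_string, span_string, text_nodes) are only illustrative, and the indentation inside the HTML is shortened.

```python
from bs4 import BeautifulSoup

html = """<a class="nav-link" href="https://cbd420.ch/fr/tous-les-produits/">
<span class="cbp-tab-title">
    Shop <i class="fa fa-angle-down cbp-submenu-aindicator"></i></span>
</a>"""
soup = BeautifulSoup(html, "html.parser")

a_tag = soup.find("a")
span_tag = soup.find("span")

# Both tags have more than one child, so .string is None for each of them
a_string = a_tag.string
span_string = span_tag.string
print(a_string, span_string)

# The text still exists as a NavigableString nested inside the span
text_nodes = [s for s in span_tag.strings if s.strip()]
print([repr(s) for s in text_nodes])
```

This is why the same regex matches the result of get_text() but not the tag itself: get_text() joins all nested strings, while the string filter only looks at .string.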

You can pass a lambda expression to .find instead. Since you are looking for a fixed string, a plain 'Shop' in t.text check is enough and no regex is needed:

soup.find(lambda t: t.name == "a" and 'Shop' in t.text)
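
If the match should be exact rather than a substring test (so that, for example, "Shopping" is not picked up), one possible variation is to compare the stripped text. The helper name is_shop_link below is hypothetical, and this is a sketch rather than the only way to do it.

```python
from bs4 import BeautifulSoup

html = """<a class="nav-link" href="https://cbd420.ch/fr/tous-les-produits/">
<span class="cbp-tab-title">
    Shop <i class="fa fa-angle-down cbp-submenu-aindicator"></i></span>
</a>"""
soup = BeautifulSoup(html, "html.parser")


def is_shop_link(tag):
    """Hypothetical helper: True for <a> tags whose stripped text is exactly 'Shop'."""
    # get_text(strip=True) strips every nested string and drops the empty ones
    return tag.name == "a" and tag.get_text(strip=True) == "Shop"


shop_links = soup.find_all(is_shop_link)
for link in shop_links:
    print(link["href"])
```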

Answer 2

The text Shop you are searching for is inside the span tag, so matching the a tag with a regular expression fails to find it.

You can use the regex to find the text node itself and then walk up to its parent tags.

import re
from bs4 import BeautifulSoup
html  = """<a class="nav-link" href="https://cbd420.ch/fr/tous-les-produits/">
<span class="cbp-tab-title">
                                Shop <i class="fa fa-angle-down cbp-submenu-aindicator"></i></span>
</a>"""
soup = BeautifulSoup(html, 'html.parser')
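# string= is the current name of the older text= argument
# The text node's parent is the span; the span's parent is the a tag
print(soup.find(string=re.compile('Shop')).parent.parent)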
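
Chaining .parent.parent depends on the exact nesting depth. A slightly more defensive sketch, assuming the text always sits somewhere inside an a tag, uses find_parent("a") and guards against a missing match. The names text_node and link are illustrative.

```python
import re
from bs4 import BeautifulSoup

html = """<a class="nav-link" href="https://cbd420.ch/fr/tous-les-produits/">
<span class="cbp-tab-title">
    Shop <i class="fa fa-angle-down cbp-submenu-aindicator"></i></span>
</a>"""
soup = BeautifulSoup(html, "html.parser")

# Find the text node that contains the word "Shop"
text_node = soup.find(string=re.compile(r"\bShop\b"))

# Walk up to the nearest enclosing <a>, however deeply the text is nested
link = text_node.find_parent("a") if text_node is not None else None
print(link["href"] if link is not None else "no match")
```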

If you have BS 4.7.1 or above, you can use the following CSS selector. (In newer Soup Sieve releases, :contains is deprecated in favor of :-soup-contains.)

from bs4 import BeautifulSoup
html  = """<a class="nav-link" href="https://cbd420.ch/fr/tous-les-produits/">
<span class="cbp-tab-title">
                                Shop <i class="fa fa-angle-down cbp-submenu-aindicator"></i></span>
</a>"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('a:contains("Shop")'))

solution1
1 ACCEPTED 2020-04-30 14:17:46

solution2
0 2020-04-30 14:47:34