BeautifulSoup String Search

I have been googling and reading other questions here about how to search for a string in a BeautifulSoup object.

According to what I found, the following should detect the string, but it doesn't:

strings = soup.find_all(string='Results of Operations and Financial Condition')

However, the following detects the string:

import re

tags = soup.find_all('div', {'class': 'info'})
for tag in tags:
    # tag.text is the combined text of the div and everything inside it
    if re.search('Results of Operations and Financial Condition', tag.text):
        pass  # do something

Why does one work and the other not?

You might want to use:

strings = soup.find_all(string=lambda x: 'Results of Operations and Financial Condition' in x)

This happens because find_all, when given a plain string, only returns text nodes that equal that string exactly. I suppose you have some other text next to 'Results of Operations and Financial Condition' in the same text node. Your second snippet works because re.search looks for the phrase anywhere inside tag.text, so the surrounding text does not matter.

According to the docs, you can also pass a function as the string argument, and the following two lines appear to be equivalent:

soup.find_all(string='Results of Operations and Financial Condition')
soup.find_all(string=lambda x: x == 'Results of Operations and Financial Condition')
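
As a small illustration of the difference, here is a sketch that runs against a made-up snippet of HTML rather than your real page. The markup, the "Item 2.02" prefix, and the variable names are assumptions chosen only to show the phrase sitting inside a longer text node.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the phrase is only part of a longer text node.
html = '<div class="info">Item 2.02 Results of Operations and Financial Condition</div>'
soup = BeautifulSoup(html, "html.parser")
phrase = "Results of Operations and Financial Condition"

# A plain string must equal the whole text node.
exact = soup.find_all(string=phrase)
# The same exact comparison written as a function.
exact_via_function = soup.find_all(string=lambda s: s == phrase)
# A function can test for containment instead.
contains = soup.find_all(string=lambda s: phrase in s)

print(exact)               # []
print(exact_via_function)  # []
print(contains)            # ['Item 2.02 Results of Operations and Financial Condition']
```

Both exact searches come back empty because the text node carries extra characters, while the containment test finds it.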

Consider this code:

import re
import urllib.request
import bs4

page = urllib.request.urlopen('https://en.wikipedia.org/wiki/Alloxylon_pinnatum')
sp = bs4.BeautifulSoup(page, 'html.parser')  # name the parser explicitly to avoid the bs4 warning
print(sp.find_all(string=re.compile('The pinkish-red compound flowerheads')))  # a regex searches within text nodes
print(sp.find_all(string='The pinkish-red compound flowerheads, known as'))
print(sp.find_all(string='The pinkish-red compound flowerheads, known as '))  # note the space at the end of the string

The results are:

['The pinkish-red compound flowerheads, known as ']
[]
['The pinkish-red compound flowerheads, known as ']

It looks like the string argument requires an exact match against the full value of an HTML text node; it does not check whether a text node merely contains that string. You can, however, use a regular expression to test whether a text node contains some string, as shown in the code above.
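
To try the same behaviour without fetching the page, here is a minimal sketch on an inline HTML string. The markup is a hypothetical stand-in for the Wikipedia paragraph (the link text and the rest of the sentence are made up), and re.escape is added on the assumption that the phrase should be matched literally. It also shows that a phrase spanning several text nodes is not found in any single node, which is one case where checking the tag's combined text, as in the question, is the better fit.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the <a> tag splits the sentence into three text nodes.
html = (
    "<p>The pinkish-red compound flowerheads, known as "
    "<a href='#'>inflorescences</a>, appear in spring.</p>"
)
soup = BeautifulSoup(html, "html.parser")

# re.escape makes the phrase match literally, even if it held regex metacharacters.
pattern = re.compile(re.escape("The pinkish-red compound flowerheads"))
matches = soup.find_all(string=pattern)
print(matches)  # ['The pinkish-red compound flowerheads, known as ']

# Each match is a text node; .parent gives the tag that encloses it.
print(matches[0].parent.name)  # p

# A phrase that crosses a tag boundary is not inside any single text node...
spanning = "known as inflorescences"
split_matches = soup.find_all(string=re.compile(re.escape(spanning)))
print(split_matches)  # []

# ...but it is present in the combined text of the enclosing tag.
in_combined_text = spanning in soup.p.get_text()
print(in_combined_text)  # True
```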
