Find all tags containing a string in BeautifulSoup

In BeautifulSoup, I can use find_all(string='example') to find all NavigableStrings that match against a string or regex.

Is there a way to do this using get_text() instead of string, so that the search matches a string even if it spans multiple nodes? That is, I'd want to do something like find_all(get_text()='Python BeautifulSoup'), which would match against the entire inner text content.

For example, take this snippet:

<body>
  <div>
    Python
    <br>
    BeautifulSoup
  </div>
</body>

If I wanted to find 'Python BeautifulSoup' and have it return both the body and div tags, how could I accomplish this?

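You could use CSS selectors in combination with the pseudo-class :-soup-contains-own(). Note that it matches only the tag whose own text nodes contain the string (the div here), not its ancestors: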

soup.select_one(':-soup-contains-own("BeautifulSoup")')

Or, to get only the text of the element:

soup.select_one(':-soup-contains-own("BeautifulSoup")').get_text(' ', strip=True)

Example

from bs4 import BeautifulSoup

html = '''
<body>
  <div>
    Python
    <br>
    BeautifulSoup
  </div>
</body>
'''
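# Name the parser explicitly to avoid the "no parser was explicitly specified" warning.
soup = BeautifulSoup(html, 'html.parser')

print(soup.select(':-soup-contains-own("BeautifulSoup")'))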

Output

[<div>
 Python
 <br/>
 BeautifulSoup
</div>]
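
To make the difference between the two pseudo-classes concrete, here is a minimal sketch comparing :-soup-contains-own() with :-soup-contains(). It assumes a reasonably recent soupsieve (the selector library bundled with bs4) and reuses the snippet from the question. Neither selector normalizes whitespace, so the two-word phrase from the question is not expected to match on its own.

```python
from bs4 import BeautifulSoup

html = '''
<body>
  <div>
    Python
    <br>
    BeautifulSoup
  </div>
</body>
'''
soup = BeautifulSoup(html, 'html.parser')

# Only tags whose own (direct) text nodes contain the string.
own = [tag.name for tag in soup.select(':-soup-contains-own("BeautifulSoup")')]

# Tags whose text, including the text of descendants, contains the string.
nested = [tag.name for tag in soup.select(':-soup-contains("BeautifulSoup")')]

# The raw text keeps the newlines and indentation between the two words,
# so a phrase written with a single space does not match here.
spanning = soup.select(':-soup-contains("Python BeautifulSoup")')

print(own)       # ['div']
print(nested)    # ['body', 'div']
print(spanning)  # []
```

So the CSS route returns both body and div for a single word, but for a phrase that spans nodes with arbitrary whitespace in between, a function-based filter is probably the easier fit.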

You can pass a lambda function to .find_all() and test each tag's get_text() result:

from bs4 import BeautifulSoup

html_doc = '''\
<body>
  <div>
    Python
    <br>
    BeautifulSoup
  </div>
</body>'''

soup = BeautifulSoup(html_doc, 'html.parser')

for tag in soup.find_all(lambda tag: 'Python BeautifulSoup' in tag.get_text(strip=True, separator=' ')):
    print(tag.name)

Prints:

body
div
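
If this is needed in more than one place, the same idea can be wrapped in a small helper. The sketch below is only one possible arrangement: normalized_text, find_tags_containing and innermost are hypothetical names, not part of BeautifulSoup. It collapses all whitespace before comparing, and optionally keeps only the deepest matching tags.

```python
from bs4 import BeautifulSoup


def normalized_text(tag):
    # Collapse every run of whitespace (including newlines) into one space.
    return ' '.join(tag.get_text().split())


def find_tags_containing(soup, needle):
    # Every tag whose combined, whitespace-normalized text contains needle.
    return soup.find_all(lambda tag: needle in normalized_text(tag))


def innermost(matches):
    # Keep only the matches that have no matching descendant tag.
    match_ids = {id(tag) for tag in matches}
    return [tag for tag in matches
            if not any(id(child) in match_ids for child in tag.find_all(True))]


html_doc = '''\
<body>
  <div>
    Python
    <br>
    BeautifulSoup
  </div>
</body>'''

soup = BeautifulSoup(html_doc, 'html.parser')

matches = find_tags_containing(soup, 'Python BeautifulSoup')
all_names = [tag.name for tag in matches]
deepest_names = [tag.name for tag in innermost(matches)]

print(all_names)      # ['body', 'div']
print(deepest_names)  # ['div']
```

Calling get_text() on every tag re-reads the text of each subtree, so on very large documents this may be slow; for typical pages it should be fine.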
