
How do you extract a substring of a string when its siblings have a parent with the same name (BeautifulSoup)

For example:

<ul class="key-dates">
            
                <li>
                    Birthday: Monday 26 April 2021
                </li>
                <li>
                    Christmas: Saturday 25 December 2021
                </li>
                <li>
                    New Years: Saturday 1 January 2021
                </li>
            
        </ul>

Say I just wanted to pull out the birthday date. How would I do so?

import requests
import bs4

info = requests.get('url')

You can use a CSS selector with the :-soup-contains pseudo-class (the older :contains spelling is deprecated):

from bs4 import BeautifulSoup

html_doc = """
<ul class="key-dates">
            
                <li>
                    Birthday: Monday 26 April 2021
                </li>
                <li>
                    Christmas: Saturday 25 December 2021
                </li>
                <li>
                    New Years: Saturday 1 January 2021
                </li>
            
        </ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")

birthday = soup.select_one('.key-dates li:-soup-contains("Birthday")')
print(birthday.text.strip())

Prints:

Birthday: Monday 26 April 2021

Or without CSS:

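# string= replaces the deprecated text= argument; the "t and" guard skips tags whose .string is None
birthday = soup.find("li", string=lambda t: t and "Birthday" in t)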
print(birthday.text.strip())
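
Both lookups return the whole item, label included. If only the date is wanted, one option is to split each item on its first colon. The following is a minimal sketch rather than a definitive solution: it assumes every item has the shape "label: value", the helper name key_dates is hypothetical, and the markup is the same as above with the indentation trimmed.

```python
from bs4 import BeautifulSoup

html_doc = """
<ul class="key-dates">
    <li>
        Birthday: Monday 26 April 2021
    </li>
    <li>
        Christmas: Saturday 25 December 2021
    </li>
    <li>
        New Years: Saturday 1 January 2021
    </li>
</ul>
"""


def key_dates(html):
    """Map each label (text before the first colon) to the text after it."""
    soup = BeautifulSoup(html, "html.parser")
    dates = {}
    for li in soup.select(".key-dates li"):
        # get_text(strip=True) drops the whitespace padding around the item text
        label, sep, value = li.get_text(strip=True).partition(":")
        if sep:  # skip items that do not look like "label: value"
            dates[label.strip()] = value.strip()
    return dates


dates = key_dates(html_doc)
print(dates["Birthday"])  # Monday 26 April 2021
```

Building a dictionary once also makes the other entries available by name, for example dates["Christmas"].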

Unfortunately, there is no fully reliable way to do this, because the li tags have no unique identifiers. The best approach is to find the closest parent tag that does have a unique identifier and look for the birthday item from there.

In this case it would look something like:

from bs4 import BeautifulSoup
import requests

def get_source(url):
    return BeautifulSoup(requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text, 'html.parser')

soup = get_source('url')

ul_list = soup.find('ul', class_='key-dates') # Gets the parent ul tag with class='key-dates', including its children
# Match on a substring: the li text is padded with whitespace, so an exact string would not match.
list_item = ul_list.find('li', string=lambda s: s and 'Birthday' in s)

print(list_item)                                    # the matching <li> tag, whitespace included
print(list_item.text.strip())                       # Birthday: Monday 26 April 2021
print(list_item.text.split('Birthday:')[1].strip()) # Monday 26 April 2021
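
Once the date text is isolated, it can be turned into a real date object if you need to compare or reformat it. This is a small sketch using only the standard library; the helper name parse_key_date is hypothetical, and it assumes the page always uses the "weekday day month year" layout shown in the question and that the program runs under an English locale, since %A and %B match locale-specific names.

```python
from datetime import datetime


def parse_key_date(text):
    """Parse a string such as 'Monday 26 April 2021' into a datetime.date."""
    # %A = full weekday name, %d = day of month, %B = full month name, %Y = year
    return datetime.strptime(text.strip(), "%A %d %B %Y").date()


birthday = parse_key_date("Monday 26 April 2021")
print(birthday.isoformat())  # 2021-04-26
```

strptime raises ValueError when the text does not match the format, so wrap the call in try/except if the page layout might vary.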
