
How do you extract a substring of a string when its siblings have a parent with the same name (BeautifulSoup)

For example:

<ul class="key-dates">
            
                <li>
                    Birthday: Monday 26 April 2021
                </li>
                <li>
                    Christmas: Saturday 25 December 2021
                </li>
                <li>
                    New Years: Saturday 1 January 2021
                </li>
            
        </ul>

Say I just wanted to pull out the birthday date; how would I do so?

import requests
import bs4

info = requests.get('url')

You can use a CSS selector (:-soup-contains; the older :contains spelling is deprecated):

from bs4 import BeautifulSoup

html_doc = """
<ul class="key-dates">
            
                <li>
                    Birthday: Monday 26 April 2021
                </li>
                <li>
                    Christmas: Saturday 25 December 2021
                </li>
                <li>
                    New Years: Saturday 1 January 2021
                </li>
            
        </ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")

birthday = soup.select_one('.key-dates li:-soup-contains("Birthday")')
print(birthday.text.strip())

Prints:

Birthday: Monday 26 April 2021

Or without CSS:

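# string= replaces the deprecated text= argument; the "t and" guard skips tags whose .string is None
birthday = soup.find("li", string=lambda t: t and "Birthday" in t)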
print(birthday.text.strip())
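
Both snippets above print the whole item, label included. If only the date is wanted, the text can be split on the first colon. The following is a minimal sketch, not part of the original answer: extract_date is a hypothetical helper name, and it assumes every item has the "Label: value" shape shown in the question and that the installed soupsieve supports :-soup-contains.

```python
from bs4 import BeautifulSoup

html_doc = """
<ul class="key-dates">
    <li>
        Birthday: Monday 26 April 2021
    </li>
    <li>
        Christmas: Saturday 25 December 2021
    </li>
    <li>
        New Years: Saturday 1 January 2021
    </li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")


def extract_date(soup, label):
    """Hypothetical helper: return the text after 'label:' or None if absent."""
    # Find the first <li> under .key-dates whose text contains the label.
    item = soup.select_one(f'.key-dates li:-soup-contains("{label}")')
    if item is None:
        return None
    # partition splits on the first colon only, so a colon inside the value survives.
    _, sep, value = item.get_text(strip=True).partition(":")
    return value.strip() if sep else None


print(extract_date(soup, "Birthday"))  # Monday 26 April 2021
```

Returning None for a missing label keeps the caller from hitting an AttributeError on a failed lookup.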

Unfortunately, there is no 100% reliable way of doing this, because the li tags don't have any unique identifiers. The best approach is to find the closest parent tag that does have a unique identifier and search for the birthday item from there.

In this case it would look something like:

from bs4 import BeautifulSoup
import requests

def get_source(url):
    return BeautifulSoup(requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text, 'html.parser')

soup = get_source('url')

ul_list = soup.find('ul', class_='key-dates') # the parent ul tag with class='key-dates', including its children
# The li text is padded with whitespace, so an exact string match would return None;
# strip it and match on the label instead of hard-coding the full date.
list_item = ul_list.find('li', string=lambda s: s and s.strip().startswith('Birthday:'))

print(list_item)                                     # the whole <li> tag, whitespace included
print(list_item.text.strip())                        # Birthday: Monday 26 April 2021
print(list_item.text.split('Birthday: ')[1].strip()) # Monday 26 April 2021
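
Since the unique identifier sits on the parent ul, another option is to read every item under it once and look dates up by label afterwards. This is only a sketch under the same assumption that each item looks like "Label: value"; the key_dates dictionary is an illustrative name, not something from the original page.

```python
from bs4 import BeautifulSoup

html_doc = """
<ul class="key-dates">
    <li>
        Birthday: Monday 26 April 2021
    </li>
    <li>
        Christmas: Saturday 25 December 2021
    </li>
    <li>
        New Years: Saturday 1 January 2021
    </li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Map each label to its date, using only the direct <li> children of ul.key-dates.
key_dates = {}
for li in soup.select("ul.key-dates > li"):
    label, sep, value = li.get_text(strip=True).partition(":")
    if sep:  # skip items that do not follow the "Label: value" shape
        key_dates[label.strip()] = value.strip()

print(key_dates["Birthday"])  # Monday 26 April 2021
```

With the dictionary built, key_dates.get("Birthday") gives the date or None without another search of the tree.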
