
Access Beautiful Soup element in nested HTML

I want to extract the director and actor elements from this parsed HTML output of the IMDb Top 250 page. What should a Python one-liner for this look like? The class "text-muted text-small" appears multiple times, so find_all on its own does not seem to be the best way to go about it.

<span class="ipl-rating-selector__rating-value">0</span>
</div>
<div class="ipl-rating-selector__error ipl-rating-selector__wrapper">
<span>Error: please try again.</span>
</div>
</div>
<div class="ipl-rating-interactive__loader">
<img alt="loading" src="https://m.media-amazon.com/images/G/01/IMDb/spinning-progress.gif"/>
</div>
</div>
</div>
<div class="inline-block ratings-metascore">
<span class="metascore favorable">80        </span>
        Metascore
        </div>
<p class="">
    Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.</p>
<p class="text-muted text-small">
    Director:
<a href="/name/nm0001104/">Frank Darabont</a>
<span class="ghost">|</span> 
    Stars:
<a href="/name/nm0000209/">Tim Robbins</a>, 
<a href="/name/nm0000151/">Morgan Freeman</a>, 
<a href="/name/nm0348409/">Bob Gunton</a>, 
<a href="/name/nm0006669/">William Sadler</a>
</p>
<p class="text-muted text-small">
<span class="text-muted">Votes:</span>
<span data-value="2187696" name="nv">2,187,696</span>
<span class="ghost">|</span> <span class="text-muted">Gross:</span>
<span data-value="28,341,469" name="nv">$28.34M</span>
</p>
<div class="wtw-option-standalone" data-baseref="wl_li" data-tconst="tt0111161" data-watchtype="minibar"></div>
</div>

If you are using BeautifulSoup 4.7.0 or higher, you can use the :contains CSS selector (Soup Sieve 2.1 and later spell it :-soup-contains and treat :contains as a deprecated alias):

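from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html, 'html.parser')  # name the parser explicitly
soup.select_one('p:contains("Director:","Stars:")')  # matches a <p> containing either string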
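To see the selector pick the right paragraph when two p tags share the same class, here is a minimal self-contained sketch. The `html` string is a trimmed copy of the markup from the question, and the names `credits` and `names` are only illustrative. It assumes Soup Sieve 2.1 or later, where the selector is spelled `:-soup-contains`.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question: two <p> tags share one class.
html = """
<div>
<p class="text-muted text-small">
    Director:
<a href="/name/nm0001104/">Frank Darabont</a>
<span class="ghost">|</span>
    Stars:
<a href="/name/nm0000209/">Tim Robbins</a>,
<a href="/name/nm0000151/">Morgan Freeman</a>
</p>
<p class="text-muted text-small">
<span class="text-muted">Votes:</span>
<span data-value="2187696" name="nv">2,187,696</span>
</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# A comma-separated list matches when ANY of the strings is present,
# so the "Votes:" paragraph is skipped.
credits = soup.select_one('p:-soup-contains("Director:", "Stars:")')

# All linked names in that paragraph, director first, then the stars.
names = [a.get_text(strip=True) for a in credits.select("a")]
print(names)
```

Note that this returns the director and the stars in one list; separating them takes one more step.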

This selects the containing p tag and iterates over its children, printing directors and actors separately:

director_and_stars_tag = soup.select_one('p:contains("Director:")')
directors_flag = True

# find_all() with no arguments returns every descendant tag in document order
for name_tag in director_and_stars_tag.find_all():
    if directors_flag:
        # Tags before the <span class="ghost">|</span> separator are directors
        if name_tag.name == 'span':
            directors_flag = False
        else:
            print('Director: %s' % name_tag.string)
    else:
        # Tags after the separator are actors
        print('Actor: %s' % name_tag.string)

Output:

Director: Frank Darabont
Actor: Tim Robbins
Actor: Morgan Freeman
Actor: Bob Gunton
Actor: William Sadler
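If you would rather collect the names than print them, one possible variation is to switch on the "Director:" and "Stars:" label text instead of on the span separator. The sketch below is an illustration under that assumption; `split_credits` is a hypothetical helper name, not part of BeautifulSoup, and the `html` string is copied from the question.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = """<p class="text-muted text-small">
    Director:
<a href="/name/nm0001104/">Frank Darabont</a>
<span class="ghost">|</span>
    Stars:
<a href="/name/nm0000209/">Tim Robbins</a>,
<a href="/name/nm0000151/">Morgan Freeman</a>,
<a href="/name/nm0348409/">Bob Gunton</a>,
<a href="/name/nm0006669/">William Sadler</a>
</p>"""


def split_credits(p_tag):
    """Hypothetical helper: group the <a> names under the label that precedes them."""
    credits = {"directors": [], "stars": []}
    current = None  # which list the next <a> belongs to
    for node in p_tag.children:
        if isinstance(node, Tag):
            # Only links carry names; the <span> separator is ignored.
            if node.name == "a" and current is not None:
                credits[current].append(node.get_text(strip=True))
        elif isinstance(node, NavigableString):
            # Bare text between tags holds the labels.
            text = str(node)
            if "Director" in text:
                current = "directors"
            elif "Star" in text:
                current = "stars"
    return credits


soup = BeautifulSoup(html, "html.parser")
result = split_credits(soup.find("p"))
print(result)
```

Because the check is a substring test, a plural "Directors:" label would be handled the same way.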

If there is no id or class that you can use to identify those specific elements, you can simply iterate through your items and check whether they contain what you're looking for.
A working example on your HTML sample would be:

details = soup.find_all("p", attrs={"class": "text-muted text-small"})
for element in details:
    if "Stars" in element.text:
        stars = element.find_all("a")
        for star in stars:
            print(star.text)
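Note that the loop above prints every link in the matching paragraph, so the director's name is listed along with the stars. If only the stars are wanted, one option is to locate the "Stars:" text node and take the links that follow it. This is a sketch on a trimmed copy of the question's markup; the names `label` and `stars` are illustrative.

```python
import re

from bs4 import BeautifulSoup

html = """
<p class="text-muted text-small">
    Director:
<a href="/name/nm0001104/">Frank Darabont</a>
<span class="ghost">|</span>
    Stars:
<a href="/name/nm0000209/">Tim Robbins</a>,
<a href="/name/nm0000151/">Morgan Freeman</a>
</p>
<p class="text-muted text-small">
<span class="text-muted">Votes:</span>
<span data-value="2187696" name="nv">2,187,696</span>
</p>
"""

soup = BeautifulSoup(html, "html.parser")

stars = []
for p in soup.find_all("p", attrs={"class": "text-muted text-small"}):
    # Find the text node that holds the "Stars:" label, if this <p> has one.
    label = p.find(string=re.compile("Stars"))
    if label is None:
        continue  # e.g. the Votes/Gross paragraph
    # Only the <a> siblings after the label are stars; the director link precedes it.
    stars.extend(a.get_text(strip=True) for a in label.find_next_siblings("a"))

print(stars)
```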
