简体   繁体   中英

Access Beautiful soup element in Nested HTML

I wish to extract the director & actor elements from this parsed html output of IMDB top 250 page. How should the python one liner for it look like? The "text-muted text-small" appears multiple times, and find_all does not seem to be the optimum way to go about it.

<span class="ipl-rating-selector__rating-value">0</span>
</div>
<div class="ipl-rating-selector__error ipl-rating-selector__wrapper">
<span>Error: please try again.</span>
</div>
</div>
<div class="ipl-rating-interactive__loader">
<img alt="loading" src="https://m.media-amazon.com/images/G/01/IMDb/spinning-progress.gif"/>
</div>
</div>
</div>
<div class="inline-block ratings-metascore">
<span class="metascore favorable">80        </span>
        Metascore
        </div>
<p class="">
    Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.</p>
<p class="text-muted text-small">
    Director:
<a href="/name/nm0001104/">Frank Darabont</a>
<span class="ghost">|</span> 
    Stars:
<a href="/name/nm0000209/">Tim Robbins</a>, 
<a href="/name/nm0000151/">Morgan Freeman</a>, 
<a href="/name/nm0348409/">Bob Gunton</a>, 
<a href="/name/nm0006669/">William Sadler</a>
</p>
<p class="text-muted text-small">
<span class="text-muted">Votes:</span>
<span data-value="2187696" name="nv">2,187,696</span>
<span class="ghost">|</span> <span class="text-muted">Gross:</span>
<span data-value="28,341,469" name="nv">$28.34M</span>
</p>
<div class="wtw-option-standalone" data-baseref="wl_li" data-tconst="tt0111161" data-watchtype="minibar"></div>
</div>

If you are using BeautifulSoup 4.7.0 or higher, you can use the :contains CSS selector:

soup = BeautifulSoup(your_html)
soup.select_one('p:contains("Director:","Stars:")')

This will select the containing p tag and iterate over it's children, printing out Directors and Actors separately:

director_and_stars_tag = soup.select_one('p:contains("Director:")')
directors_flag = True

for name_tag in director_and_stars_tag.findChildren():
    if directors_flag:
        # These are Director tags
        if ('span' in name_tag.name):
            directors_flag = False
        else:
            print('Director: %s' % name_tag.string)
    else:
        # These are Actor tags
        print('Actor: %s' % name_tag.string)

Output:

Director: Frank Darabont
Actor: Tim Robbins
Actor: Morgan Freeman
Actor: Bob Gunton
Actor: William Sadler

If there's no id or class that you can use to identify those specific elements, You can simply iterate through your items and check if they contain what you're looking for.
A working example on your html sample would be

details = soup.find_all("p", attrs={"class": "text-muted text-small"})
for element in details:
    if "Stars" in element.text:
        stars = element.find_all("a")
        for star in stars:
            print(star.text)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM