
Python BeautifulSoup: Loop through divs and multiple elements

I have a website containing film listings, and I've put together a simplified version of its HTML. Please note that on the real site the <ul> tags are not direct children of the film_listing or showtime elements; they are nested under several <div> or <ul> elements.

<li class="film_listing">
       <h3 class="film_title">James Bond</h3>
       <ul class="showtimes">
              <li class="showtime">
                     <p class="start_time">15:00</p>
              </li>
              <li class="showtime">
                     <p class="start_time">19:00</p>
                     <ul class="attributes">
                            <li class="audio_desc">
                            </li>
                            <li class="open_cap">
                            </li>
                     </ul>
              </li>
       </ul>
</li>

I have created a Python script to scrape the website. It currently lists all film titles with the first showtime and first attribute of each. However, I am trying to list all showtimes. The final aim is to list only the film titles that have open-captioned performances, along with the showtimes of those performances.

Here is the Python script with a nested for loop. It doesn't work: it prints all showtimes for all films, rather than the showtimes for a specific film. It is also not yet set up to list only captioned films. I suspect the logic may be wrong and would appreciate any advice. Thanks!

for i in soup.findAll('li', {'class':'film_listing'}):
    film_title=i.find('h3', {'class':'film_title'}).text  
    print(film_title)
 
    for j in soup.findAll('li', {'class':'showtime'}):
            print(j['showtime.text'])   

    #For the time listings, find ones with Open Captioned
    i=filmlisting.find('li', {'class':'open_cap'})
    print(film_access)

Edit: small correction to the HTML snippet.

There are many ways to extract the information. One way is to "search backwards": search for the <li> with class="open_cap", then find the previous start time and film title:

from bs4 import BeautifulSoup


txt = '''
<li class="film_listing">
       <h3 class="film_title">James Bond</h3>
       <ul class="showtimes">
              <li class="showtime">
                     <p class="start_time">15:00</p>
              </li>
              <li class="showtime">
                     <p class="start_time">19:00</p>
                     <ul class="attributes">
                            <li class="audio_desc">
                            </li>
                            <li class="open_cap">
                            </li>
                     </ul>
              </li>
       </ul>
</li>'''

soup = BeautifulSoup(txt, 'html.parser')


for open_cap in soup.select('.open_cap'):
    print('Name       :', open_cap.find_previous(class_='film_title').text)
    print('Start time :', open_cap.find_previous(class_='start_time').text)
    print('-' * 80)

Prints:

Name       : James Bond
Start time : 19:00
--------------------------------------------------------------------------------
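
The backward search can also be checked against more than one film. The following is a minimal sketch, not part of the original answer: the second listing ("Dune") and the helper name open_caption_showings are made up for illustration. It assumes that, as in the sample HTML, each start time appears before its attribute list, so the nearest preceding start_time belongs to the same showing.

```python
from bs4 import BeautifulSoup

# Sample HTML with two films; the second listing is invented for illustration.
html = '''
<li class="film_listing">
  <h3 class="film_title">James Bond</h3>
  <ul class="showtimes">
    <li class="showtime"><p class="start_time">15:00</p></li>
    <li class="showtime">
      <p class="start_time">19:00</p>
      <ul class="attributes"><li class="open_cap"></li></ul>
    </li>
  </ul>
</li>
<li class="film_listing">
  <h3 class="film_title">Dune</h3>
  <ul class="showtimes">
    <li class="showtime">
      <p class="start_time">14:00</p>
      <ul class="attributes"><li class="open_cap"></li></ul>
    </li>
    <li class="showtime"><p class="start_time">20:30</p></li>
  </ul>
</li>'''


def open_caption_showings(markup):
    """Return (title, start_time) pairs for every open-captioned showing."""
    soup = BeautifulSoup(markup, 'html.parser')
    results = []
    for open_cap in soup.select('.open_cap'):
        # find_previous walks backwards in document order, so it finds the
        # nearest preceding title and start time regardless of nesting depth.
        title = open_cap.find_previous(class_='film_title').get_text(strip=True)
        start = open_cap.find_previous(class_='start_time').get_text(strip=True)
        results.append((title, start))
    return results


for title, start in open_caption_showings(html):
    print(title, start)
```

If the real page places the attribute list before the start time, the backward search would pick up the previous showing's time, so it is worth checking the element order on the live site first.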

Content of read.html:

<li class="film_listing">
  <h3 class="film_title">James Bond</h3>
  <ul class="showtimes">
    <li class="showtime">
      <p class="start_time">15: 00</p>
    </li>
    <li class="showtime">
      <p class="start_time">19:00</p>
      <ul class="attributes">
        <li class="audio_desc"></li>
        <li class="open_cap"></li>
      </ul>
    </li>
  </ul>
</li>

As you said, the <ul> tags are not direct children of the film_listing or showtime elements. Both find() and find_all() search all descendants, not just direct children: find() returns the first element matching the given tag name, and find_all() returns a list of all matching elements. You can try this:

    from bs4 import BeautifulSoup as bs

    # Read the saved page; the with-block closes the file automatically
    with open("read.html", "r") as f:
        soup = bs(f.read(), 'html.parser')

    for listing in soup.find_all("li", class_="film_listing"):
        # find() returns only the first match inside each listing
        print("Film name: ", listing.find("h3", class_="film_title").text)
        print("Start time: ", listing.find("p", class_="start_time").text)

Output:

Film name:  James Bond
Start time:  15: 00

Instead of find(), you can use the find_all() method, which returns all tags with the name <p> and the class start_time.
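
To reach the goal stated in the question (only films with open captions, and only those showtimes), the find_all() call can be scoped to each listing rather than to the whole soup. The following is a minimal sketch, assuming the simplified HTML from the question; the function name captioned_times_by_film is hypothetical.

```python
from bs4 import BeautifulSoup

# Simplified HTML from the question.
html = '''
<li class="film_listing">
       <h3 class="film_title">James Bond</h3>
       <ul class="showtimes">
              <li class="showtime">
                     <p class="start_time">15:00</p>
              </li>
              <li class="showtime">
                     <p class="start_time">19:00</p>
                     <ul class="attributes">
                            <li class="audio_desc"></li>
                            <li class="open_cap"></li>
                     </ul>
              </li>
       </ul>
</li>'''

soup = BeautifulSoup(html, 'html.parser')


def captioned_times_by_film(soup):
    """Map each film title to the start times of its open-captioned showings."""
    films = {}
    for listing in soup.find_all('li', class_='film_listing'):
        title = listing.find('h3', class_='film_title').get_text(strip=True)
        times = []
        # Search inside this listing only: listing.find_all, not soup.find_all.
        for showtime in listing.find_all('li', class_='showtime'):
            # Keep the showing only if it has an open_cap attribute somewhere below it.
            if showtime.find('li', class_='open_cap') is not None:
                start = showtime.find('p', class_='start_time')
                times.append(start.get_text(strip=True))
        if times:
            films[title] = times
    return films


for title, times in captioned_times_by_film(soup).items():
    print(title, times)
```

Calling find_all() on listing instead of soup is what keeps the inner loop from printing every film's showtimes under every title. Because find() and find_all() search all descendants, the extra <div> or <ul> wrappers on the real site should not matter.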
