
Web scrape records from IMDB using selenium and clicking next page

I am trying to web scrape the top episodes from IMDB. At first I implemented it using Beautiful Soup to get the first 10,000 records, and that worked fine. However, after the first 10,000 records, the IMDB link for the next page changes from numbers to a random-looking string of letters, as shown below.

I want to be able to navigate from this page: https://www.imdb.com/search/title/?title_type=tv_episode&num_votes=600,&sort=user_rating,desc&start=9951&ref_=adv_nxt

to the next page:

https://www.imdb.com/search/title/?title_type=tv_episode&num_votes=600,&sort=user_rating,desc&after=WzguNSwidHQwOTQzNjU3IiwxMDAwMV0%3D&ref_=adv_nxt

Then I want to scrape all the records from the pages after that by clicking the next button. I want to use selenium, but I have not been able to get it to work. Any help is appreciated.
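As an aside, the string of letters in the second URL is not actually random: it looks like URL-encoded base64 of a small JSON array holding the last rating, the last title id, and the next offset. That interpretation is an observation, not documented IMDB behavior, but you can decode the token from the URL above to check:

```python
import base64
from urllib.parse import unquote

# The "after" token taken from the second URL above
token = unquote("WzguNSwidHQwOTQzNjU3IiwxMDAwMV0%3D")
print(base64.b64decode(token).decode())  # [8.5,"tt0943657",10001]
```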

Code:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import requests

url = "https://www.imdb.com/search/title/?title_type=tv_episode&num_votes=600,&sort=user_rating,desc"

driver = webdriver.Chrome("chromedriver.exe")
driver.get(url)

page = 1

series_name = []
episode_name = []

while page != 9951:
    url = f"https://www.imdb.com/search/title/?title_type=tv_episode&num_votes=600,&sort=user_rating,desc&start={page}&ref_=adv_nxt"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    episode_data = soup.findAll('div', attrs={'class': 'lister-item mode-advanced'})
    for store in episode_data:
        h3 = store.find('h3', attrs={'class': 'lister-item-header'})
        sName = h3.findAll('a')[0].text
        series_name.append(sName)
        eName = h3.findAll('a')[1].text
        episode_name.append(eName)

    time.sleep(2)

    page += 50

Note: Selenium is an option, but it is not needed to complete the task. Also, use find_all() in newer code instead of the old syntax findAll().

Use requests and say goodbye to tracking page numbers - instead, use the URL provided by the href attribute of the next element:

if (a := soup.select_one('a[href].next-page')):
    url = 'https://www.imdb.com'+a['href']
else:
    break
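A small variation on the snippet above: urllib.parse.urljoin from the standard library resolves the relative href against the current URL without hard-coding the domain. For illustration, with a hypothetical relative href of the kind the next link carries:

```python
from urllib.parse import urljoin

base = "https://www.imdb.com/search/title/?title_type=tv_episode&num_votes=600,&sort=user_rating,desc"
# Example root-relative href, as found in the next-page anchor
href = "/search/title/?title_type=tv_episode&after=WzguNSwidHQwOTQzNjU3IiwxMDAwMV0%3D"
print(urljoin(base, href))
```

Since the href is root-relative, urljoin keeps the scheme and host of the base URL and swaps in the new path and query string.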

Example

To show that it is working, the initial URL is set to &start=9951; you can remove this to start from the first page if you like:

import time
from bs4 import BeautifulSoup
import requests

url = "https://www.imdb.com/search/title/?title_type=tv_episode&num_votes=600,&sort=user_rating,desc&start=9951"

series_name = []
episode_name = []

while True:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    episode_data = soup.find_all('div', attrs={'class': 'lister-item mode-advanced'})
    for store in episode_data:
        h3 = store.find('h3', attrs={'class': 'lister-item-header'})
        sName = h3.find_all('a')[0].text
        series_name.append(sName)
        eName = h3.find_all('a')[1].text
        episode_name.append(eName)

    time.sleep(2)

    if (a := soup.select_one('a[href].next-page')):
        url = 'https://www.imdb.com'+a['href']
    else:
        break
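Once the loop ends, the two lists can be combined into a DataFrame (the question already imports pandas). A minimal sketch, using hypothetical sample values in place of the scraped lists:

```python
import pandas as pd

# Hypothetical sample values standing in for the scraped lists
series_name = ["Breaking Bad", "Chernobyl"]
episode_name = ["Ozymandias", "Vichnaya Pamyat"]

df = pd.DataFrame({"series": series_name, "episode": episode_name})
print(df.shape)  # (2, 2)
```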
