
Web scrape records from IMDB using selenium and clicking next page

I am trying to scrape the top episodes from IMDb. I first implemented it with Beautiful Soup to get the first 10,000 records, and that worked fine. However, after the first 10,000 records, the IMDb URL for the next page stops using a numeric start offset and switches to a seemingly random string of letters, as shown below.

I want to be able to navigate from this page: https://www.imdb.com/search/title/?title_type=tv_episode&num_votes=600,&sort=user_rating,desc&start=9951&ref_=adv_nxt

to the next page:

https://www.imdb.com/search/title/?title_type=tv_episode&num_votes=600,&sort=user_rating,desc&after=WzguNSwidHQwOTQzNjU3IiwxMDAwMV0%3D&ref_=adv_nxt

I then want to scrape all the records from the pages after that by clicking the Next button. I want to use Selenium, but I have not been able to get it to work. Any help is appreciated.

code:

import time
import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import numpy as np
import requests

url = "https://www.imdb.com/search/title/?title_type=tv_episode&num_votes=600,&sort=user_rating,desc"

driver = webdriver.Chrome("chromedriver.exe")
driver.get(url)

page = 1

series_name = []
episode_name = []

while page != 9951:
    url = f"https://www.imdb.com/search/title/?title_type=tv_episode&num_votes=600,&sort=user_rating,desc&start={page}&ref_=adv_nxt"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    episode_data = soup.findAll('div', attrs={'class': 'lister-item mode-advanced'})
    for store in episode_data:
        h3=store.find('h3', attrs={'class': 'lister-item-header'})
        sName =h3.findAll('a')[0].text
        series_name.append(sName)
        eName = h3.findAll('a')[1].text
        episode_name.append(eName)

    time.sleep(2)

    page += 50

Note: Selenium is an option, but it is not needed for this task. Also, prefer find_all() in new code over the older findAll() syntax.

Use requests and stop tracking page numbers. Instead, take the URL of the next page from the href attribute of the page's "next" link. The snippet below uses the walrus operator (:=), so it requires Python 3.8 or later.

# At the end of each loop iteration: follow the "next" link, or stop if there is none
if (a := soup.select_one('a[href].next-page')):
    url = 'https://www.imdb.com' + a['href']
else:
    break
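
As a side note on why following the href is enough: the after value in the question's second URL is not random. It is a percent-encoded Base64 string, and decoding it yields a small JSON array. The sketch below uses only the standard library; the helper name decode_after_token is made up for illustration, and reading the three values as a rating, a title ID and a position is a guess, not documented IMDb behavior. The token is best treated as an opaque cursor that the site hands you in the next link.

```python
import base64
import json
from urllib.parse import parse_qs, urlparse


def decode_after_token(url):
    """Decode the Base64 'after' cursor of a search URL (hypothetical helper)."""
    query = parse_qs(urlparse(url).query)
    token = query["after"][0]  # parse_qs has already turned %3D back into "="
    return json.loads(base64.b64decode(token))


# Second URL from the question
next_page = (
    "https://www.imdb.com/search/title/?title_type=tv_episode&num_votes=600,"
    "&sort=user_rating,desc&after=WzguNSwidHQwOTQzNjU3IiwxMDAwMV0%3D&ref_=adv_nxt"
)

print(decode_after_token(next_page))  # [8.5, 'tt0943657', 10001]
```

Since the cursor appears to depend on the last item of the previous page, it is simpler to read it from the next link than to try to build it yourself.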

Example

To show that this works, the initial URL is set to &start=9951. You can remove that parameter to start from the first page if you like:

import time

import requests
from bs4 import BeautifulSoup

# Start just before the point where the URL switches from &start= to &after=
url = "https://www.imdb.com/search/title/?title_type=tv_episode&num_votes=600,&sort=user_rating,desc&start=9951"

series_name = []
episode_name = []

while True:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Each search result is a div carrying both of these classes
    episode_data = soup.find_all('div', attrs={'class': 'lister-item mode-advanced'})
    for store in episode_data:
        h3 = store.find('h3', attrs={'class': 'lister-item-header'})
        links = h3.find_all('a')
        # The first link is the series, the second is the episode
        series_name.append(links[0].text)
        episode_name.append(links[1].text)

    # Pause between requests
    time.sleep(2)

    # Follow the "next" link until the last page, which has none
    if (a := soup.select_one('a[href].next-page')):
        url = 'https://www.imdb.com' + a['href']
    else:
        break
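
The string concatenation above assumes the next link's href is a path starting with /. A slightly more defensive sketch, if you want one, is to resolve the href against the URL of the current page with urllib.parse.urljoin, which also copes with absolute hrefs. The helper name resolve_next_url and the shortened sample URLs are made up for illustration.

```python
from urllib.parse import urljoin


def resolve_next_url(current_url, href):
    """Return the absolute URL for a next-page href, or None if there is no href."""
    if not href:
        return None
    # urljoin handles root-relative paths and already-absolute URLs alike
    return urljoin(current_url, href)


current = "https://www.imdb.com/search/title/?title_type=tv_episode&start=9951"

# Root-relative href, the shape the answer's concatenation assumes
print(resolve_next_url(current, "/search/title/?title_type=tv_episode&after=abc"))

# No next link: the caller can break out of the loop
print(resolve_next_url(current, None))
```

In the loop, this would replace the concatenation with url = resolve_next_url(url, a['href']).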

