
Why do I only get first page data when using selenium?


I want to crawl reviews from IMDb using Python. The page only displays 25 reviews until I click the "load more" button. I use the Python package Selenium to click the "load more" button automatically, which works. But why can't I get any data after "load more" — why do I just get the first 25 reviews repeatedly?

import requests
from bs4 import BeautifulSoup
from selenium import webdriver      
import time



seed = 'https://www.imdb.com/title/tt4209788/reviews'
movie_review = requests.get(seed)
PATIENCE_TIME = 60
LOAD_MORE_BUTTON_XPATH = '//*[@id="browse-itemsprimary"]/li[2]/button/span/span[2]' 

driver = webdriver.Chrome('D:/chromedriver_win32/chromedriver.exe')
driver.get(seed)

while True:
    try:
        loadMoreButton = driver.find_element_by_xpath("//button[@class='ipl-load-more__button']")

        review_soup = BeautifulSoup(movie_review.text, 'html.parser')
        review_containers = review_soup.find_all('div', class_ ='imdb-user-review')
        print('length: ',len(review_containers))
        for review_container in review_containers:
            review_title = review_container.find('a', class_ = 'title').text
            print(review_title)

        time.sleep(2)
        loadMoreButton.click()
        time.sleep(5)
    except Exception as e:
        print(e)
        break

print("Complete")

I want all the reviews, but now I can only get the first 25.

You have several issues in your script. A hardcoded wait is very inconsistent and certainly the worst option to use. The way you have written your scraping logic within the while True: loop slows the parsing process by collecting the same items over and over again. Moreover, every title produces a huge line gap in the output, which needs to be properly stripped. I've slightly changed your script to reflect the suggestions above.

Try this to get the required output:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

URL = "https://www.imdb.com/title/tt4209788/reviews"

driver = webdriver.Chrome()
wait = WebDriverWait(driver,10)

driver.get(URL)
soup = BeautifulSoup(driver.page_source, 'lxml')

while True:
    try:
        driver.find_element(By.CSS_SELECTOR, "button#load-more-trigger").click()
        wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR,".ipl-load-more__load-indicator")))
        soup = BeautifulSoup(driver.page_source, 'lxml')
    except Exception:
        break

for elem in soup.find_all(class_='imdb-user-review'):
    name = elem.find(class_='title').get_text(strip=True)
    print(name)

driver.quit()
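The `get_text(strip=True)` call in the loop above is what removes the big line gaps around each title. A minimal, self-contained illustration of the difference, using a made-up HTML snippet (not IMDb's real markup):

```python
from bs4 import BeautifulSoup

# A title wrapped in the kind of whitespace the raw page contains
html = '<a class="title">\n        A great movie\n    </a>'
soup = BeautifulSoup(html, "html.parser")

print(repr(soup.a.text))                  # whitespace kept
print(repr(soup.a.get_text(strip=True)))  # whitespace stripped: 'A great movie'
```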

Your code is fine. Great, even. But you never fetch the 'updated' HTML for the web page after hitting the 'Load More' button. That's why you see the same 25 reviews listed every time.

When you use Selenium to control the web browser, you are clicking the 'Load More' button. This creates an XHR request (more commonly called an AJAX request), which you can see in the 'Network' tab of your web browser's developer tools.

The bottom line is that JavaScript (which runs in the web browser) updates the page. But in your Python program, you only fetch the HTML for the page once, statically, using the Requests library.

seed = 'https://www.imdb.com/title/tt4209788/reviews'
movie_review = requests.get(seed) #<-- SEE HERE? This is always the same HTML. You fetched it once at the beginning.
PATIENCE_TIME = 60

To fix this problem, you need to use Selenium to get the innerHTML of the div box containing the reviews, then have BeautifulSoup parse that HTML again. We want to avoid picking up the entire page's HTML again and again, because it takes computational resources to parse the updated HTML over and over.

So, find the div on the page that contains the reviews, and parse it again with BeautifulSoup. Something like this should work:

while True:
    try:
        allReviewsDiv = driver.find_element_by_xpath("//div[@class='lister-list']")
        allReviewsHTML = allReviewsDiv.get_attribute('innerHTML')
        loadMoreButton = driver.find_element_by_xpath("//button[@class='ipl-load-more__button']")
        review_soup = BeautifulSoup(allReviewsHTML, 'html.parser')
        review_containers = review_soup.find_all('div', class_ ='imdb-user-review')
        print('length: ',len(review_containers))
        for review_container in review_containers:
            review_title = review_container.find('a', class_ = 'title').text
            print(review_title)

        time.sleep(2)
        loadMoreButton.click()
        time.sleep(5)
    except Exception as e:
        print(e)
        break
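To see why re-reading the page source after each click matters, here is a runnable sketch using a stand-in driver object (hypothetical, purely for illustration — a real run would use the Selenium driver): parsing a snapshot taken before the clicks only ever sees the first 25 reviews, while re-reading after the clicks finds them all.

```python
from bs4 import BeautifulSoup

class FakeDriver:
    """Stand-in for a Selenium driver: each click_load_more() adds 25 reviews."""
    def __init__(self, total=75):
        self.loaded = 25
        self.total = total

    @property
    def page_source(self):
        # The "page" grows as more reviews are loaded
        return "".join(
            '<div class="imdb-user-review"><a class="title">Review %d</a></div>' % i
            for i in range(self.loaded)
        )

    def click_load_more(self):
        if self.loaded >= self.total:
            raise RuntimeError("no more reviews")  # mimics the button disappearing
        self.loaded = min(self.loaded + 25, self.total)

driver = FakeDriver()
stale_html = driver.page_source  # snapshot BEFORE clicking, like requests.get(seed)

while True:
    try:
        driver.click_load_more()
    except RuntimeError:
        break

stale = BeautifulSoup(stale_html, "html.parser")
fresh = BeautifulSoup(driver.page_source, "html.parser")  # re-read AFTER clicking
print(len(stale.find_all("div", class_="imdb-user-review")))  # 25
print(len(fresh.find_all("div", class_="imdb-user-review")))  # 75
```

The same shape applies to the real script: take the HTML (or the reviews div's innerHTML) from the driver again after each click, not from the one-time Requests response.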
