
Why do I only get first page data when using selenium?


I want to crawl reviews from IMDb using Python. The page only displays 25 reviews until I click the "load more" button. I use the Python package Selenium to click the "load more" button automatically, which works. But why can't I get any data after "load more" — why do I just get the first 25 reviews repeatedly?

import requests
from bs4 import BeautifulSoup
from selenium import webdriver      
import time



seed = 'https://www.imdb.com/title/tt4209788/reviews'
movie_review = requests.get(seed)
PATIENCE_TIME = 60
LOAD_MORE_BUTTON_XPATH = '//*[@id="browse-itemsprimary"]/li[2]/button/span/span[2]' 

driver = webdriver.Chrome('D:/chromedriver_win32/chromedriver.exe')
driver.get(seed)

while True:
    try:
        loadMoreButton = driver.find_element_by_xpath("//button[@class='ipl-load-more__button']")

        review_soup = BeautifulSoup(movie_review.text, 'html.parser')
        review_containers = review_soup.find_all('div', class_ ='imdb-user-review')
        print('length: ',len(review_containers))
        for review_container in review_containers:
            review_title = review_container.find('a', class_ = 'title').text
            print(review_title)

        time.sleep(2)
        loadMoreButton.click()
        time.sleep(5)
    except Exception as e:
        print(e)
        break

print("Complete")

I want all the reviews, but now I can only get the first 25.

You have several issues in your script. A hardcoded wait is very inconsistent and certainly the worst option to use. The way you have written your scraping logic within the while True: loop slows the parsing process by collecting the same items over and over again. Moreover, every title produces a huge line gap in the output, which needs to be properly stripped. I've slightly changed your script to reflect the suggestions above.

Try this to get the required output:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

URL = "https://www.imdb.com/title/tt4209788/reviews"

driver = webdriver.Chrome()
wait = WebDriverWait(driver,10)

driver.get(URL)
soup = BeautifulSoup(driver.page_source, 'lxml')

while True:
    try:
        driver.find_element(By.CSS_SELECTOR, "button#load-more-trigger").click()
        wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR,".ipl-load-more__load-indicator")))
        soup = BeautifulSoup(driver.page_source, 'lxml')
    except Exception:
        break

for elem in soup.find_all(class_='imdb-user-review'):
    name = elem.find(class_='title').get_text(strip=True)
    print(name)

driver.quit()
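The `get_text(strip=True)` call in the loop above is what removes the big line gaps around each title. A minimal, self-contained illustration of the difference, using a made-up HTML snippet (not IMDb's real markup):

```python
from bs4 import BeautifulSoup

# A title wrapped in the kind of whitespace the raw page contains
html = '<a class="title">\n        A great movie\n    </a>'
soup = BeautifulSoup(html, "html.parser")

print(repr(soup.a.text))                  # whitespace kept
print(repr(soup.a.get_text(strip=True)))  # whitespace stripped: 'A great movie'
```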

Your code is fine. Great, even. But you never fetch the 'updated' HTML for the web page after hitting the 'Load More' button. That's why you see the same 25 reviews listed every time.

When you use Selenium to control the web browser, you are clicking the 'Load More' button. This creates an XHR request (more commonly called an AJAX request), which you can see in the 'Network' tab of your web browser's developer tools.

The bottom line is that JavaScript (which runs in the web browser) updates the page. But in your Python program, you only fetch the HTML for the page once, statically, using the Requests library.

seed = 'https://www.imdb.com/title/tt4209788/reviews'
movie_review = requests.get(seed) #<-- SEE HERE? This is always the same HTML. You fetched it once at the beginning.
PATIENCE_TIME = 60

To fix this problem, you need to use Selenium to get the innerHTML of the div box containing the reviews, then have BeautifulSoup parse that HTML again. We want to avoid picking up the entire page's HTML again and again, because it takes computational resources to parse the updated HTML over and over.

So, find the div on the page that contains the reviews, and parse it again with BeautifulSoup. Something like this should work:

while True:
    try:
        allReviewsDiv = driver.find_element_by_xpath("//div[@class='lister-list']")
        allReviewsHTML = allReviewsDiv.get_attribute('innerHTML')
        loadMoreButton = driver.find_element_by_xpath("//button[@class='ipl-load-more__button']")
        review_soup = BeautifulSoup(allReviewsHTML, 'html.parser')
        review_containers = review_soup.find_all('div', class_ ='imdb-user-review')
        print('length: ',len(review_containers))
        for review_container in review_containers:
            review_title = review_container.find('a', class_ = 'title').text
            print(review_title)

        time.sleep(2)
        loadMoreButton.click()
        time.sleep(5)
    except Exception as e:
        print(e)
        break
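To see why re-reading the page source after each click matters, here is a runnable sketch using a stand-in driver object (hypothetical, purely for illustration — a real run would use the Selenium driver): parsing a snapshot taken before the clicks only ever sees the first 25 reviews, while re-reading after the clicks finds them all.

```python
from bs4 import BeautifulSoup

class FakeDriver:
    """Stand-in for a Selenium driver: each click_load_more() adds 25 reviews."""
    def __init__(self, total=75):
        self.loaded = 25
        self.total = total

    @property
    def page_source(self):
        # The "page" grows as more reviews are loaded
        return "".join(
            '<div class="imdb-user-review"><a class="title">Review %d</a></div>' % i
            for i in range(self.loaded)
        )

    def click_load_more(self):
        if self.loaded >= self.total:
            raise RuntimeError("no more reviews")  # mimics the button disappearing
        self.loaded = min(self.loaded + 25, self.total)

driver = FakeDriver()
stale_html = driver.page_source  # snapshot BEFORE clicking, like requests.get(seed)

while True:
    try:
        driver.click_load_more()
    except RuntimeError:
        break

stale = BeautifulSoup(stale_html, "html.parser")
fresh = BeautifulSoup(driver.page_source, "html.parser")  # re-read AFTER clicking
print(len(stale.find_all("div", class_="imdb-user-review")))  # 25
print(len(fresh.find_all("div", class_="imdb-user-review")))  # 75
```

The same shape applies to the real script: take the HTML (or the reviews div's innerHTML) from the driver again after each click, not from the one-time Requests response.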
