
Failing to scrape dynamic webpage using selenium in python

I am trying to scrape all 5000 companies from this page: https://www.inc.com/profile/onetrust. The page is dynamic: more companies are loaded as I scroll down, and the URL changes while scrolling. However, I can only scrape 5 companies, so how can I scrape all 5000? I tried Selenium but it didn't work. Note: I want to scrape all of each company's information, but for now I have selected just two fields.

import time
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

my_url = 'https://www.inc.com/profile/onetrust'

options = Options()
driver = webdriver.Chrome(options=options)  # chrome_options= is deprecated
driver.get(my_url)
time.sleep(3)
page = driver.page_source  # rendered HTML -- but never parsed below
driver.quit()

# This fetches the raw, unrendered HTML again instead of reusing `page`,
# which is why only the first handful of companies shows up.
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")

containers = page_soup.find_all("div", class_="sc-prOVx cTseUq company-profile")

for container in containers:
    rank = container.h2.get_text()
    company_name_1 = container.find_all("h2", class_="sc-AxgMl LXebc h2")
    Company_name = company_name_1[0].get_text()

    print("rank :" + rank)
    print("Company_name :" + Company_name)

Updated code, but the page does not scroll at all. I corrected some errors in the BeautifulSoup code.

import time
from bs4 import BeautifulSoup as soup
from selenium import webdriver

my_url = 'https://www.inc.com/profile/onetrust'

driver = webdriver.Chrome()
driver.get(my_url)


# NOTE: this is written like a class method (it takes `self` and uses
# `self.driver`) but sits at module level and is never called, which is
# why the page never scrolls.
def scroll_down(self):
    """A method for scrolling the page."""

    # Get scroll height.
    last_height = self.driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to the bottom.
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load the page.
        time.sleep(2)

        # Calculate new scroll height and compare with last scroll height.
        new_height = self.driver.execute_script("return document.body.scrollHeight")

        if new_height == last_height:
            break

        last_height = new_height


page_soup = soup(driver.page_source, "html.parser")

containers = page_soup.find_all("div", class_="sc-prOVx cTseUq company-profile")

for container in containers:
    rank = container.h2.get_text()
    company_name_1 = container.find_all("h2", class_="sc-AxgMl LXebc h2")
    Company_name = company_name_1[0].get_text()

    print("rank :" + rank)
    print("Company_name :" + Company_name)

Thank you for reading!

Try the following approach using Python requests - it is simple, direct, reliable, fast, and needs less code. I took the API URL from the website itself after inspecting the Network section in the Google Chrome browser.

What the script below is doing:

  1. First it will take the API URL and perform a GET request.

  2. After fetching the data, the script will parse the JSON data using the json.loads function.

  3. Finally, it will iterate over the full list of companies and print their fields, e.g. rank, company name, social media links, CEO name, and so on.

import json
import requests
from urllib3.exceptions import InsecureRequestWarning

requests.packages.urllib3.disable_warnings(InsecureRequestWarning)


def scrap_inc_5000():
    URL = 'https://www.inc.com/rest/companyprofile/nuleaf-naturals/withlist'
    response = requests.get(URL, verify=False)
    result = json.loads(response.text)  # Parse result using json.loads
    extracted_data = result['fullList']['listCompanies']
    for data in extracted_data:
        print('-' * 100)
        print('Rank : ', data['rank'])
        print('Company : ', data['company'])
        print('Icon : ', data['icon'])
        print('CEO Name : ', data['ifc_ceo_name'])
        print('Facebook Address : ', data['ifc_facebook_address'])
        print('File Location : ', data['ifc_filelocation'])
        print('Linkedin Address : ', data['ifc_linkedin_address'])
        print('Twitter Handle : ', data['ifc_twitter_handle'])
        print('Secondary Link : ', data['secondary_link'])
        print('-' * 100)


scrap_inc_5000()
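
Because this calls the site's own JSON endpoint, the whole list comes back in one response, so no scrolling or browser automation is needed. One design note: verify=False turns off TLS certificate verification, which is why the script silences InsecureRequestWarning at the top; if certificate verification works in your environment, you can drop both and simply call requests.get(URL).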
