
Failing to scrape dynamic webpage using selenium in python

I am trying to scrape all 5000 companies from this page: https://www.inc.com/profile/onetrust. The page is dynamic: more companies are loaded as I scroll down, and the URL changes while scrolling. However, I can only scrape 5 companies, so how can I scrape all 5000? I tried Selenium but it didn't work. Note: I want to scrape all of each company's information, but for now I have selected just two fields.

import time
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

my_url = 'https://www.inc.com/profile/onetrust'

options = Options()
driver = webdriver.Chrome(options=options)  # chrome_options= is deprecated
driver.get(my_url)
time.sleep(3)
page = driver.page_source  # rendered HTML -- but never parsed below
driver.quit()

# This fetches the raw, unrendered HTML again instead of reusing `page`,
# which is why only the first handful of companies shows up.
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")

containers = page_soup.find_all("div", class_="sc-prOVx cTseUq company-profile")

for container in containers:
    rank = container.h2.get_text()
    company_name_1 = container.find_all("h2", class_="sc-AxgMl LXebc h2")
    Company_name = company_name_1[0].get_text()

    print("rank :" + rank)
    print("Company_name :" + Company_name)

Updated code, but the page does not scroll at all. I corrected some errors in the BeautifulSoup code.

import time
from bs4 import BeautifulSoup as soup
from selenium import webdriver

my_url = 'https://www.inc.com/profile/onetrust'

driver = webdriver.Chrome()
driver.get(my_url)


# NOTE: this is written like a class method (it takes `self` and uses
# `self.driver`) but sits at module level and is never called, which is
# why the page never scrolls.
def scroll_down(self):
    """A method for scrolling the page."""

    # Get scroll height.
    last_height = self.driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to the bottom.
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load the page.
        time.sleep(2)

        # Calculate new scroll height and compare with last scroll height.
        new_height = self.driver.execute_script("return document.body.scrollHeight")

        if new_height == last_height:
            break

        last_height = new_height


page_soup = soup(driver.page_source, "html.parser")

containers = page_soup.find_all("div", class_="sc-prOVx cTseUq company-profile")

for container in containers:
    rank = container.h2.get_text()
    company_name_1 = container.find_all("h2", class_="sc-AxgMl LXebc h2")
    Company_name = company_name_1[0].get_text()

    print("rank :" + rank)
    print("Company_name :" + Company_name)

Thank you for reading!

Try the following approach using Python requests - it is simple, direct, reliable, fast, and needs less code. I took the API URL from the website itself after inspecting the Network section in the Google Chrome browser.

What the script below is doing:

  1. First it will take the API URL and perform a GET request.

  2. After fetching the data, the script will parse the JSON data using the json.loads function.

  3. Finally, it will iterate over the full list of companies and print their fields, e.g. rank, company name, social media links, CEO name, and so on.

import json
import requests
from urllib3.exceptions import InsecureRequestWarning

requests.packages.urllib3.disable_warnings(InsecureRequestWarning)


def scrap_inc_5000():
    URL = 'https://www.inc.com/rest/companyprofile/nuleaf-naturals/withlist'
    response = requests.get(URL, verify=False)
    result = json.loads(response.text)  # Parse result using json.loads
    extracted_data = result['fullList']['listCompanies']
    for data in extracted_data:
        print('-' * 100)
        print('Rank : ', data['rank'])
        print('Company : ', data['company'])
        print('Icon : ', data['icon'])
        print('CEO Name : ', data['ifc_ceo_name'])
        print('Facebook Address : ', data['ifc_facebook_address'])
        print('File Location : ', data['ifc_filelocation'])
        print('Linkedin Address : ', data['ifc_linkedin_address'])
        print('Twitter Handle : ', data['ifc_twitter_handle'])
        print('Secondary Link : ', data['secondary_link'])
        print('-' * 100)


scrap_inc_5000()
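
Because this calls the site's own JSON endpoint, the whole list comes back in one response, so no scrolling or browser automation is needed. One design note: verify=False turns off TLS certificate verification, which is why the script silences InsecureRequestWarning at the top; if certificate verification works in your environment, you can drop both and simply call requests.get(URL).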
