Scraping a large number of pages (an entire website) with Python and Selenium

Question

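I am trying to scrape an entire website with Python and Selenium. I supply the site URL (e.g. https://example.com); the script scrapes the home page, then collects the links found there and recursively does the same for every subpage. I am running into serious performance problems: the number of pages can exceed 500, and Selenium stops partway through the run.

FYI, I am using Selenium because I need to scan JS-based pages (the content is generated by JavaScript). Are there any best practices for scraping the content of a whole website more efficiently? Or are there other, similar libraries?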
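
The crawl described above (start at one URL, follow the links on each page, repeat for every subpage) can be sketched as an iterative breadth-first traversal with a set of already seen URLs, so that no page is loaded twice and the run is capped at a fixed number of pages. This is only a minimal sketch under stated assumptions: `crawl`, `fetch_links` and `max_pages` are hypothetical names, and `fetch_links` stands in for whatever Selenium code loads a page and returns the `href` values found on it.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def crawl(start_url, fetch_links, max_pages=500):
    """Visit pages breadth-first, staying on the start URL's host.

    fetch_links(url) is a hypothetical callback that loads one page
    (for example with Selenium) and returns the href values found on it.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}   # every URL that was ever queued
    visited = []         # URLs in the order they were loaded

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in fetch_links(url):
            # Resolve relative links and drop "#fragment" suffixes.
            link, _ = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Using a queue instead of recursive calls keeps the call stack flat regardless of how deep the site is, and the `seen` set prevents the same page from being fetched again when several pages link to it.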

Answer 1

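Here is a solution that uses the ThreadPoolExecutor class from the concurrent.futures module.

The code below runs several Selenium scraping jobs in parallel, each with its own driver: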


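import concurrent.futures
import time

from selenium import webdriver

# CHROME_DRIVER_PATH, chrome_options and product_links (the full list of
# URLs to scrape) are assumed to be defined elsewhere.


def product_parser(chunk):
    # Each call receives one chunk of links and starts its own driver.
    # Note: newer Selenium 4 releases drop executable_path in favor of
    # service=Service(CHROME_DRIVER_PATH).
    driver = webdriver.Chrome(
        executable_path=CHROME_DRIVER_PATH,
        options=chrome_options,
    )
    # your selenium code
    # ........
    # ........
    driver.quit()  # close the browser once the chunk is done


# Split product_links into lists of at most 1000 elements each
# (10 lists for up to 10,000 links).
list_chunked_product_links = [
    product_links[x:x + 1000] for x in range(0, len(product_links), 1000)
]


# run up to 10 drivers concurrently, one per chunk
def run_all(chunked_links):
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        # list() consumes the results so worker exceptions are re-raised here
        list(executor.map(product_parser, chunked_links))


def main_func():
    start_time = time.time()
    t_start = time.localtime()
    run_all(list_chunked_product_links)
    duration = time.time() - start_time
    print(f"Start time: {time.strftime('%H:%M:%S', t_start)}\nEnd time: {time.strftime('%H:%M:%S')}")
    print(f"Scraped {len(product_links)} links in {duration:.1f} seconds")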
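
The chunking and thread-pool steps above can be tried without a browser. The following is a rough sketch, not the answer's exact code: `chunk`, `run_chunks` and `fake_parser` are hypothetical names, and `fake_parser` merely stands in for `product_parser` so the example runs without Selenium installed.

```python
import concurrent.futures


def chunk(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_chunks(worker, chunks, max_workers=10):
    """Run worker(chunk) for every chunk in a thread pool and collect results."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # list() consumes the lazy iterator, so an exception raised inside
        # a worker is re-raised here instead of passing unnoticed.
        return list(executor.map(worker, chunks))


def fake_parser(links):
    # Hypothetical stand-in for product_parser: no browser is started,
    # it only reports how many links the chunk contained.
    return len(links)


links = [f"https://example.com/page/{n}" for n in range(25)]
counts = run_chunks(fake_parser, chunk(links, 10))
print(counts)  # [10, 10, 5]
```

Because `executor.map` returns results in the order of its inputs, the counts line up with the chunks even though the workers run concurrently.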
