
Python Multiprocessing gets stuck with selenium

So I have code that spins up 4 selenium chrome drivers and scrapes data from an element on the web pages. The code can be simplified to something like this:

import json
import math
import multiprocessing as mp
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

class scraper():
    def __init__(self,list_of_urls, process_num):
        self.urls = list_of_urls
        self.process_num = process_num

    def scrape_urls(self):
        driver = webdriver.Chrome(driver_dir)
        data = []
        for url in self.urls:
            driver.get(url)
            element = WebDriverWait(driver, timeout=7).until(lambda d: d.find_element(by=By.CLASS_NAME, value="InterestingData"))
            data.append(element.text)
            print("Scraper # ", self.process_num," got data from: ",url)
        return data

if __name__ == '__main__':
    with open('array_of_urls', 'r') as infile:
        urls = json.load(infile)
        number_of_processes=4
        length_of_urls = len(urls)
        partition_into = math.ceil(length_of_urls/number_of_processes)
        scrapers = []
        start = 0
        end = start + partition_into
        for num in range(number_of_processes):
            new_scraper = scraper(urls[start:end],num)
            scrapers.append(new_scraper)
            start = end
            end = start + partition_into
            if end > length_of_urls:
                end = length_of_urls-1

        with mp.Pool(processes=number_of_processes) as pool:
            result_array = []
            for num in range(number_of_processes):
                result_array.append(pool.apply_async(scrapers[num].scrape_urls))
            pool.close()
            pool.join()

The problem I am running into is that after 5-10 minutes one of the scrapers just stops; the only thing that wakes it back up is manually refreshing the page in the browser. If I leave it for an hour or so, 3 of the 4 stop and only one is running. They don't error out or print anything, they just stop running. I've tried it on 2 different laptops and they both have the same issue. I've also tried doing this with 4 different mp.Process() running scrape_urls and that does the same thing. Has anyone else run into this issue, or am I doing something wrong here?

For one thing, Selenium is already creating a process, so it is far better to use multithreading instead of multiprocessing, since each thread will be starting a process anyway. Also, in scrape_urls, after your driver = webdriver.Chrome(driver_dir) statement, the rest of the function should be enclosed in a try/finally statement where the finally block contains driver.quit(), to ensure that the driver process is terminated whether or not there is an exception. Right now you are leaving all the driver processes running.
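A minimal sketch of that change applied to the question's scrape_urls (assuming driver_dir is defined elsewhere, as in the original code):

    def scrape_urls(self):
        driver = webdriver.Chrome(driver_dir)
        try:
            data = []
            for url in self.urls:
                driver.get(url)
                element = WebDriverWait(driver, timeout=7).until(lambda d: d.find_element(by=By.CLASS_NAME, value="InterestingData"))
                data.append(element.text)
                print("Scraper #", self.process_num, "got data from:", url)
            return data
        finally:
            driver.quit()  # terminate the driver process whether or not an exception occurred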

You might also consider using the following technique, which creates a thread pool of size 4 (or fewer, depending on how many URLs there are to process), where each thread in the pool automatically reuses the driver that has been allocated to it, kept in thread-local storage. You might wish to change the options used to create the driver (currently "headless" mode):

import json
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from multiprocessing.pool import ThreadPool
import threading
import gc


threadLocal = threading.local()

class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        options.add_experimental_option('excludeSwitches', ['enable-logging'])
        self.driver = webdriver.Chrome(options=options)

    def __del__(self):
        self.driver.quit() # clean up driver when we are cleaned up
        print('The driver has been "quitted".')

    @classmethod
    def create_driver(cls):
        the_driver = getattr(threadLocal, 'the_driver', None)
        if the_driver is None:
            print('Creating new driver.')
            the_driver = cls()
            threadLocal.the_driver = the_driver
        driver = the_driver.driver
        the_driver = None
        return driver

def scraper(url):
    """
    This now scrapes a single URL.
    """
    driver = Driver.create_driver()
    driver.get(url)
    element = WebDriverWait(driver, timeout=7).until(lambda d: d.find_element(by=By.CLASS_NAME, value="InterestingData"))
    print("got data from: ", url)
    return element.text

with open('array_of_urls', 'r') as infile:
    urls = json.load(infile)
number_of_processes = min(4, len(urls))
with ThreadPool(processes=number_of_processes) as pool:
    result_array = pool.map(scraper, urls)

    # Must ensure drivers are quitted before threads are destroyed:
    del threadLocal
    # This should ensure that the __del__ method is run on class Driver:
    gc.collect()

    pool.close()
    pool.join()
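Note the ordering inside the with block: del threadLocal and gc.collect() run before pool.close()/pool.join(), so the thread-local Driver instances become unreachable and are collected while the pool's worker threads still exist. That is what allows each Driver.__del__ to run and call driver.quit() on the driver owned by that thread before the pool is torn down.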
