Python multiprocessing a class

I am trying to run Selenium in multiple processes, where each process is spawned with its own Selenium driver and session (each process is logged in with a different account).

I have a list of URLs to visit. Each URL needs to be visited once by one of the accounts (no matter which one).

To avoid some nasty global variable management, I tried to initialize each process with a class object using the initializer argument of multiprocessing.Pool.

After that, I can't figure out how to distribute tasks to the processes, given that the function used by each process has to be a method of the class.

Here is a simplified version of what I'm trying to do:

from selenium import webdriver
import multiprocessing

account =  [{'account':1},{'account':2}]

class Collector():

    def __init__(self, account):

        self.account = account
        self.driver = webdriver.Chrome()

    def parse(self, item):

        self.driver.get(f"https://books.toscrape.com{item}")

if __name__ == '__main__':
    
    processes = 1
    pool = multiprocessing.Pool(processes,initializer=Collector,initargs=[account.pop()])

    items = ['/catalogue/a-light-in-the-attic_1000/index.html','/catalogue/tipping-the-velvet_999/index.html']
    
    pool.map(parse(), items, chunksize = 1)

    pool.close()
    pool.join() 

The problem is on the pool.map line: there is no reference to the instantiated object inside the subprocess. Another approach would be to distribute the URLs and parse them during __init__, but this would be very nasty.

Is there a way to achieve this?

I'm not entirely certain if this solves your problem.

If you have one account per URL then you could do this:

from selenium import webdriver
from multiprocessing import Pool

items = ['/catalogue/a-light-in-the-attic_1000/index.html',
         '/catalogue/tipping-the-velvet_999/index.html']
accounts = [{'account': 1}, {'account': 2}]
baseurl = 'https://books.toscrape.com'

def process(i, a):
    print(f'Processing account {a}')
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')

    with webdriver.Chrome(options=options) as driver:
        driver.get(f'{baseurl}{i}')


def main():
    with Pool() as pool:
        pool.starmap(process, zip(items, accounts))


if __name__ == '__main__':
    main()

If the number of accounts doesn't match the number of URLs: you have said that it doesn't matter which account GETs which URL. So, in that case, you could just select the account to use at random with random.choice().
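As a rough sketch of that suggestion: zip() stops at the shorter of its inputs, so with more URLs than accounts some URLs would be skipped unless each URL is given an account explicitly. The helper names pair_randomly and pair_round_robin are made up for this illustration, and the third path is a hypothetical placeholder. The round-robin variant is an alternative to random.choice(), not something from the original answer.

```python
import random
from itertools import cycle

items = ['/catalogue/a-light-in-the-attic_1000/index.html',
         '/catalogue/tipping-the-velvet_999/index.html',
         '/catalogue/hypothetical-extra-item/index.html']  # placeholder path
accounts = [{'account': 1}, {'account': 2}]


def pair_randomly(items, accounts):
    """Pair every item with an account picked at random."""
    return [(item, random.choice(accounts)) for item in items]


def pair_round_robin(items, accounts):
    """Pair every item with the accounts in rotation to spread the load."""
    return list(zip(items, cycle(accounts)))


random_pairs = pair_randomly(items, accounts)
rotating_pairs = pair_round_robin(items, accounts)
```

Either list could then be passed to pool.starmap(process, pairs) in place of zip(items, accounts).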

Since Chrome starts its own process, there is really no need to use multiprocessing when multithreading will suffice. I would like to offer a more general solution for the case where you have N URLs to retrieve, where N might be very large, but you want to limit the number of concurrent Selenium sessions to MAX_DRIVERS, where MAX_DRIVERS is a significantly smaller number. You therefore want to create only one driver session for each thread in the pool and reuse it as necessary. The problem then becomes calling quit on each driver when you are finished with the pool, so that you don't leave any Selenium processes running.

The following code uses thread-local storage, which is unique to each thread, to store the current driver instance for each pool thread, and uses a class destructor to call the driver's quit method when the class instance is destroyed:

from selenium import webdriver
from multiprocessing.pool import ThreadPool
import threading

items = ['/catalogue/a-light-in-the-attic_1000/index.html',
         '/catalogue/tipping-the-velvet_999/index.html']
accounts = [{'account': 1}, {'account': 2}]
baseurl = 'https://books.toscrape.com'

threadLocal = threading.local()

class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        options.add_experimental_option('excludeSwitches', ['enable-logging'])
        self.driver = webdriver.Chrome(options=options)

    def __del__(self):
        self.driver.quit() # clean up driver when we are cleaned up
        print('The driver has been "quitted".')

    @classmethod
    def create_driver(cls):
        the_driver = getattr(threadLocal, 'the_driver', None)
        if the_driver is None:
            the_driver = cls()
            threadLocal.the_driver = the_driver
        return the_driver.driver


def process(i, a):
    print(f'Processing account {a}')
    driver = Driver.create_driver()
    driver.get(f'{baseurl}{i}')


def main():
    global threadLocal

    # We never want to create more than MAX_DRIVERS concurrent sessions
    MAX_DRIVERS = 8 # Rather arbitrary
    POOL_SIZE = min(len(accounts), MAX_DRIVERS)
    with ThreadPool(POOL_SIZE) as pool:
        pool.starmap(process, zip(items, accounts))
    # ensure the drivers are "quitted":
    del threadLocal
    import gc
    gc.collect() # a little extra insurance

if __name__ == '__main__':
    main()
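The code above passes an account along with every URL. If each pool thread should instead stay bound to one account for its whole lifetime, as the question describes, one possible variation is to give each thread its account through the pool's initializer. The sketch below is an assumption-laden illustration, not the answer's method: FakeSession is a hypothetical stand-in for a logged-in webdriver.Chrome so that it runs without a browser, the item paths are placeholders, and the names init_worker and sessions are made up here. It also quits the sessions explicitly from a list instead of relying on __del__.

```python
import threading
from multiprocessing.pool import ThreadPool
from queue import Queue

items = [f'/catalogue/page-{n}.html' for n in range(1, 7)]  # placeholder paths
accounts = [{'account': 1}, {'account': 2}]

thread_local = threading.local()
sessions = []                      # every session created, for explicit cleanup
sessions_lock = threading.Lock()


class FakeSession:
    """Hypothetical stand-in for a logged-in Selenium driver."""

    def __init__(self, account):
        self.account = account
        self.visited = []
        self.closed = False

    def get(self, url):
        self.visited.append(url)

    def quit(self):
        self.closed = True


def init_worker(account_queue):
    """Runs once in each pool thread: claim one account, open one session."""
    session = FakeSession(account_queue.get())
    thread_local.session = session
    with sessions_lock:
        sessions.append(session)


def process(item):
    """Visit one item with the session that belongs to the current thread."""
    session = thread_local.session
    session.get(item)
    return item, session.account['account']


def main():
    account_queue = Queue()
    for account in accounts:
        account_queue.put(account)
    # One thread per account, so every initializer call finds an account.
    pool = ThreadPool(len(accounts), initializer=init_worker,
                      initargs=(account_queue,))
    try:
        return pool.map(process, items, chunksize=1)
    finally:
        pool.close()
        pool.join()                # wait for every worker thread to exit
        for session in sessions:
            session.quit()         # explicit cleanup instead of __del__


results = main()
```

Each URL is handed to whichever thread is free, so it is visited exactly once by one of the accounts. With a real driver, FakeSession would be replaced by code that creates the driver and logs in.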
