
How can I run multiple webdriver processes concurrently in Selenium using Python?

I have a list of thousands of URLs. I want to use Python/Selenium to:

  1. load each URL,
  2. select one element,
  3. close the page.

To make it run faster, I want to run many of these processes in parallel, but I can only work out how to do it one URL at a time.

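from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

url_list = [
    'https://www.instagram.com/p/Bj7NmpqBuSw/?tagged=style',
    'https://www.instagram.com/p/Bj7Nic3Au85/?tagged=style',
]

for url in url_list:
    driver.get(url)
    # Current Selenium 4 releases use find_elements(By.CLASS_NAME, ...)
    # in place of the older find_elements_by_class_name(...)
    driver.find_elements(By.CLASS_NAME, "class-name-for-profile-link")

# Quit once, after the loop: calling driver.close() inside the loop closes
# the only window, so the next driver.get() fails.
driver.quit()

I tried using lots of browser tabs: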

driver.switch_to.window(driver.window_handles[1])

but the handles are a bit tricky to manage.

How can I run this process in parallel?

tl;dr I created this gist to give an easy example of how to run simple Selenium tasks in parallel. You can adapt it to your own purposes.

The issue with parallelising Selenium scripts is that Selenium workers are themselves processes. The script in the gist uses two FIFO queues: one stores the IDs of idle Selenium workers, and the other stores the data to pass to the workers. Background master threads listen on both queues and assign incoming data to idle workers, taking a worker's ID off the worker queue while that worker does its work.

All you need to do to adapt the code to your purposes is change the code in the function selenium_task. Hope this helps!
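
For readers without the gist to hand, the two-queue idea described above might look roughly like the sketch below. This is not the gist's code: run_with_worker_pool, fetch_title, make_worker and close_worker are hypothetical names chosen for illustration, and the sketch uses a single master thread and assumes num_workers is at least 1. The workers are passed in as a factory, so nothing here depends on a particular browser.

```python
import queue
import threading

_STOP = object()  # sentinel that tells the master thread to exit


def fetch_title(driver, url):
    """Hypothetical per-URL task: load the page and return its title."""
    driver.get(url)
    return driver.title


def run_with_worker_pool(items, task, make_worker, num_workers=4, close_worker=None):
    """Run task(worker, item) for every item, one item per idle worker."""
    workers = {worker_id: make_worker() for worker_id in range(num_workers)}
    idle_ids = queue.Queue()   # FIFO queue of IDs of idle workers
    pending = queue.Queue()    # FIFO queue of data waiting for a worker
    results = []
    results_lock = threading.Lock()

    for worker_id in workers:
        idle_ids.put(worker_id)
    for item in items:
        pending.put(item)

    def run_task(worker_id, item):
        try:
            outcome = task(workers[worker_id], item)
            with results_lock:
                results.append((item, outcome))
        finally:
            idle_ids.put(worker_id)   # the worker is idle again
            pending.task_done()

    def master():
        while True:
            item = pending.get()
            if item is _STOP:
                pending.task_done()
                return
            # Blocks until a worker is idle; its ID stays off the idle
            # queue for as long as that worker is busy with the item.
            worker_id = idle_ids.get()
            threading.Thread(target=run_task, args=(worker_id, item)).start()

    master_thread = threading.Thread(target=master, daemon=True)
    master_thread.start()
    pending.join()       # wait until every item has been processed
    pending.put(_STOP)
    master_thread.join()
    if close_worker is not None:
        for worker in workers.values():
            close_worker(worker)
    return results
```

With Selenium installed, one way to use it would be run_with_worker_pool(url_list, fetch_title, webdriver.Chrome, num_workers=4, close_worker=lambda d: d.quit()). Results come back in completion order, not input order, and an item whose task raises is simply missing from the results.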

You can use joblib to run the loop in parallel. Sample usage:

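from joblib import Parallel, delayed
from selenium import webdriver

def do_stuff(url):
    # You can use any driver; note that Selenium 4 no longer ships PhantomJS support
    phantom = webdriver.PhantomJS('/path/to/phantomjs')
    try:
        phantom.get(url)
        # do your stuff
    finally:
        phantom.quit()  # quit() also ends the browser process

# urls is your list of URLs; n_jobs=-1 uses all available CPU cores
Parallel(n_jobs=-1)(delayed(do_stuff)(url) for url in urls)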
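
The sample above starts a new browser for every URL, which can get expensive with thousands of URLs. One possible variation is to split the list into batches and let each job reuse a single driver for its whole batch. The sketch below shows that idea with the standard library's concurrent.futures in place of joblib, so it is a swapped-in technique and not the answer's own code; split_into_batches, process_batch, process_all, make_driver and handle_page are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(urls, n_batches):
    """Split urls into at most n_batches non-empty, roughly equal batches."""
    if not urls:
        return []
    n_batches = max(1, min(n_batches, len(urls)))
    return [urls[i::n_batches] for i in range(n_batches)]


def process_batch(batch, make_driver, handle_page):
    """Open one driver, visit every URL in the batch, then quit the driver."""
    driver = make_driver()
    try:
        results = []
        for url in batch:
            driver.get(url)
            results.append((url, handle_page(driver)))
        return results
    finally:
        driver.quit()


def process_all(urls, make_driver, handle_page, n_workers=4):
    """Process all URLs using one thread and one driver per batch."""
    batches = split_into_batches(urls, n_workers)
    if not batches:
        return []
    with ThreadPoolExecutor(max_workers=len(batches)) as pool:
        per_batch = pool.map(
            lambda batch: process_batch(batch, make_driver, handle_page),
            batches,
        )
        return [pair for batch_result in per_batch for pair in batch_result]
```

With Selenium installed, a call such as process_all(urls, webdriver.Chrome, lambda d: d.title, n_workers=4) would open four browsers in total instead of one per URL. Threads are usually enough here because most of the time is spent waiting on the browser.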
