Python parallel execution with Selenium

I'm confused about parallel execution in Python using Selenium. There seem to be a few ways to go about it, but some seem out of date.

  1. There's a Python module called python-wd-parallel which seems to have some functionality to do this, but it's from 2013. Is it still useful now? I also found this example.

  2. There's concurrent.futures, which seems a lot newer but not so easy to implement. Does anyone have a working example of parallel execution with Selenium?

  3. There's also using just threads and executors to get the job done, but I feel this will be slower, because it doesn't use all the cores and still runs serially.

What is the latest way to do parallel execution using Selenium?

Use joblib's Parallel to do that; it's a great library for parallel execution.

Let's say we have a list of URLs named urls and we want to take a screenshot of each one in parallel.

First, let's import the necessary libraries:

from selenium import webdriver
from joblib import Parallel, delayed

Now let's define a function that takes a screenshot and returns it as base64:

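def take_screenshot(url):
    # Note: webdriver.PhantomJS was removed in Selenium 4, so this needs Selenium 3 or earlier
    phantom = webdriver.PhantomJS('/path/to/phantomjs')
    try:
        phantom.get(url)
        # Base64-encoded PNG of the current page
        return phantom.get_screenshot_as_base64()
    finally:
        # quit() ends the browser process even if get() raised
        phantom.quit()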

Now, to execute that in parallel, you would do:

screenshots = Parallel(n_jobs=-1)(delayed(take_screenshot)(url) for url in urls)

When this line finishes executing, screenshots will hold the results from all of the processes that ran, in the same order as urls.
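As a follow-up, here is a minimal sketch of turning those base64 strings into PNG files. The helper name save_screenshots and the screenshot_N.png naming are assumptions made for illustration, not part of joblib or Selenium. It relies only on get_screenshot_as_base64 returning a base64-encoded PNG.

```python
import base64
from pathlib import Path


def save_screenshots(screenshots, out_dir):
    """Decode base64 screenshot strings and write them as PNG files.

    Hypothetical helper: the name and the file naming scheme are illustrative.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, encoded in enumerate(screenshots):
        target = out_path / f"screenshot_{index}.png"
        # b64decode turns the base64 text back into the raw image bytes
        target.write_bytes(base64.b64decode(encoded))
        saved.append(target)
    return saved
```

With the list produced by the Parallel call, this would be used as save_screenshots(screenshots, 'shots').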

Explanation about Parallel

  • Parallel(n_jobs=-1) means use all available CPU cores.
  • delayed(function)(input) is joblib's way of capturing the function and its arguments for each call you want to run in parallel.

More information can be found in the joblib docs.

  1. Python Parallel Wd seems to be dead, judging by its GitHub (last commit 9 years ago). It also implements an obsolete protocol for Selenium. Finally, the code is proprietary to Sauce Labs.

Generally it's better to use SeleniumBase, a Python test framework based on Selenium and pytest. It is very complete, supporting performance boosts, parallel threads and much more. If that's not your case, keep reading.

Selenium Performance Boost (concurrent.futures)

Short Answer

  • Both threads and processes will give you a considerable speed-up on your Selenium code.

Short examples are given below. The Selenium work is done by the selenium_title function, which returns the page title. The examples don't deal with exceptions raised during each thread/process execution. For that, see Long Answer - Dealing with exceptions.

  1. Pool of thread workers: concurrent.futures.ThreadPoolExecutor.
from selenium import webdriver  
from concurrent import futures

def selenium_title(url):  
  wdriver = webdriver.Chrome() # chrome webdriver
  wdriver.get(url)  
  title = wdriver.title  
  wdriver.quit()
  return title

links = ["https://www.amazon.com", "https://www.google.com"]

with futures.ThreadPoolExecutor() as executor: # default/optimized number of threads
  titles = list(executor.map(selenium_title, links))
  2. Pool of process workers: concurrent.futures.ProcessPoolExecutor. Just replace ThreadPoolExecutor with ProcessPoolExecutor in the code above. They are both derived from the Executor base class. You must also protect the main entry point, as below.
if __name__ == '__main__':
  with futures.ProcessPoolExecutor() as executor: # default/optimized number of processes
    titles = list(executor.map(selenium_title, links))

Long Answer

Why do threads work despite the Python GIL?

Even though Python limits threads because of the GIL, and even though threads will be context switched, a performance gain comes from the implementation details of Selenium. Selenium works by sending commands as HTTP requests (POST, GET) to the browser driver server. As you might already know, I/O-bound tasks such as HTTP requests release the GIL while they wait, hence the performance gain.
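To illustrate the point about blocking calls, here is a small sketch that needs no browser. The function fake_webdriver_call is a hypothetical stand-in for one WebDriver HTTP round trip, and time.sleep plays the role of the network wait (it releases the GIL, as blocking socket I/O does). The 0.2 second delay and the example.com URLs are arbitrary assumptions.

```python
import time
from concurrent import futures


def fake_webdriver_call(url, delay=0.2):
    # Stand-in for a blocking WebDriver request; sleeping releases the GIL
    time.sleep(delay)
    return "title of " + url


def run_serial(urls):
    # One call after another: total time is roughly len(urls) * delay
    return [fake_webdriver_call(url) for url in urls]


def run_threaded(urls):
    # One thread per URL: the waits overlap, so total time is roughly one delay
    with futures.ThreadPoolExecutor(max_workers=len(urls)) as executor:
        return list(executor.map(fake_webdriver_call, urls))


def timed(fn, urls):
    start = time.perf_counter()
    result = fn(urls)
    return result, time.perf_counter() - start


urls = ["https://example.com/%d" % i for i in range(4)]
serial_titles, serial_seconds = timed(run_serial, urls)
threaded_titles, threaded_seconds = timed(run_threaded, urls)
print(f"serial: {serial_seconds:.2f}s, threaded: {threaded_seconds:.2f}s")
```

Real Selenium calls also spend time inside the browser itself, so the ratio you measure will differ, but the overlap of waiting time is the same mechanism.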

Dealing with exceptions

We can make small modifications to the example above to deal with exceptions in the spawned threads. Instead of executor.map we use executor.submit, which returns the title wrapped in a Future instance.

To access the returned title we can call future_titles[index].result(), where index ranges over range(len(links)), or simply use a for loop like below.

with futures.ThreadPoolExecutor() as executor:
  future_titles = [executor.submit(selenium_title, link) for link in links]
  for future_title, link in zip(future_titles, links):
    try:
      title = future_title.result() # can use `timeout` to wait max seconds for each thread
    except Exception as exc: # this thread might have had an exception
      print('url {0} generated an exception: {1}'.format(link, exc))

Note that besides iterating over future_titles we also iterate over links, so if an exception occurs in some thread we know which URL (link) was responsible.

The futures.Future class is cool because it gives you control over the result received from each thread, such as whether it completed correctly or raised an exception, among other things. More about that here.

Also important to mention: futures.as_completed is better if you don't care in which order the threads return items. But since the syntax to handle exceptions with it is a little ugly, I omitted it here.
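For reference, a minimal sketch of the futures.as_completed variant. The function fetch_title is a hypothetical stand-in for selenium_title so the snippet runs without a browser, and collect_titles and the example URLs are likewise assumptions for illustration. The key idea is a dict that maps each Future back to its URL, since results arrive in completion order rather than submission order.

```python
from concurrent import futures


def fetch_title(url):
    # Hypothetical stand-in for selenium_title; fails for URLs containing "broken"
    if "broken" in url:
        raise RuntimeError("could not load " + url)
    return "title of " + url


def collect_titles(urls, worker=fetch_title, max_workers=4):
    titles, errors = {}, {}
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which URL each future belongs to
        future_to_url = {executor.submit(worker, url): url for url in urls}
        # as_completed yields futures as they finish, in no particular order
        for future in futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                titles[url] = future.result()
            except Exception as exc:
                # result() re-raises whatever the worker raised
                errors[url] = exc
    return titles, errors
```

Passing worker=selenium_title would run the real Selenium function the same way.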

Performance gain and Threads

First, why I've always been using threads for speeding up my Selenium code:

  • On I/O-bound tasks, my experience with Selenium shows that there's minimal or no difference between using a pool of processes (Process) or threads (Threads). Similar conclusions about Python threads vs processes on I/O-bound tasks are also reached here.
  • We also know that processes use their own memory space, which means more memory consumption. Processes are also a little slower to spawn than threads.

I created a project to do this; it reuses webdriver instances for better performance:

https://github.com/testlabauto/local_selenium_pool

https://pypi.org/project/local-selenium-pool/

For running Python tests in parallel, you may consider using pytest-xdist to handle the multiple processes for you: https://github.com/pytest-dev/pytest-xdist . That's a plugin for the pytest framework.

And for running Python Selenium tests in parallel with pytest, there's a framework that may simplify this for you, SeleniumBase: https://github.com/seleniumbase/SeleniumBase . It functions as a pytest plugin, so you can use the parallelism args provided by pytest-xdist and run all your Selenium Python tests in parallel as needed. E.g.: pytest -n 4 for 4 parallel worker processes.

Notice: technical posts on this site follow the CC BY-SA 4.0 license. If you repost, please credit this site's URL or the original address.

 