
Python selenium multiprocessing

I've written a script in Python, in combination with Selenium, to scrape the links of different posts from a site's landing page and then get the title of each post by following the URL to its inner page. Although the content I parsed here is static, I used Selenium to see how it works with multiprocessing.

However, my intention is to do the scraping using multiprocessing. Until now I thought Selenium doesn't support multiprocessing, but it seems I was wrong.

My question: how can I reduce the execution time when Selenium is made to run using multiprocessing?

This is my try (it's a working one):

import requests
from urllib.parse import urljoin
from multiprocessing.pool import ThreadPool
from bs4 import BeautifulSoup
from selenium import webdriver

def get_links(link):
  res = requests.get(link)
  soup = BeautifulSoup(res.text,"lxml")
  # collect the absolute URLs of all question links on the landing page
  titles = [urljoin(url,items.get("href")) for items in soup.select(".summary .question-hyperlink")]
  return titles

def get_title(url):
  # a new headless Chrome instance is launched for every URL
  chromeOptions = webdriver.ChromeOptions()
  chromeOptions.add_argument("--headless")
  driver = webdriver.Chrome(chrome_options=chromeOptions)
  driver.get(url)
  # parse the inner page and print the post title
  sauce = BeautifulSoup(driver.page_source,"lxml")
  item = sauce.select_one("h1 a").text
  print(item)

if __name__ == '__main__':
  url = "https://stackoverflow.com/questions/tagged/web-scraping"
  ThreadPool(5).map(get_title,get_links(url))

how can I reduce the execution time using selenium when it is made to run using multiprocessing

A lot of time in your solution is spent on launching the webdriver for each URL. You can reduce this time by launching the driver only once per thread:

(... skipped for brevity ...)

threadLocal = threading.local()

def get_driver():
  driver = getattr(threadLocal, 'driver', None)
  if driver is None:
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--headless")
    driver = webdriver.Chrome(chrome_options=chromeOptions)
    setattr(threadLocal, 'driver', driver)
  return driver


def get_title(url):
  driver = get_driver()
  driver.get(url)
  (...)

(...)
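For reference, the parts skipped for brevity can be filled in from the question's own script. An assembled sketch (assuming a Selenium version that accepts the options= keyword) would look roughly like this:

import threading
import requests
from urllib.parse import urljoin
from multiprocessing.pool import ThreadPool
from bs4 import BeautifulSoup
from selenium import webdriver

threadLocal = threading.local()

def get_driver():
  # reuse one headless Chrome instance per worker thread
  driver = getattr(threadLocal, 'driver', None)
  if driver is None:
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--headless")
    driver = webdriver.Chrome(options=chromeOptions)
    setattr(threadLocal, 'driver', driver)
  return driver

def get_links(link):
  # same as in the question: collect the question URLs from the landing page
  res = requests.get(link)
  soup = BeautifulSoup(res.text, "lxml")
  return [urljoin(link, item.get("href")) for item in soup.select(".summary .question-hyperlink")]

def get_title(url):
  # same as in the question, but reusing the thread's driver
  driver = get_driver()
  driver.get(url)
  sauce = BeautifulSoup(driver.page_source, "lxml")
  print(sauce.select_one("h1 a").text)

if __name__ == '__main__':
  url = "https://stackoverflow.com/questions/tagged/web-scraping"
  ThreadPool(5).map(get_title, get_links(url))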

On my system this reduces the time from 1m 7s to just 24.895s, a reduction of roughly 63%. To test it yourself, download the full script.

Note: ThreadPool uses threads, which are constrained by the Python GIL. That's fine as long as the task is mostly I/O bound. Depending on the post-processing you do with the scraped results, you may want to use a multiprocessing.Pool instead. This launches parallel processes which, as a group, are not constrained by the GIL. The rest of the code stays the same.
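For instance, a minimal sketch of that swap (assuming the get_links/get_driver/get_title functions above are defined at module level, which they must be so the worker processes can pickle them) would only change the pool construction in the main block:

from multiprocessing import Pool

if __name__ == '__main__':
    url = "https://stackoverflow.com/questions/tagged/web-scraping"
    # 5 worker processes instead of 5 threads; each process builds and
    # reuses its own driver through the same get_driver() pattern
    with Pool(5) as pool:
        pool.map(get_title, get_links(url))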

The one potential problem I see with the clever one-driver-per-thread answer is that it omits any mechanism for "quitting" the drivers, leaving open the possibility of processes hanging around. I would make the following changes:

  1. Use a class Driver that will create the driver instance and store it in thread-local storage, but that also has a destructor that will quit the driver when the thread-local storage is deleted:
class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(options=options)

    def __del__(self):
        self.driver.quit() # clean up driver when we are cleaned up
        #print('The driver has been "quitted".')
  2. create_driver now becomes:
threadLocal = threading.local()

def create_driver():
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = Driver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver.driver
  3. Finally, after you have no further use for the ThreadPool instance but before it is terminated, add the following lines to delete the thread-local storage and force the Driver instances' destructors to be called (hopefully); a sketch of the assembled script follows below:
del threadLocal
import gc
gc.collect() # a little extra insurance
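Put together, a sketch of the modified script might look like this (reusing the question's get_links unchanged, and assuming the same get_title body as in the question):

import gc
import threading
import requests
from urllib.parse import urljoin
from multiprocessing.pool import ThreadPool
from bs4 import BeautifulSoup
from selenium import webdriver

class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(options=options)

    def __del__(self):
        self.driver.quit()  # clean up driver when we are cleaned up

threadLocal = threading.local()

def create_driver():
    # build one Driver per thread and cache it in thread-local storage
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = Driver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver.driver

def get_links(link):
    # unchanged from the question
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    return [urljoin(link, item.get("href")) for item in soup.select(".summary .question-hyperlink")]

def get_title(url):
    driver = create_driver()
    driver.get(url)
    sauce = BeautifulSoup(driver.page_source, "lxml")
    print(sauce.select_one("h1 a").text)

if __name__ == '__main__':
    url = "https://stackoverflow.com/questions/tagged/web-scraping"
    with ThreadPool(5) as pool:
        pool.map(get_title, get_links(url))
        # done with the pool, but before it is (implicitly) terminated:
        # drop the thread-local references so each Driver's __del__ runs and quits Chrome
        del threadLocal
        gc.collect()  # a little extra insurance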

My question: how can I reduce the execution time?

Selenium seems the wrong tool for web scraping - though I appreciate YMMV, in particular if you need to simulate user interaction with the web site or there is some JavaScript limitation/requirement.

For scraping tasks without much interaction, I have had good results using the open-source Scrapy Python package for large-scale scraping tasks. It does multiprocessing out of the box, it is easy to write new scripts and store the data in files or a database - and it is really fast.

Your script would look something like this when implemented as a fully parallel Scrapy spider (note that I did not test this; see the documentation on selectors).

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

    def parse(self, response):
        # yield the absolute URL of every question linked from the landing page
        for href in response.css('.summary .question-hyperlink::attr(href)').getall():
            yield {'link': response.urljoin(href)}
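The spider above only yields the post links. Since the original goal was the title on each inner page, a possible extension (again untested, and assuming the same h1 a selector as in the question) could follow each link and parse the title in a second callback:

import scrapy

class TitleSpider(scrapy.Spider):
    name = 'titlespider'
    start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

    def parse(self, response):
        # follow every question link; Scrapy schedules these requests concurrently
        for href in response.css('.summary .question-hyperlink::attr(href)').getall():
            yield response.follow(href, callback=self.parse_title)

    def parse_title(self, response):
        # grab the question title from the inner page
        yield {'title': response.css('h1 a::text').get()}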

To run it, put this into blogspider.py and run

$ scrapy runspider blogspider.py

See the Scrapy website for a complete tutorial.

Note that Scrapy also supports JavaScript through scrapy-splash, thanks to the pointer by @SIM. I haven't had any exposure to it so far, so I can't speak to it beyond the fact that it looks well integrated with how Scrapy works.
