
How can I use multi-threading for web crawling with Selenium WebDriver in Python?

I have a scenario where I go to a webpage, open each link in a new window, and check for specific documents. Going through every link takes an enormous amount of time. Is there a way to improve performance using multi-threading?

Known problem:

Selenium is not thread-safe.

However, I can create a separate driver instance for each thread, which takes care of this problem.

The code I am currently using:

    for tag in self.driver.find_elements_by_xpath('//body//a[@href]'):
        href = tag.get_attribute('href')  # link target of this anchor
        if not href:
            continue
        current_window = self.driver.current_window_handle
        self.driver.execute_script('window.open(arguments[0]);', href)
        time.sleep(10)  # fixed wait for the new window to load
        new_window = [window for window in self.driver.window_handles if window != current_window][0]
        self.driver.switch_to.window(new_window)
        # Execute required operations
        func_url = self.driver.current_url
        self.driver.close()
        self.driver.switch_to.window(current_window)
        if not func_url:
            continue
        # Only URLs that look like documents are of interest
        if re.search(r'\.(?:pdf|png|jpg|doc|ppt|zip)', func_url):
            cat = fd.findCat(func_url)
            fd.findDate(func_url)
        time.sleep(10)

If you want to save time, the first thing to do is use WebDriverWait instead of fixed sleeps:

Here is an example:

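from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

contador = 0  # counter used to build the ID of the element to wait for

def exec_sync(driver, script, waitTime, scriptName):
    global contador
    print(scriptName)
    driver.execute_script(script)
    # Block only until the element appears, up to waitTime seconds
    WebDriverWait(driver, waitTime).until(
        EC.presence_of_element_located(
            (By.ID, 'uniqueIdentifier' + str(contador))
        )
    )
    contador += 1
    print('Success')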
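To see why an explicit wait tends to beat a fixed `time.sleep(10)`, here is a rough, standard-library-only sketch of the kind of polling that `WebDriverWait(...).until(...)` performs: it checks a condition repeatedly and returns as soon as the condition is truthy. The names `wait_until`, `WaitTimeout`, and `page_ready` are made up for this illustration and are not part of Selenium, and the sketch assumes Python 3.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (hypothetical name)."""


def wait_until(condition, timeout, poll_frequency=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value as soon as it appears, or raises WaitTimeout
    once `timeout` seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %s s" % timeout)
        time.sleep(poll_frequency)


# Stand-in for "the new window has loaded": becomes true after about 0.2 s.
start = time.monotonic()


def page_ready():
    return time.monotonic() - start > 0.2


wait_until(page_ready, timeout=10, poll_frequency=0.05)
# The wait ends shortly after the condition holds, not after the full 10 s.
elapsed = time.monotonic() - start
```

The timeout is an upper bound rather than a fixed cost, so a page that loads quickly is processed quickly.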

Second, as you said, WebDriver is not thread-safe, so if several threads touch the same driver you have to guard it with a semaphore:

https://docs.python.org/2/library/threading.html#semaphore-objects
https://www.tutorialspoint.com/python/python_multithreading.htm

As long as threads share a single driver instance, this is the only safe way to use it from multiple threads.
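A minimal sketch of the semaphore idea, assuming several threads share one driver. `FakeDriver` is a hypothetical stand-in that records how many threads are inside it at once, so the example runs without a browser; `crawl` and `driver_guard` are likewise names invented here.

```python
import threading
import time


class FakeDriver:
    """Hypothetical stand-in for a shared WebDriver; records overlapping use."""

    def __init__(self):
        self.active = 0
        self.max_active = 0
        self.visited = []
        self._lock = threading.Lock()  # protects the bookkeeping only

    def get(self, url):
        with self._lock:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
        time.sleep(0.01)  # pretend to load the page
        with self._lock:
            self.visited.append(url)
            self.active -= 1


def crawl(driver, urls, driver_guard):
    """Visit each URL, holding the semaphore around every driver call."""
    for url in urls:
        with driver_guard:  # acquire() on entry, release() on exit
            driver.get(url)


driver = FakeDriver()
driver_guard = threading.Semaphore(1)  # one thread in the driver at a time
urls = ["https://example.com/page%d" % i for i in range(12)]
chunks = [urls[i::3] for i in range(3)]  # split the work across 3 threads

threads = [threading.Thread(target=crawl, args=(driver, chunk, driver_guard))
           for chunk in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Note that with a single shared driver the page loads are still serialized, so the semaphore buys safety rather than speed. Any speed-up has to come from work done outside the guarded section, or from giving each thread its own driver.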

Multi-threading plus WebDriverWait(...).until(...) is what you need to save time.
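The one-driver-per-thread idea from the question could look roughly like the sketch below. Each worker builds its own driver, processes its share of the links, and quits the driver when done, so no driver is ever touched by two threads. `crawl`, `crawl_chunk`, and `driver_factory` are hypothetical names, and `FakeDriver` is a stand-in so the example runs without a browser.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeDriver:
    """Hypothetical stand-in for a WebDriver, so the sketch runs anywhere."""

    created = 0
    _count_lock = threading.Lock()

    def __init__(self):
        with FakeDriver._count_lock:
            FakeDriver.created += 1
        self.owner = threading.get_ident()  # thread that built this driver
        self.current_url = None
        self.closed = False

    def get(self, url):
        # Fails loudly if a driver is ever used from a foreign thread.
        assert threading.get_ident() == self.owner
        self.current_url = url

    def quit(self):
        self.closed = True


def crawl_chunk(hrefs, driver_factory):
    """Process one share of the links with a driver owned by this thread."""
    driver = driver_factory()
    try:
        results = []
        for href in hrefs:
            driver.get(href)
            results.append((href, driver.current_url))
        return results
    finally:
        driver.quit()  # always release the browser, even on errors


def crawl(hrefs, driver_factory, workers=4):
    """Split hrefs across `workers` threads, one driver per thread."""
    chunks = [hrefs[i::workers] for i in range(workers)]
    chunks = [chunk for chunk in chunks if chunk]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chunk = pool.map(lambda c: crawl_chunk(c, driver_factory), chunks)
        return [pair for chunk in per_chunk for pair in chunk]


hrefs = ["https://example.com/doc%d.pdf" % i for i in range(10)]
results = crawl(hrefs, FakeDriver, workers=3)
```

With real Selenium you would pass something like `webdriver.Chrome` as `driver_factory`, assuming a matching browser driver is available. Each browser instance uses a fair amount of memory, so keeping `workers` small is probably wise.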
