How can I use multithreading for web crawling with Selenium WebDriver in Python?

Question

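I have a scenario where I go to a web page, open each link in a new window, and check it for specific documents. Visiting every link this way takes a lot of time, so is there a way to use multithreading to improve performance?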

Known issue

Selenium is not thread-safe.

However, I can create multiple driver instances, one per thread, to work around this.
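The one-driver-per-thread idea can be sketched roughly as follows. This is a minimal sketch, not the asker's code: the names `crawl`, `check`, `factory` and `default_driver_factory` are hypothetical, and Chrome is only an assumed browser choice. Each worker thread lazily creates its own driver through `threading.local`, so no driver is ever shared between threads.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def default_driver_factory():
    """Hypothetical default factory; assumes Selenium and Chrome are installed."""
    from selenium import webdriver  # imported lazily, only when a real driver is needed
    return webdriver.Chrome()


def crawl(urls, check, factory=default_driver_factory, workers=4):
    """Visit every URL with one driver per worker thread and return {url: check(driver)}."""
    local = threading.local()   # each thread sees its own attributes
    drivers = []                # every driver created, so all can be quit at the end
    lock = threading.Lock()

    def visit(url):
        driver = getattr(local, "driver", None)
        if driver is None:
            # First task on this thread: create the thread's private driver.
            driver = factory()
            local.driver = driver
            with lock:
                drivers.append(driver)
        driver.get(url)
        return url, check(driver)

    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return dict(pool.map(visit, urls))
    finally:
        # The pool has shut down by now, so no thread is still using a driver.
        for driver in drivers:
            driver.quit()
```

With a real browser, a call might look like `crawl(hrefs, lambda d: d.current_url, workers=4)`, where `hrefs` is the list of link targets collected from the start page beforehand. At most `workers` browsers are started, since each pool thread creates one driver and reuses it for every URL it handles.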

The code I am currently using

    # Assumes `import re` and `import time` at module level; `fd` is the asker's own helper object.
    for tag in self.driver.find_elements_by_xpath('//body//a[@href]'):
        href = tag.get_attribute('href')  # link target to open in a new window
        current_window = self.driver.current_window_handle
        if href:
            self.driver.execute_script('window.open(arguments[0]);', href)
            time.sleep(10)
            new_window = [window for window in self.driver.window_handles if window != current_window][0]
            self.driver.switch_to.window(new_window)
            # Execute required operations
            func_url = self.driver.current_url
            self.driver.close()
            self.driver.switch_to.window(current_window)
            if not func_url:
                continue
            if re.search(r'\.(?:pdf|png|jpg|doc|ppt|zip)', func_url):
                cat = fd.findCat(func_url)
                fd.findDate(func_url)
            time.sleep(10)

Answer 1

If you want to save time, the first thing to do is use WebDriverWait instead of fixed sleeps:

Here is an example

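    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    contador = 0  # counter used to build the id of the element to wait for

    def exec_sync(driver, script, waitTime, scriptName):
        global contador
        print(scriptName)
        driver.execute_script(script)
        # Wait at most waitTime seconds, returning as soon as the element exists
        WebDriverWait(driver, waitTime).until(
            EC.presence_of_element_located(
                (By.ID, 'uniqueIdentifier' + str(contador))
            )
        )
        contador += 1
        print('Success')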

Second, as you said, WebDriver is not designed for multithreading, so you have to use a semaphore:

https://docs.python.org/2/library/threading.html#semaphore-objects https://www.tutorialspoint.com/python/python_multithreading.htm
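One possible reading of the semaphore suggestion, shown here only as a rough sketch: a `threading.BoundedSemaphore` caps how many browser sessions run at the same time, while every thread still uses its own driver. The names `check_links`, `factory`, `check` and `max_browsers` are hypothetical and do not come from the answer; `factory` is assumed to be any callable that returns a new WebDriver.

```python
import threading


def check_links(urls, factory, check, max_browsers=3):
    """Run check(driver) for each URL, with at most max_browsers drivers alive at once."""
    slots = threading.BoundedSemaphore(max_browsers)
    results = {}
    results_lock = threading.Lock()

    def worker(url):
        with slots:  # blocks while max_browsers sessions are already open
            driver = factory()
            try:
                driver.get(url)
                outcome = check(driver)
            finally:
                driver.quit()  # always release the browser, even on errors
        with results_lock:
            results[url] = outcome

    threads = [threading.Thread(target=worker, args=(url,)) for url in urls]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

This variant starts one thread per URL, which is simple but only reasonable for a modest number of links. The semaphore, not the thread count, is what limits the number of open browsers.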

This is the only way to do multithreading here.

Multithreading plus WebDriverWait(...).until is what you need to save time.

Solution 1
1 Accepted 2017-04-25 19:01:08
