Multithreading / Multiprocessing in Selenium
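I wrote a Python script that reads URLs from a text file and prints the href of an element on each page. My goal, however, is to do this faster and at a larger scale using multiprocessing or multithreading.
In the intended workflow, each browser process gets the hrefs from its current URL and then loads the next link from the queue in the same browser (say there are 5 browsers). Of course, each link should be scraped exactly once.
Example input file: HNlinks.txt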
https://news.ycombinator.com/user?id=ingve
https://news.ycombinator.com/user?id=dehrmann
https://news.ycombinator.com/user?id=thanhhaimai
https://news.ycombinator.com/user?id=rbanffy
https://news.ycombinator.com/user?id=raidicy
https://news.ycombinator.com/user?id=svenfaw
https://news.ycombinator.com/user?id=ricardomcgowan
Code:
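from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
# Read the URLs, dropping trailing newlines and blank lines.
with open("HNlinks.txt", "r") as input1:
    urls1 = [line.strip() for line in input1 if line.strip()]
for url in urls1:
    driver.get(url)
    # find_elements(By...) replaces the find_elements_by_* helpers removed in Selenium 4.3.
    links = driver.find_elements(By.CLASS_NAME, 'athing')
    for link in links:
        print(link.find_element(By.CSS_SELECTOR, 'a').get_attribute("href"))
driver.quit()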
Note: I have not test-run this answer locally. Please try it and provide feedback:
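from multiprocessing import Pool
from selenium import webdriver
from selenium.webdriver.common.by import By

def load_url(url):
    # Each call starts its own Chrome instance and closes it when done.
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        links = driver.find_elements(By.CLASS_NAME, 'athing')
        for link in links:
            print(link.find_element(By.CSS_SELECTOR, 'a').get_attribute("href"))
    finally:
        driver.quit()

if __name__ == "__main__":
    # Read the URLs, dropping trailing newlines and blank lines.
    with open("HNlinks.txt", "r") as input1:
        urls1 = [line.strip() for line in input1 if line.strip()]
    # How many concurrent processes do you want to spawn? One per URL here;
    # each one launches its own browser, so for a long list you will want to
    # cap this, for example at the number of cores your computer has.
    processes = len(urls1)
    p = Pool(processes)
    p.map(load_url, urls1)
    p.close()
    p.join()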
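The code above starts a new browser for every URL, whereas the question asks for a fixed number of browsers that each keep pulling the next link from a shared queue. A minimal, untested-against-a-real-browser sketch of that workflow with threads might look like the following. The names `make_chrome_session`, `worker`, and `scrape_all` are hypothetical helpers introduced here for illustration, and the default of 5 workers is taken from the question; the `athing` selector is simply carried over from the original script.

```python
import queue
import threading


def make_chrome_session():
    """Start one Chrome instance and return (scrape, close) callables for it."""
    # Imported lazily so the queue logic can be tried without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()

    def scrape(url):
        # Load the page in this session's browser and collect the hrefs.
        driver.get(url)
        rows = driver.find_elements(By.CLASS_NAME, "athing")
        return [row.find_element(By.CSS_SELECTOR, "a").get_attribute("href")
                for row in rows]

    return scrape, driver.quit


def worker(url_queue, results, lock, make_session):
    """Open one session and reuse it for URLs until the queue is empty."""
    scrape, close = make_session()
    try:
        while True:
            try:
                url = url_queue.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            try:
                outcome = scrape(url)
            except Exception as exc:
                outcome = exc  # record the failure, keep the worker alive
            with lock:
                results[url] = outcome
    finally:
        close()


def scrape_all(urls, num_workers=5, make_session=make_chrome_session):
    """Scrape each distinct URL once using num_workers long-lived sessions."""
    url_queue = queue.Queue()
    for url in dict.fromkeys(urls):  # drop duplicate URLs, keep input order
        url_queue.put(url)

    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker,
                         args=(url_queue, results, lock, make_session))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `scrape_all(urls)` with the stripped lines of HNlinks.txt should return a dict that maps each URL to its list of hrefs (or to the exception raised while scraping it). Because every URL is taken off the queue by exactly one worker, no link is scraped twice. Threads are used here on the assumption that the work is dominated by waiting on the browser; the same queue pattern could be adapted to `multiprocessing` if separate processes are preferred.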