
Multithreading / Multiprocessing in Selenium

I wrote a Python script that scrapes the URLs from a text file and prints out the href from an element. However, my goal is to make it faster so it can run on a larger scale with multiprocessing or multithreading.

In the workflow, each browser process would get the href from the current URL and then load the next link from the queue in the same browser instance (let's say there are 5). Of course, each link should be scraped exactly once.

Example input file: HNlinks.txt

https://news.ycombinator.com/user?id=ingve
https://news.ycombinator.com/user?id=dehrmann
https://news.ycombinator.com/user?id=thanhhaimai
https://news.ycombinator.com/user?id=rbanffy
https://news.ycombinator.com/user?id=raidicy
https://news.ycombinator.com/user?id=svenfaw
https://news.ycombinator.com/user?id=ricardomcgowan

Code:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

# Read the URLs to scrape, one per line
with open("HNlinks.txt", "r") as input1:
    urls1 = [line.strip() for line in input1 if line.strip()]

for url in urls1:
    driver.get(url)

    # Each row with class 'athing' contains the link we want
    links = driver.find_elements(By.CLASS_NAME, 'athing')
    for link in links:
        print(link.find_element(By.CSS_SELECTOR, 'a').get_attribute("href"))

driver.quit()

Using multiprocessing

Note: I have not test-run this answer locally. Please try it and give feedback:

from multiprocessing import Pool
from selenium import webdriver
from selenium.webdriver.common.by import By

# Read the URLs once in the parent process
with open("HNlinks.txt", "r") as input1:
    urls1 = [line.strip() for line in input1 if line.strip()]

def load_url(url):
    # Each process opens its own browser instance
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        links = driver.find_elements(By.CLASS_NAME, 'athing')
        for link in links:
            print(link.find_element(By.CSS_SELECTOR, 'a').get_attribute("href"))
    finally:
        # Make sure the browser is closed even if the page fails to load
        driver.quit()

if __name__ == "__main__":
    # How many concurrent processes do you want to spawn?
    # This is also limited by the number of cores your computer has.
    processes = len(urls1)
    p = Pool(processes)
    p.map(load_url, urls1)
    p.close()
    p.join()
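
This answer opens a fresh browser for every URL. To match the workflow described in the question, where a fixed number of browsers (say 5) each take the next link from a shared queue and every link is scraped exactly once, here is a minimal, untested sketch using Pool's initializer argument; the init_driver helper and the pool size of 5 are illustrative choices, not part of the original code:

from multiprocessing import Pool
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = None  # one browser per worker process

def init_driver():
    # Runs once in each worker process; the browser it creates
    # is reused for every URL that worker receives
    global driver
    driver = webdriver.Chrome()

def load_url(url):
    driver.get(url)
    for link in driver.find_elements(By.CLASS_NAME, 'athing'):
        print(link.find_element(By.CSS_SELECTOR, 'a').get_attribute("href"))

if __name__ == "__main__":
    with open("HNlinks.txt", "r") as f:
        urls = [line.strip() for line in f if line.strip()]
    # 5 worker processes -> 5 reusable browser instances
    with Pool(5, initializer=init_driver) as p:
        p.map(load_url, urls)

Because Pool.map hands each URL to exactly one worker, every link is scraped once, and each worker amortizes the Chrome startup cost over many pages. One caveat: the browsers are not explicitly quit here, since Pool has no per-worker teardown hook; cleaning them up would need extra logic.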
