
Selenium Multiprocessing Help in Python 3.4

I am in over my head trying to use Selenium to get the number of results for specific searches on a website. Basically, I'd like to make the process run faster. I have code that works by iterating over search terms and then over newspapers, and outputs the collected data to a CSV. Currently this produces 3 search terms x 3 newspapers over 3 years, giving me 9 CSVs at about 10 minutes per CSV.

I would like to use multiprocessing to run each search and newspaper combination simultaneously, or at least to speed things up. I've tried to follow other examples on here, but have not been able to implement them successfully. Below is my code so far:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import os
import pandas as pd
from multiprocessing import Pool

def websitesearch(search):
    try:
        start = list_of_inputs[0]
        end = list_of_inputs[1]
        newsabbv=list_of_inputs[2]
        directory=list_of_inputs[3]
        os.chdir(directory)

        if search == broad:
            specification = "broad"
            relPapers = newsabbv

        elif search == narrow:
            specification = "narrow"
            relPapers = newsabbv

        elif search == general:
            specification = "allarticles"
            relPapers = newsabbv

        else:
            for newspapers in relPapers:

                # ...rest of code here that gets the data and puts it in a list named all_Data...

                browser.close()
                df = pd.DataFrame(all_Data)
                df.to_csv(filename, index=False)          

    except:
        print('error with item')



if __name__ == '__main__':
    # ...Initializing values and things like that go here. This helps with the setup for search...

    #These are things that go into the function        
    start = ["January",2015]
    end = ["August",2017]
    directory = "STUFF GOES HERE"
    newsabbv = all_news_abbv
    search_list = [narrow, broad, general]

    list_of_inputs = [start,end,newsabbv,directory]    

    pool = Pool(processes=4)
    for search in search_list:
        pool.map(websitesearch, search_list)
        print(list_of_inputs)        

If I add a print statement to the main block, it prints, but nothing else ever happens. I'd appreciate any and all help. I left out the code that gets the values and puts them into a list, since it's convoluted, but I know it works.

Thanks in advance for any and all help! Let me know if there is more information I can provide.

Isaac

EDIT: I have looked for more help online and realized that I misunderstood the purpose of mapping a list to a function with pool.map(fn, list). I have updated my code to reflect my current approach, which still isn't working. I also moved the initializing values into the main block.
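For reference, pool.map(fn, list) calls fn once per element of the list in a worker process and collects the return values; a minimal sketch with illustrative names:

from multiprocessing import Pool

def double(item):
    return item * 2                        # stand-in for the real per-item work

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        results = pool.map(double, [1, 2, 3, 4])   # double is called once per element
    print(results)                         # [2, 4, 6, 8]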

I don't think it can be multiprocessed your way, because Selenium still forces a queue-like, one-at-a-time process (not the queue module).

The reason is that Selenium can only handle one window; it cannot handle several windows or browser tabs at the same time (a limitation of the window_handle features). That means your multiple processes only parallelize the in-memory data processing that is sent to Selenium or crawled by Selenium. By trying to run the Selenium crawl inside one script file, you make Selenium the source of the bottleneck.

The best way to get real multiprocessing is:

  1. Make a script that uses Selenium to crawl a given URL, and save it as a file, e.g. crawler.py. Make sure the script has a print command to print the result.

e.g.:

# crawler.py
import sys
from selenium import webdriver   # plus any other modules your crawl needs

url = sys.argv[1]                # the URL arrives as the first command-line argument

driver = webdriver.Chrome()      # open the browser; use whichever driver you normally use
driver.get(url)
# ...just continue the script based on your method...
result = ...                     # placeholder: whatever you extract from the page

print(result)                    # the parent process will capture this output
driver.quit()
sys.exit(0)
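Run on its own, it would be invoked like this (the URL is only an illustration; the parent script passes the real one):

python crawler.py "https://example.com/page1"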

I can't give more explanation here, because this is the main core of the process, and only you understand what you want to do on that website.

  2. Make another script file that does the following:

a. Divide the URLs. Multiprocessing means creating several processes and running them together across all CPU cores, and the best way to do that is to start by dividing the input. In your case that is probably the target URLs (you don't tell us which website you want to crawl), and every page of the website has a different URL. Just collect all the URLs and divide them into several groups (best practice: your CPU cores - 1); see the split sketch after the example below.

e.g.:

import multiprocessing as mp

cpucore = mp.cpu_count() - 1   # leave one core free for the main process
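As a rough sketch of the split itself (url_list is a hypothetical name for the full list of pages you collected):

url_groups = [url_list[i::cpucore] for i in range(cpucore)]   # round-robin split into cpucore groups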

b. Send each URL for processing to the crawler.py you already made (via subprocess, or another module, e.g. os.system), and make sure you run at most cpucore copies of crawler.py at once.

e.g.:

import subprocess
import multiprocessing as mp

crawler = r'YOUR FILE DIRECTORY\crawler.py'

def divide_urls(all_urls, n_groups):
    # split the collected URLs into n_groups (round-robin), one group per worker
    return [all_urls[i::n_groups] for i in range(n_groups)]

def crawl_group(url_group):
    # one worker per group; this replaces the repeated target1/target2/target3/target4 functions
    results = []
    for url in url_group:
        # do you see the combination between python crawler and url?
        # the cmd command becomes: python crawler.py "value",
        # and the "value" is caught by the sys.argv[1] command in crawler.py
        proc = subprocess.Popen(['python', crawler, url], stdout=subprocess.PIPE)
        out, _ = proc.communicate()           # wait for crawler.py and read what it printed
        results.append(out.decode().strip())
        # continue the script, based on your need...
    return results

if __name__ == '__main__':
    cpucore = mp.cpu_count() - 1              # leave one core free for the main script
    all_urls = [...]                          # placeholder: the full list of target pages you collected
    url_groups = divide_urls(all_urls, cpucore)
    with mp.Pool(processes=cpucore) as pool:
        grouped_results = pool.map(crawl_group, url_groups)   # one process per group, run in parallel
        # you can make more or fewer groups, depending on your CPU cores

c. Get the printed result back into the memory of the main script.
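That is what the communicate() call in the sketch above does; for example, if crawler.py prints a single number per run (the parsing depends on whatever format you choose to print):

out, _ = proc.communicate()           # blocks until crawler.py exits and returns its stdout
count = int(out.decode().strip())     # turn the printed text back into a Python value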

d. Continue your script to process the data you have already collected.

  3. And last, run the whole multiprocess flow from the main script.

With this method:

You can open many browser windows and handle them at the same time, and because crawling a website is slower than processing data in memory, this method at least reduces the bottleneck in the data flow. That means it is faster than your previous approach.

Hopefully helpful... cheers
