Selenium and multiprocessing in Python

In my Django app I use Selenium to crawl and parse some HTML pages. I tried to introduce multiprocessing to improve performance. This is my code:

import os
from selenium import webdriver
from multiprocessing import Pool

os.environ["DISPLAY"]=":56017"

def render_js(url):
    driver = webdriver.Firefox()
    driver.set_page_load_timeout(300)
    driver.get(url)
    text = driver.page_source
    driver.quit()
    return text

def parsing(url):
    text = render_js(url)
    # ... parsing the text ...
    # ... write in db ...


url_list = ['www.google.com','www.python.com','www.microsoft.com']
pool = Pool(processes=2)
pool.map_async(parsing, url_list)
pool.close()
pool.join()

I run into this problem when two processes work simultaneously and both use Selenium: the first process starts Firefox with 'www.google.it' and returns the correct text, but the second, with the URL 'www.python.com', returns the text of www.google.it instead of www.python.com. Can you tell me where I'm going wrong?

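from selenium import webdriver
from multiprocessing import Pool

def parsing(url):
    # Each worker process starts its own browser instance
    driver = webdriver.Chrome()
    driver.set_page_load_timeout(300)
    driver.get(url)
    text = driver.page_source
    driver.quit()  # quit() ends the whole session, not just the current window
    return text

if __name__ == '__main__':
    # Every URL carries an explicit http:// scheme
    url_list = ['http://www.google.com', 'http://www.python.com']
    pool = Pool(processes=4)
    ret = pool.map(parsing, url_list)
    pool.close()
    pool.join()
    for text in ret:
        print(text[:30])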

I tried running your code and Selenium complained about bad URLs. Adding http:// to each URL made it work.
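If the URLs come from somewhere you don't control, one way to apply that fix is to normalize them before handing them to the pool. This is only a minimal sketch: `ensure_scheme` is a hypothetical helper name, not part of Selenium, and defaulting to `http` is an assumption you may want to change.

```python
def ensure_scheme(url, default_scheme="http"):
    """Return url unchanged if it already has a scheme, else prefix one.

    Hypothetical helper: the check is deliberately simple and only looks
    for the "://" separator.
    """
    if "://" in url:
        return url
    return "{}://{}".format(default_scheme, url)

# The URL list from the question, which has no schemes
url_list = ['www.google.com', 'www.python.com', 'www.microsoft.com']
fixed_urls = [ensure_scheme(u) for u in url_list]
```

`fixed_urls` can then be passed to `pool.map(parsing, fixed_urls)` in place of the original list.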
