python2.7 + multiprocessing + selenium: restart process on exception
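I seem to be having a problem with a Python script that uses multiprocessing. What it essentially does is take a list of ID codes and start processes that use Selenium with PhantomJS as the driver to navigate to a URL containing that ID code, extract the data into individual csv files, and then compile another csv file once all the processes have finished. Everything runs fine, except that sometimes one of the processes raises an exception that reads: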
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "modtest.py", line 11, in worker
    do_work(item)
  File "/home/mdrouin/Dropbox/Work/Dev/Python/WynInvScrape/items.py", line 14, in do_work
    driver = webdriver.PhantomJS()
  File "/usr/lib/python2.7/site-packages/selenium/webdriver/phantomjs/webdriver.py", line 50, in __init__
    self.service.start()
  File "/usr/lib/python2.7/site-packages/selenium/webdriver/phantomjs/service.py", line 72, in start
    raise WebDriverException("Can not connect to GhostDriver")
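I have tried various ways of restarting a process when the exception occurs, but whatever I do, once the processes finish the program hangs and does not continue or do anything else. What I essentially want is to restart the ID number that was being searched when a process crashes, and to carry on once all processes have finished. Here is an extremely stripped-down version of the code: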
from selenium import webdriver
from time import sleep
from bs4 import BeautifulSoup as bs
import multiprocessing
import datetime, time, csv, glob

num_procs = 8

def do_work(rsrt):
    driver = webdriver.PhantomJS()
    try:
        driver.get('http://www.example.com/get.php?resort=' + rsrt)
        # Assumed: this parsing step is missing from the trimmed-down snippet
        soup = bs(driver.page_source, 'html.parser')
        rows = []
        for row in soup.find_all('tr'):
            if row.find('input', {'name': 'booksubmit'}):
                wyncheckin = row.find('td', {'class': 'searchAvailDate'}).string
                wynnights = row.find('td', {'class': 'searchAvailNights'}).string
                wynroom = row.find('td', {'class': 'searchAvailUnitType'}).string
                # NOTE: wynresort is not defined in this trimmed-down snippet
                rows.append([wynresort, wyncheckin, wynroom])
        driver.quit()
        with open('/home/mdrouin/Dropbox/Work/Dev/Python/WynInvScrape/availability/'+rsrt+'.csv', 'wb') as f:
            writer = csv.writer(f)
            writer.writerows(row for row in rows if row)
        print 'Process ' + rsrt + ' End: ' + str(time.strftime('%c'))
    except:
        driver.quit()

def worker():
    # Pull items until the None sentinel arrives
    for item in iter( q.get, None ):
        do_work(item)
        q.task_done()
    q.task_done()

q = multiprocessing.JoinableQueue()
procs = []
for i in range(num_procs):
    procs.append( multiprocessing.Process(target=worker) )
    procs[-1].daemon = True
    procs[-1].start()

source = ['0017', '0113', '0020', '0013', '0038', '1028', '0115', '0105', '0041', '0037', '0043', '2026', '0165', '0164',
          '0033', '0126', '0116', '0103', '9135', '0185', '0206', '0053', '0062', '1020', '0019', '0042', '2028', '0213',
          '0211', '0163', '0073', '2020', '0214', '2140', '0084', '0193', '0095', '0064', '0196', '0028', '0068', '0074']

for item in source:
    q.put(item)

q.join()

# One sentinel per worker so every process leaves its loop
for p in procs:
    q.put( None )

q.join()

for p in procs:
    p.join()

print "Finished"

print 'Writing core output: ' + str(time.strftime('%c'))
with open('availability.csv', 'wb') as outfile:
    for csvfile in glob.glob('/home/mdrouin/Dropbox/Work/Dev/Python/WynInvScrape/availability/*.csv'):
        for line in open(csvfile, 'r'):
            outfile.write(line)
print 'Process End: ' + str(time.strftime('%c'))
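One way to solve this kind of problem is to have the function call itself again when it fails, something like:
def do_work(rsrt):
    if failed:
        return do_work(rsrt)
Of course, that will keep running until it succeeds, so you may want to pass in a counter and return False once it goes above some value.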
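To make the counter idea concrete, here is a minimal sketch rather than a drop-in fix. It assumes the scraping body of do_work has been moved into a separate callable, here given the hypothetical name scrape, and that the driver is created inside that callable so a failed PhantomJS start is caught too. The tries_left parameter, the flaky_scrape and always_fails stand-ins, and the reworked worker signature are also made up for illustration. The worker wraps each item in try/finally because q.join() on a JoinableQueue only returns once task_done() has been called for every item, so an item that is never acknowledged leaves the parent waiting.

```python
def do_work(scrape, rsrt, tries_left=3):
    """Run scrape(rsrt), calling ourselves again on failure.

    Returns True on success and False once the counter is used up.
    `scrape` is a hypothetical stand-in for the original do_work body.
    """
    try:
        scrape(rsrt)
        return True
    except Exception:
        # e.g. WebDriverException("Can not connect to GhostDriver")
        if tries_left <= 1:
            return False
        return do_work(scrape, rsrt, tries_left - 1)


def worker(q, scrape):
    """Queue consumer that acknowledges every item, even one that failed."""
    for item in iter(q.get, None):
        try:
            if not do_work(scrape, item):
                print('Giving up on ' + item)
        finally:
            # Must run no matter what happened above, otherwise
            # q.join() in the parent never returns.
            q.task_done()
    q.task_done()  # acknowledge the None sentinel


# Demo with stand-in scrapers instead of Selenium.
attempts = {'count': 0}


def flaky_scrape(rsrt):
    """Fails on the first two calls, then succeeds."""
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise RuntimeError('Can not connect to GhostDriver')


def always_fails(rsrt):
    raise RuntimeError('Can not connect to GhostDriver')


ok = do_work(flaky_scrape, '0017', tries_left=5)
gave_up = do_work(always_fails, '0113', tries_left=2)
```

With this shape, each process would be started with something like multiprocessing.Process(target=worker, args=(q, scrape)). An ID that still fails after the last attempt is reported and skipped, so the final csv merge can still run.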