Unable to exploit multiple cores with multiprocessing in Python
I am trying to use the Python multiprocessing library to exploit the multiple cores in my computer to process text in millions of files. The code below shows the master and worker functions: they take a path as input and return the (10,000 or fewer) words that occur most often across all the files in that path (the functions themselves are not the problem, they are only included for reference).
import multiprocessing as mp
import os
from os import listdir
from os.path import isfile, join
from collections import defaultdict
from lxml import html

def worker(fileNameList):
    '''Takes a file name list and returns a word frequency map of all the files in a dict'''
    vacob = dict()
    for fileName in fileNameList:
        xmlfile = open(fileName)
        tree = html.fromstring(xmlfile.read())
        paras = tree.xpath("//title/text()|//headline/text()|//text/p/text()")
        docString = "".join(paras)
        wordList = preprocess_pipeline(docString)
        for word in wordList:
            if vacob.has_key(word):
                vacob[word] = vacob[word] + 1
            else:
                vacob[word] = 1
        xmlfile.close()
    output.put(vacob)

def master(path, n=8):
    '''Takes a path as input and returns a vocabulary of (10000 or less) words for all the files in the path'''
    vacob = defaultdict(int)
    xmlFiles = [f for f in listdir(path) if isfile(join(path, f)) and os.path.splitext(f)[1] == '.xml']
    length = len(xmlFiles)
    parts = length / n    # integer division; any remainder files are skipped
    processes = list()
    for i in range(n):
        processes.append(mp.Process(target=worker, args=[xmlFiles[i*parts:(i+1)*parts]]))
    for i in processes:
        i.start()
    for i in processes:
        i.join()
    for j in range(n):
        results = output.get()
        for word in results:
            vacob[word] += results[word]    # merge each worker's counts
    vacob = sorted(vacob, key=vacob.get, reverse=True)
    if len(vacob) < 10000:
        return vacob
    else:
        return vacob[:10000]

output = mp.Queue()
vocab = master(path)    # path is defined elsewhere
This should utilize all 8 cores of my computer, but all of the processes share just one core of my CPU. The image below shows that every process spawned by my textprocessing.py script is using only a single core. How can I make the script use all the available cores?
When I tried to debug by printing which file each worker is processing, it seemed to utilize all the cores. I still don't understand why a simple print statement would make it use all the cores, though.
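For reference, a minimal CPU-only sketch (the busy() loop below is a hypothetical stand-in, not part of my script) that can be used to check whether plain multiprocessing.Process work gets spread across cores on this machine:

import multiprocessing as mp

def busy(n):
    # Hypothetical pure-CPU worker: just spin so the process keeps a core busy.
    total = 0
    for i in xrange(n):
        total += i * i
    return total

if __name__ == '__main__':
    procs = [mp.Process(target=busy, args=(10 ** 8,)) for _ in range(8)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

If this keeps all eight cores busy, process scheduling itself is not the limiting factor.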
Here is the modified code with the debug prints.
import multiprocessing as mp
import os
from os import listdir
from os.path import isfile, join
from collections import defaultdict
from lxml import html

def worker(fileNameList, no):
    '''Takes a file name list and returns a word frequency map of all the files in a dict'''
    vacob = dict()
    for fileName in fileNameList:
        print "processing ", fileName, " worker", no
        xmlfile = open(fileName)
        tree = html.fromstring(xmlfile.read())
        paras = tree.xpath("//title/text()|//headline/text()|//text/p/text()")
        docString = "".join(paras)
        wordList = preprocess_pipeline(docString)
        for word in wordList:
            if vacob.has_key(word):
                vacob[word] = vacob[word] + 1
            else:
                vacob[word] = 1
        xmlfile.close()
    output.put(vacob)

def master(path, n=8):
    '''Takes a path as input and returns a vocabulary of (10000 or less) words for all the files in the path'''
    vacob = defaultdict(int)
    xmlFiles = [f for f in listdir(path) if isfile(join(path, f)) and os.path.splitext(f)[1] == '.xml']
    length = len(xmlFiles)
    parts = length / n    # integer division; any remainder files are skipped
    processes = list()
    for i in range(n):
        processes.append(mp.Process(target=worker, args=[xmlFiles[i*parts:(i+1)*parts], i]))
    for i in processes:
        i.start()
    for i in processes:
        i.join()
    for j in range(n):
        results = output.get()
        for word in results:
            vacob[word] += results[word]    # merge each worker's counts
    vacob = sorted(vacob, key=vacob.get, reverse=True)
    if len(vacob) < 10000:
        return vacob
    else:
        return vacob[:10000]

output = mp.Queue()
vocab = master(path)    # path is defined elsewhere
Here are the screenshots of htop and the console:
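For reference, the same fan-out / fan-in pattern can also be expressed with multiprocessing.Pool; the sketch below is only a generic illustration (count_words is a hypothetical stand-in for the real per-chunk work), not the actual script:

import multiprocessing as mp
from collections import defaultdict

def count_words(fileNameList):
    # Hypothetical stand-in for the real per-chunk work: count words in each file.
    counts = defaultdict(int)
    for fileName in fileNameList:
        with open(fileName) as f:
            for word in f.read().split():
                counts[word] += 1
    return dict(counts)

def master_with_pool(fileNames, n=8):
    # Split the file list into n chunks and let the pool run them in parallel.
    parts = len(fileNames) / n
    chunks = [fileNames[i * parts:(i + 1) * parts] for i in range(n)]
    pool = mp.Pool(processes=n)
    merged = defaultdict(int)
    for result in pool.map(count_words, chunks):    # blocks until all chunks are done
        for word, count in result.items():
            merged[word] += count
    pool.close()
    pool.join()
    return sorted(merged, key=merged.get, reverse=True)[:10000]

pool.map blocks until every chunk has been processed, so the merge step only runs once all workers have returned their dictionaries.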