Python: how to count line frequency across a large number of files in subdirectories using multiprocessing
I have a program that serially counts how often each line occurs across a set of files. The files can be in sub-directories. Each file contains a list of Wikipedia categories, one category per line. I would like to know the frequency count of the categories across all files. For example, a file called Los Angeles.txt might have the following lines in it:
City
Location
And I want a tab-separated file written out that lists each category with the number of times it was used, in descending order of count:
Person 3494
City 2000
Location 1
My current code is:
    import os
    from collections import defaultdict
    from operator import itemgetter

    dir = "C:\\Wikipedia\\Categories"
    l = [os.path.join(root, name) for root, _, files in os.walk(dir) for name in files]
    d = defaultdict(int)
    for file in l:
        with open(file, encoding="utf8") as f_in:
            for line in f_in:
                line = line.strip()  # Removes surrounding \n as well as spaces.
                if line != "":
                    d[line] += 1

    with open("C:\\Wikipedia\\category_counts.tsv", mode="w", encoding="utf8") as f_out:
        for k2, v2 in sorted(d.items(), key=lambda kv: kv[1], reverse=True):
            f_out.write(k2 + "\t" + str(v2) + "\n")
My question is: how can I use the Pool of the multiprocessing module to do this in parallel?
The issues that I'm wondering about are:
- Does the multiprocessing module only do processes or does it do threads as well, since this is an IO-bound problem?
- Can the Counter functionality from collections be incorporated in some way?
- Does os.walk already execute in a parallel manner?
- Does multiprocessing have some kind of dictionary functionality, similar to multiprocessing.Value, multiprocessing.Queue and multiprocessing.Array, that I should be using to share the counts between the processes and thereby get an aggregated frequency count at the end? Can you use a normal Python dict with multiprocessing, or will there be a sharing violation and corrupted data?
Can anyone help with a code example?
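For reference, one way the Pool approach asked about above could look is sketched below. It sidesteps the shared-dict question: each worker process returns its own Counter and the parent process adds them up. All function names (count_file, merge_counts, list_files, count_tree, write_tsv) and the chunksize value are illustrative assumptions, not part of the original code.

```python
import os
from collections import Counter
from multiprocessing import Pool


def count_file(path):
    """Return a Counter of the non-empty, stripped lines in one file."""
    counts = Counter()
    with open(path, encoding="utf8") as f_in:
        for line in f_in:
            line = line.strip()
            if line:
                counts[line] += 1
    return counts


def merge_counts(per_file_counts):
    """Add up an iterable of Counters into a single Counter."""
    total = Counter()
    for counts in per_file_counts:
        total.update(counts)
    return total


def list_files(root_dir):
    """Collect every file path under root_dir (os.walk itself runs serially)."""
    return [os.path.join(root, name)
            for root, _, files in os.walk(root_dir)
            for name in files]


def count_tree(root_dir, processes=None, chunksize=64):
    """Count lines across all files under root_dir using worker processes."""
    paths = list_files(root_dir)
    with Pool(processes=processes) as pool:
        # Each worker returns its own Counter; nothing is shared between
        # processes, so no locking is needed. The merge happens in the parent.
        return merge_counts(pool.imap_unordered(count_file, paths, chunksize))


def write_tsv(counts, out_path):
    """Write category<TAB>count, highest count first, ties sorted by name."""
    with open(out_path, mode="w", encoding="utf8") as f_out:
        for category, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
            f_out.write(f"{category}\t{n}\n")
```

A call such as write_tsv(count_tree("C:\\Wikipedia\\Categories"), "C:\\Wikipedia\\category_counts.tsv") would have to sit under an `if __name__ == "__main__":` guard, because on Windows each worker process re-imports the main module. Whether processes beat threads here depends on how much of the time goes into reading files versus counting, so it is worth timing both.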
I think I have figured it out (might be wrong, but it seems to work):
    import os
    import threading
    from collections import Counter, defaultdict
    from datetime import datetime
    import concurrent.futures

    # Walk the directory tree and store the path of every Wikipedia article category file in a list. Takes about 1 second.
    dir = "D:\\Downloads\\WikipediaAFLatest\\Categories"
    l = [os.path.join(root, name) for root, _, files in os.walk(dir) for name in files]
    print('After file list')

    t1 = datetime.now()
    d = defaultdict(int)
    d_lock = threading.Lock()  # d[key] += n is a read-modify-write, so guard it.

    def do_one_file(filename):
        # Count the categories of one file locally, then merge into the shared dict d.
        local_counts = Counter()
        with open(filename, encoding="utf8") as f_in:
            for line in f_in:
                line = line.strip()  # Removes surrounding \n as well as spaces.
                if line != "":
                    local_counts[line] += 1
        with d_lock:
            for category, count in local_counts.items():
                d[category] += count
        return True

    # For each article (file) loop through all the categories.
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Run do_one_file for each file in the list l. The shared dict d is updated.
        # list() consumes the results so that an exception in a worker is re-raised here.
        results = list(executor.map(do_one_file, l))
    t2 = datetime.now()
    print('After frequency counts: ' + str(t2 - t1))

    t1 = datetime.now()
    with open("D:\\Downloads\\WikipediaAFLatest\\category_counts_threaded.tsv", mode="w", encoding="utf8") as f_out:
        for k2, v2 in sorted(d.items(), key=lambda kv: (-kv[1], kv[0])):  # Reverse sort on count, normal sort on category.
            f_out.write(k2 + "\t" + str(v2) + "\n")
    t2 = datetime.now()
    print('After sorted frequency counts: ' + str(t2 - t1))
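The answers were:
- Since this is an IO-bound problem, I should use threading instead of multiprocessing.
- Threads run inside the same process, so every worker started by the concurrent.futures map can read and update the shared dict. Note that map does not lock variables automatically: in CPython a single dict operation will not corrupt the dict, but an update such as d[key] += 1 is a read-modify-write, so an explicit lock (or per-file counts that are merged afterwards) is the safe way to avoid lost updates. A worker process is a separate process and therefore can't access the current process's variables. For processes, use a manager (multiprocessing.Manager) as per this answer.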
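The manager approach mentioned for processes could be sketched roughly as follows. This is a minimal illustration, assuming hypothetical helper names (count_lines, add_to_shared, worker, count_with_manager); it is not the code from the linked answer. Every access to a manager dict is a round trip to the manager process, so each worker counts locally and merges once per file.

```python
from collections import Counter
from multiprocessing import Manager, Pool


def count_lines(lines):
    """Count the non-empty, stripped lines of an iterable of strings."""
    return Counter(s for s in (line.strip() for line in lines) if s)


def add_to_shared(shared, lock, counts):
    """Merge one Counter into a shared mapping while holding the lock."""
    with lock:
        for key, n in counts.items():
            shared[key] = shared.get(key, 0) + n


def worker(args):
    """Count one file and merge the result into the manager's dict."""
    path, shared, lock = args
    with open(path, encoding="utf8") as f_in:
        add_to_shared(shared, lock, count_lines(f_in))


def count_with_manager(paths, processes=None):
    """Aggregate counts from worker processes through a Manager dict."""
    with Manager() as manager:
        shared = manager.dict()  # proxy to a dict living in the manager process
        lock = manager.Lock()    # proxy lock that can be sent to the workers
        with Pool(processes=processes) as pool:
            pool.map(worker, [(path, shared, lock) for path in paths])
        return shared.copy()     # plain dict, copied out before the manager exits
```

As with any Pool code, count_with_manager would need to be called from under an `if __name__ == "__main__":` guard. Because of the inter-process traffic, having each worker return its counts and merging them in the parent is often simpler and likely faster than sharing a dict this way.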