How to use multiprocessing in Python to count line-item frequencies across a large number of files in sub-directories

Question

I have a program that serially counts how often each line occurs across a set of files. The files can be in sub-directories. Each file contains a list of Wikipedia categories, one category per line. I would like to know the frequency count of the categories across all files. For example, a file called Los Angeles.txt might contain the following lines:

City
Location

And I want a tab-separated file written out that lists each category with the number of times it was used, in descending order of count:

Person 3494
City 2000
Location 1

My current code is:

import os
from collections import defaultdict

root_dir = "C:\\Wikipedia\\Categories"    # Renamed from dir, which shadows the built-in dir().
l = [os.path.join(root, name) for root, _, files in os.walk(root_dir) for name in files]

d = defaultdict(int)

for file in l:
    with open(file, encoding="utf8") as f_in:
        for line in f_in:
            line = line.strip()    # Removes surrounding \n as well as spaces.
            if line != "":
                d[line] += 1

with open("C:\\Wikipedia\\category_counts.tsv", mode="w", encoding="utf8") as f_out:    
    for k2, v2 in sorted(d.items(), key=lambda kv: kv[1], reverse=True):
        f_out.write(k2 + "\t" + str(v2) + "\n")

My question is: how can I use the Pool of the multiprocessing module to do this in parallel?

The issues that I'm wondering about are:

Does the multiprocessing module only do processes, or does it do threads as well, since this is an IO-bound problem?
Can the Counter functionality from collections be incorporated in some way?
Does os.walk already execute in a parallel manner?
Is there some sort of dictionary functionality in multiprocessing, similar to multiprocessing.Value, multiprocessing.Queue and multiprocessing.Array, that I should be using to share the counts between the processes and thereby get an aggregated frequency count at the end? Can you use a normal Python dict with multiprocessing, or will there be a sharing violation and corrupted data?

Can anyone help with a code example?

Answer 1

I think I have figured it out (might be wrong, but it seems to work):

import os
from collections import Counter
from datetime import datetime
import concurrent.futures

# Walk the Wikipedia article category files and store each path and filename in a list. Takes about 1 second.
root_dir = "D:\\Downloads\\WikipediaAFLatest\\Categories"
l = [os.path.join(root, name) for root, _, files in os.walk(root_dir) for name in files]
print('After file list')

t1 = datetime.now()

d = Counter()

def do_one_file(filename):
    # Count into a local Counter so that no state is shared between threads.
    counts = Counter()
    with open(filename, encoding="utf8") as f_in:
        for line in f_in:
            line = line.strip()    # Removes surrounding \n as well as spaces.
            if line != "":
                counts[line] += 1
    return counts

# For each article (file) count its categories in a worker thread.
with concurrent.futures.ThreadPoolExecutor() as executor:
    # executor.map yields one Counter per file. Merging them here, in the main thread,
    # needs no lock, and iterating the results re-raises any exception from a worker.
    for counts in executor.map(do_one_file, l):
        d.update(counts)

t2 = datetime.now()
print('After frequency counts: ' + str(t2 - t1))                    

t1 = datetime.now()
with open("D:\\Downloads\\WikipediaAFLatest\\category_counts_threaded.tsv", mode="w", encoding="utf8") as f_out:    
    for k2, v2 in sorted(d.items(), key=lambda kv: (-kv[1], kv[0])):    # Descending by count, then ascending by category
        f_out.write(k2 + "\t" + str(v2) + "\n")

t2 = datetime.now()
print('After sorted frequency counts: ' + str(t2 - t1))

The answers were:

I should use threading instead of multiprocessing.
Threads execute in the same process and can therefore access the same variables. Note that executor.map from concurrent.futures does not lock anything automatically: an update such as d[line] += 1 is not atomic, so several threads writing to one shared dict need a threading.Lock, or each call should return its own counts to be merged in the main thread. Processes each get their own memory and therefore can't access the current process' variables. For them use a manager as per this answer.
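For comparison, here is a rough sketch of how the process-based variant from the original question might look without any shared dictionary. It assumes each worker process can return a collections.Counter for one file, with the parent process adding them up, so no manager is involved. The function names (list_files, count_one_file, merge_counts, count_all, write_tsv) and the chunksize of 64 are made up for illustration, and I have not timed it against the threaded version.

```python
import os
import sys
from collections import Counter
from multiprocessing import Pool


def list_files(root):
    """Return the path of every file under root, including sub-directories."""
    return [os.path.join(d, name) for d, _, files in os.walk(root) for name in files]


def count_one_file(path):
    """Count the non-empty, stripped lines of one file (runs in a worker process)."""
    with open(path, encoding="utf8") as f_in:
        return Counter(s for s in (line.strip() for line in f_in) if s)


def merge_counts(partials):
    """Add up the per-file Counters; this happens in the parent process only."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def count_all(root, processes=None, chunksize=64):
    """Fan the files out to a process pool and merge the returned Counters."""
    with Pool(processes) as pool:
        # imap_unordered hands results back as workers finish; order does not
        # matter here because the counts are simply summed.
        return merge_counts(pool.imap_unordered(count_one_file, list_files(root), chunksize))


def write_tsv(counts, out_path):
    """Write category<TAB>count lines, highest count first, ties by category."""
    with open(out_path, mode="w", encoding="utf8") as f_out:
        for category, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
            f_out.write(f"{category}\t{n}\n")


# The __main__ guard is required on Windows, where worker processes re-import this module.
# Hypothetical usage: python count_categories.py <root_dir> <out.tsv>
if __name__ == "__main__" and len(sys.argv) == 3:
    write_tsv(count_all(sys.argv[1]), sys.argv[2])
```

Returning one Counter per file keeps the workers independent. The cost is that every Counter is pickled and sent back to the parent, so for many tiny files the threaded version may well be just as fast.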

Answer 1 (accepted), 2019-12-02 18:39:21
