
Python: how to count line item frequency in a large number of files in subdirectories using multiprocessing

I have a program that serially counts how often each line occurs across a set of files. The files can be in sub-directories. Each file contains a list of Wikipedia categories, one category per line. I would like to know the frequency count of the categories across all files. For example, a file called Los Angeles.txt might have the following lines in it:

City
Location

And I want a tab-separated file written out that lists each category with the number of times it was used, in descending order of count:

Person 3494
City 2000
Location 1

My current code is:

import os
from collections import defaultdict
from operator import itemgetter

dir = "C:\\Wikipedia\\Categories"
l = [os.path.join(root, name) for root, _, files in os.walk(dir) for name in files]

d = defaultdict(int)

for file in l:
    with open(file, encoding="utf8") as f_in:
        for line in f_in:
            line = line.strip()    # Removes surrounding \n as well as spaces.
            if line != "":
                d[line] += 1

with open("C:\\Wikipedia\\category_counts.tsv", mode="w", encoding="utf8") as f_out:    
    for k2, v2 in sorted(d.items(), key=lambda kv: kv[1], reverse=True):
        f_out.write(k2 + "\t" + str(v2) + "\n")

My question is: how can I use the Pool of the multiprocessing module to do this in parallel?

The issues that I'm wondering about are:

  • Does the multiprocessing module only do processes, or does it do threads as well, since this is an IO-bound problem?
  • Can the Counter functionality from collections be incorporated in some way?
  • Does os.walk already execute in a parallel manner?
  • Is there some sort of dictionary functionality in multiprocessing, similar to multiprocessing.Value, multiprocessing.Queue and multiprocessing.Array, that I should be using to share the counts between the processes and thereby get an aggregated frequency count at the end? Can you use a normal Python dict with multiprocessing, or will there be a sharing violation and corrupted data?

Can anyone help with a code example?
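On the Counter point, a minimal sketch of how partial counts can be merged, assuming the lines of each file have already been read. Counter comes from the collections module; the helper name count_categories and the two in-memory "files" are made up for illustration.

```python
from collections import Counter

def count_categories(lines):
    """Count the non-empty, stripped lines of one file (hypothetical helper)."""
    return Counter(line.strip() for line in lines if line.strip())

# Two in-memory lists stand in for the contents of real category files.
file_a = ["City\n", "Location\n"]
file_b = ["City\n", "Person\n", "\n"]

total = Counter()
for lines in (file_a, file_b):
    # Counter.update() adds the counts together instead of overwriting them.
    total.update(count_categories(lines))

# Sort by descending count, then by category name for a stable order.
for category, count in sorted(total.items(), key=lambda kv: (-kv[1], kv[0])):
    print(category + "\t" + str(count))
```

Because each partial Counter is independent, the same merge step works whether the partial counts come from a serial loop, from threads, or from worker processes.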

I think I have figured it out (might be wrong, but it seems to work):

import os
from collections import Counter
from datetime import datetime
import concurrent.futures

# Walk the directory tree and store the path of every Wikipedia Article Category file in a list. 1 second.
dir = "D:\\Downloads\\WikipediaAFLatest\\Categories"
l = [os.path.join(root, name) for root, _, files in os.walk(dir) for name in files]
print('After file list')

t1 = datetime.now()

d = Counter()

def do_one_file(filename):
    # Count into a local Counter so that no state is shared between threads.
    counts = Counter()
    with open(filename, encoding="utf8") as f_in:
        for line in f_in:
            line = line.strip()    # Removes surrounding \n as well as spaces.
            if line != "":
                counts[line] += 1
    return counts

# For each article (file) loop through all the categories.
with concurrent.futures.ThreadPoolExecutor() as executor:
    # Merge the per-file counts in the main thread. Iterating over the results also re-raises any exception from a worker.
    for counts in executor.map(do_one_file, l):
        d.update(counts)

t2 = datetime.now()
print('After frequency counts: ' + str(t2 - t1))                    

t1 = datetime.now()
with open("D:\\Downloads\\WikipediaAFLatest\\category_counts_threaded.tsv", mode="w", encoding="utf8") as f_out:    
    for k2, v2 in sorted(d.items(), key=lambda kv: (-kv[1], kv[0])):    # Descending sort on count, ascending sort on category
        f_out.write(k2 + "\t" + str(v2) + "\n")

t2 = datetime.now()
print('After sorted frequency counts: ' + str(t2 - t1))
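When several threads update one shared dictionary, an increment such as d[line] += 1 is a read-modify-write and is not guaranteed to be atomic, so updates can be lost. One possible way to keep a single shared dict is sketched below, assuming a threading.Lock around the merge step. The names shared_counts, counts_lock and add_lines are hypothetical, and in-memory batches stand in for files.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

shared_counts = defaultdict(int)   # hypothetical shared dict
counts_lock = threading.Lock()     # guards every write to shared_counts

def add_lines(lines):
    """Count one batch locally, then merge it into the shared dict under the lock."""
    local = defaultdict(int)
    for line in lines:
        line = line.strip()
        if line:
            local[line] += 1
    with counts_lock:              # only one thread merges at a time
        for key, value in local.items():
            shared_counts[key] += value

batches = [["City\n", "Location\n"], ["City\n"], ["Person\n", "City\n"]]
with ThreadPoolExecutor(max_workers=3) as executor:
    # list() drains the iterator so that exceptions raised in workers surface here.
    list(executor.map(add_lines, batches))

print(dict(shared_counts))
```

Counting locally first keeps the lock held only briefly per file rather than once per line.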

The answers were:

  • I should use threading instead of multiprocessing.
  • Threads execute in the same process and can therefore access the same variables. concurrent.futures map does not lock variables automatically, though: d[line] += 1 is a read-modify-write that is not guaranteed to be atomic, so updates to a shared dict should either be guarded with a lock or avoided by having each call return its own counts. Processes each run in a separate process and therefore can't access the current process' variables. For them, use a manager as per this answer.
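For the process-based variant the question originally asked about, here is a rough sketch using multiprocessing.Pool. It sidesteps shared state by letting every worker return its own Counter, which the parent process merges. The function names count_file, list_files and count_tree are assumptions for illustration, and the root directory is taken from the command line. Whether this beats threads for an IO-bound job like this one is not something the sketch demonstrates.

```python
import os
import sys
from collections import Counter
from multiprocessing import Pool

def count_file(path):
    """Return a Counter of the non-empty lines in one file."""
    counts = Counter()
    with open(path, encoding="utf8") as f_in:
        for line in f_in:
            line = line.strip()
            if line:
                counts[line] += 1
    return counts

def list_files(root):
    """Collect every file path below root with os.walk."""
    return [os.path.join(folder, name)
            for folder, _, files in os.walk(root) for name in files]

def count_tree(root, processes=None):
    """Count categories across all files below root using worker processes."""
    total = Counter()
    with Pool(processes=processes) as pool:
        # Each worker returns its own Counter; merging happens only in the parent.
        for counts in pool.imap_unordered(count_file, list_files(root), chunksize=64):
            total.update(counts)
    return total

# The guard is needed on Windows, where worker processes re-import this module.
if __name__ == "__main__" and len(sys.argv) > 1:
    totals = count_tree(sys.argv[1])
    for category, count in sorted(totals.items(), key=lambda kv: (-kv[1], kv[0])):
        print(category + "\t" + str(count))
```

Run as a script, it takes the root directory as its only argument and prints the tab-separated counts to standard output.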

