
Python: how to count line frequency across a large number of files in subdirectories using multiprocessing

I have a program that serially counts how often each line occurs across a set of files. The files can be in sub-directories. Each file contains a list of Wikipedia categories, with each line being a category. I would like to know the frequency count of the categories across all files. For example, a file called Los Angeles.txt might have the following lines in it:

City
Location

And I want a tab-separated file written out that lists how many times each category was used, in descending order of count:

Person 3494
City 2000
Location 1

My current code is:

import os
from collections import defaultdict

# Named root_dir rather than dir, so the built-in dir() is not shadowed.
root_dir = "C:\\Wikipedia\\Categories"
l = [os.path.join(root, name) for root, _, files in os.walk(root_dir) for name in files]

d = defaultdict(int)

for file in l:
    with open(file, encoding="utf8") as f_in:
        for line in f_in:
            line = line.strip()    # Removes surrounding \n as well as spaces.
            if line != "":
                d[line] += 1

with open("C:\\Wikipedia\\category_counts.tsv", mode="w", encoding="utf8") as f_out:    
    for k2, v2 in sorted(d.items(), key=lambda kv: kv[1], reverse=True):
        f_out.write(k2 + "\t" + str(v2) + "\n")

My question is: how can I use the Pool of the multiprocessing module to do this in parallel?

The issues that I'm wondering about are:

  • Does the multiprocessing module only do processes or does it do threads as well, since this is an IO bound problem?
  • Can the Counter functionality from collections be incorporated in some way?
  • Does os.walk already execute in a parallel manner?
  • Is there some sort of dictionary functionality in multiprocessing, similar to multiprocessing.Value, multiprocessing.Queue and multiprocessing.Array, that I should be using to share the counts between the processes and thereby get an aggregated frequency count at the end? Can you use a normal Python dict with multiprocessing, or will there be a sharing violation and corrupted data?

Can anyone help with a code example?
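For reference, a minimal sketch of how a Pool-based version could look. It is one possible approach, not a definitive answer. It assumes each worker counts one file into its own collections.Counter and returns it, and that the parent process adds the partial counts together, so no shared dictionary is needed. The function names (list_files, count_file, merge_counts, write_tsv) and the throwaway demo directory are made up for illustration. On the other bullets: os.walk itself runs serially, Counter lives in collections, and multiprocessing also offers a thread-backed pool as multiprocessing.pool.ThreadPool.

```python
import os
import tempfile
from collections import Counter
from multiprocessing import Pool


def list_files(root_dir):
    """Return the path of every file below root_dir (os.walk itself is serial)."""
    return [os.path.join(root, name)
            for root, _, files in os.walk(root_dir)
            for name in files]


def count_file(path):
    """Count the non-empty, stripped lines of one file in a private Counter."""
    counts = Counter()
    with open(path, encoding="utf8") as f_in:
        for line in f_in:
            line = line.strip()
            if line:
                counts[line] += 1
    return counts


def merge_counts(partial_counts):
    """Add up per-file Counters; this runs in the parent process only."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


def write_tsv(total, out_path):
    """Write 'category<TAB>count' lines, highest count first, ties by name."""
    with open(out_path, mode="w", encoding="utf8") as f_out:
        for category, n in sorted(total.items(), key=lambda kv: (-kv[1], kv[0])):
            f_out.write(f"{category}\t{n}\n")


if __name__ == "__main__":
    # Demo on a throwaway directory (an assumption for illustration);
    # substitute the real category folder and output path.
    with tempfile.TemporaryDirectory() as root_dir:
        os.makedirs(os.path.join(root_dir, "sub"))
        samples = {"Los Angeles.txt": "City\nLocation\n",
                   os.path.join("sub", "Paris.txt"): "City\n"}
        for name, text in samples.items():
            with open(os.path.join(root_dir, name), "w", encoding="utf8") as f:
                f.write(text)

        paths = list_files(root_dir)
        with Pool() as pool:
            # Workers return Counters; nothing is shared between processes.
            total = merge_counts(pool.imap_unordered(count_file, paths, chunksize=64))
        write_tsv(total, os.path.join(root_dir, "category_counts.tsv"))
        print(total.most_common())
```

Since count_file and merge_counts do not depend on the pool type, the same pair should also work with multiprocessing.pool.ThreadPool or a concurrent.futures executor, which may suit an IO-bound workload better.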

I think I have figured it out (might be wrong, but it seems to work):

import os
from collections import defaultdict
from datetime import datetime
import concurrent.futures

# Loop through all the Wikipedia Article Category files and store their path and filename in a list. 1 second.
# Named root_dir rather than dir, so the built-in dir() is not shadowed.
root_dir = "D:\\Downloads\\WikipediaAFLatest\\Categories"
l = [os.path.join(root, name) for root, _, files in os.walk(root_dir) for name in files]
print('After file list')

t1 = datetime.now()

d = defaultdict(int)

def do_one_file(filename):
    # Count into a dict private to this call, so no state is shared between threads.
    counts = defaultdict(int)
    with open(filename, encoding="utf8") as f_in:
        for line in f_in:
            line = line.strip()    # Removes surrounding \n as well as spaces.
            if line != "":
                counts[line] += 1
    return counts

# For each article (file) loop through all the categories.
with concurrent.futures.ThreadPoolExecutor() as executor:
    # Run do_one_file for each file in the list l; merge the per-file counts in the main thread.
    for counts in executor.map(do_one_file, l):
        for category, n in counts.items():
            d[category] += n

t2 = datetime.now()
print('After frequency counts: ' + str(t2 - t1))                    

t1 = datetime.now()
with open("D:\\Downloads\\WikipediaAFLatest\\category_counts_threaded.tsv", mode="w", encoding="utf8") as f_out:    
    for k2, v2 in sorted(d.items(), key=lambda kv: (-kv[1], kv[0])):    # Reverse sort on count, normal sort on category
        f_out.write(k2 + "\t" + str(v2) + "\n")

t2 = datetime.now()
print('After sorted frequency counts: ' + str(t2 - t1))

The answers were:

  • I should use threading instead of multiprocessing.
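  • Threads execute in the same process and can therefore access the same variables. Note that the map of concurrent.futures does not lock anything by itself: an update such as d[line] += 1 on a shared dict is a read-modify-write that is not guaranteed to be atomic, so shared counts should be guarded with a lock or, more simply, each call should return its own counts for the main thread to merge. Processes each get their own memory space and therefore can't access the parent process's variables. For them, use a manager as per this answer.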
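To illustrate the point about shared access from threads, here is a small sketch of guarding a shared Counter with an explicit threading.Lock. It assumes in-memory lists of lines as stand-ins for file contents, and the names counts, counts_lock, add_lines and batches are hypothetical. Each task tallies its own batch first and only takes the lock for the brief merge, which keeps contention low.

```python
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

counts = Counter()               # shared by all worker threads
counts_lock = threading.Lock()   # guards every update of counts


def add_lines(lines):
    """Tally one batch of lines locally, then merge under the lock."""
    local = Counter(line.strip() for line in lines if line.strip())
    with counts_lock:
        counts.update(local)


# Hypothetical in-memory stand-ins for the contents of three files.
batches = [["City\n", "Location\n"], ["City\n", "\n"], ["Person\n", "City\n"]]

with ThreadPoolExecutor() as executor:
    # list() drains the iterator so an exception in a worker is re-raised here.
    list(executor.map(add_lines, batches))

print(counts.most_common())
```

Returning per-task counts and merging them in the main thread avoids the lock altogether; the lock variant is mainly useful when the shared structure has to be updated while work is still in progress.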

