I have a program that serially counts the frequency of the lines used across a set of files. Files can be in sub-directories. Each file contains a list of Wikipedia categories, one category per line. I would like to know the frequency count of the categories across all files. For example, a file called Los Angeles.txt
might have the following lines in it:
City
Location
And I want a tab-separated file written out with the number of times each category was used, in descending order:
Person 3494
City 2000
Location 1
My current code is:
import os
from collections import defaultdict

dir = "C:\\Wikipedia\\Categories"
l = [os.path.join(root, name) for root, _, files in os.walk(dir) for name in files]

d = defaultdict(int)
for file in l:
    with open(file, encoding="utf8") as f_in:
        for line in f_in:
            line = line.strip()  # Removes surrounding \n as well as spaces.
            if line != "":
                d[line] += 1

with open("C:\\Wikipedia\\category_counts.tsv", mode="w", encoding="utf8") as f_out:
    for k2, v2 in sorted(d.items(), key=lambda kv: kv[1], reverse=True):
        f_out.write(k2 + "\t" + str(v2) + "\n")
My question is: how can I use the Pool of the multiprocessing module to do this in a parallel way?
The issues that I'm wondering about are:
- Does the multiprocessing module only do processes, or does it do threads as well, since this is an IO-bound problem?
- Can the Counter functionality from collections be incorporated in some way?
- Does os.walk already execute in a parallel manner?
- Is there something in multiprocessing, similar to multiprocessing.Value, multiprocessing.Queue and multiprocessing.Array, that I should be using to share the counts between the processes and thereby get an aggregated frequency count at the end?
- Can you use a normal Python dict with multiprocessing, or will there be a sharing violation and corrupted data?
Can anyone help with a code example?
I think I have figured it out (might be wrong, but it seems to work):
import os
from collections import defaultdict
from datetime import datetime
import concurrent.futures

# Loop through all the Wikipedia article category files and store their paths in a list. 1 second.
dir = "D:\\Downloads\\WikipediaAFLatest\\Categories"
l = [os.path.join(root, name) for root, _, files in os.walk(dir) for name in files]
print('After file list')

t1 = datetime.now()
d = defaultdict(int)

def do_one_file(filename):
    with open(filename, encoding="utf8") as f_in:
        for line in f_in:
            line = line.strip()  # Removes surrounding \n as well as spaces.
            if line != "":
                d[line] += 1
    return True

# For each article (file), loop through all the categories.
with concurrent.futures.ThreadPoolExecutor() as executor:
    results = executor.map(do_one_file, l)  # Run do_one_file for each file in l. No result is used, but the shared dict d is updated.

t2 = datetime.now()
print('After frequency counts: ' + str(t2 - t1))

t1 = datetime.now()
with open("D:\\Downloads\\WikipediaAFLatest\\category_counts_threaded.tsv", mode="w", encoding="utf8") as f_out:
    for k2, v2 in sorted(d.items(), key=lambda kv: (-kv[1], kv[0])):  # Reverse sort on count, normal sort on category.
        f_out.write(k2 + "\t" + str(v2) + "\n")
t2 = datetime.now()
print('After sorted frequency counts: ' + str(t2 - t1))
The answers were:
- Use threading instead of multiprocessing, since this is an IO-bound problem; concurrent.futures.ThreadPoolExecutor provides this, as in the code above.
- Threads run in the same process and share its memory, and CPython's global interpreter lock keeps the shared dict itself from being corrupted, so the shared-dict approach works here (strictly speaking, d[line] += 1 is a read-modify-write that is not guaranteed to be atomic). Processes each get a new address space and therefore can't access the current process's variables; for them, use a manager as per this answer.
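For the process case, a hedged sketch of the manager route (the names tally_file and count_with_manager are illustrative, not from any answer): each worker tallies its file into a local dict first, then folds the local counts into a Manager-backed dict while holding a Manager lock, because a read-modify-write on the shared proxy is not atomic.

```python
from multiprocessing import Manager, Pool

def tally_file(args):
    """Worker: count one file locally, then fold into the shared dict under the lock."""
    filename, shared_counts, lock = args
    local = {}
    with open(filename, encoding="utf8") as f_in:
        for line in f_in:
            line = line.strip()
            if line:
                local[line] = local.get(line, 0) + 1
    with lock:  # get-then-set on the proxy is not atomic, so serialize the merge
        for category, n in local.items():
            shared_counts[category] = shared_counts.get(category, 0) + n

def count_with_manager(filenames):
    """Aggregate counts across Pool workers through a Manager-backed dict."""
    with Manager() as manager:
        shared_counts = manager.dict()
        lock = manager.Lock()
        with Pool() as pool:
            pool.map(tally_file, [(f, shared_counts, lock) for f in filenames])
        return dict(shared_counts)  # plain copy that survives after the Manager shuts down
```

Tallying locally and merging once per file keeps traffic to the manager process low; updating the proxy once per line would be far slower than the threaded version.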