I have created a class that loops through a file and, after checking whether a line is valid, writes that line to another file. Checking each line is a lengthy process, which makes the whole run very slow. I need to add either threading or multiprocessing to the process_file function, but I do not know which library is better suited to speeding it up or how to implement it.
class FileProcessor:
    def process_file(self):
        with open('file.txt', 'r') as f:
            with open('outfile.txt', 'w') as output:
                for line in f:
                    # There's some string manipulation code here...
                    validate = self.do_stuff(line)
                    # If true write line to outfile.txt
                    if validate:
                        output.write(line)

    def do_stuff(self, line):
        # Does stuff...
        pass
Extra Information: The code goes through a proxy list checking whether it is online. This is a lengthy and time-consuming process.
Thank you for any insight or help!
The code goes through a proxy list checking whether it is online
It sounds like what takes a long time is connecting to the internet, which means your task is I/O-bound, so threads can help speed it up: while one thread waits on the network, the others keep working. Multiple processes would also work, but they can be harder to use.
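As a rough sketch of the thread-based approach, the snippet below uses concurrent.futures.ThreadPoolExecutor from the standard library. The check_proxy function and the filter_lines helper are hypothetical names: check_proxy only sleeps to imitate network latency and is a stand-in for whatever the real check does. The lines argument is assumed to be a list already read into memory.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def check_proxy(line):
    # Hypothetical stand-in for the real check: sleep to mimic network
    # latency, then treat any non-blank line as "online".
    time.sleep(0.05)
    return bool(line.strip())

def filter_lines(lines, max_workers=8):
    # executor.map runs check_proxy in worker threads and returns the
    # verdicts in the same order as the input lines.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        verdicts = list(executor.map(check_proxy, lines))
    return [line for line, ok in zip(lines, verdicts) if ok]

if __name__ == "__main__":
    sample = ["1.2.3.4:8080\n", "\n", "5.6.7.8:3128\n"]
    print(filter_lines(sample))
```

Because the threads spend nearly all their time waiting, the total run time should be closer to the slowest batch of checks than to the sum of all of them.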
This seems like a job for multiprocessing.Pool.map.
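import multiprocessing

def do_stuff(item):
    return "I did something with %s\n" % item

def process_file(filename):
    pool = multiprocessing.Pool(4)
    # imap_unordered reads its input lazily, so keep the input file open
    # until every result has been written.
    with open(filename) as infile, open("output.txt", "w") as outfile:
        for r in pool.imap_unordered(do_stuff, infile):
            outfile.write(r)
    pool.close()
    pool.join()

if __name__ == "__main__":
    # The guard keeps worker processes from re-running this call on import.
    process_file(__file__)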
You can also use multiprocessing.dummy.Pool instead if you want to use threads (which might be preferable in this case since you are I/O-bound).
Essentially you are passing an iterable to imap_unordered (or imap if order matters) and farming out portions of it to other processes (or threads if using dummy). You can tune the chunksize of the map to help with efficiency.
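To make the difference between imap and imap_unordered concrete, here is a small sketch using the thread-backed multiprocessing.dummy.Pool. The helper names run_ordered and run_unordered and the default chunksize of 2 are illustrative assumptions, not recommendations.

```python
from multiprocessing.dummy import Pool  # thread-backed Pool with the same API

def do_stuff(item):
    # Placeholder work, in the spirit of the example above.
    return "I did something with %s" % item

def run_ordered(items, workers=4, chunksize=2):
    # imap yields results in input order; chunksize sets how many items
    # are handed to a worker at a time.
    with Pool(workers) as pool:
        return list(pool.imap(do_stuff, items, chunksize))

def run_unordered(items, workers=4, chunksize=2):
    # imap_unordered yields results as chunks finish, so the order of
    # the returned list is not guaranteed.
    with Pool(workers) as pool:
        return list(pool.imap_unordered(do_stuff, items, chunksize))
```

A larger chunksize cuts down on per-item dispatch overhead, which matters most when each item is cheap. For slow network checks a chunksize of 1 is usually fine.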
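If you want to encapsulate this in a class, multiprocessing.dummy is the simplest choice, because threads never need to pickle the instance method. (Python 2 cannot pickle instance methods at all; Python 3 can, but only if the instance itself is picklable.)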
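A class-based version might look something like the sketch below. It assumes threads via multiprocessing.dummy.Pool, a placeholder do_stuff that simply keeps non-blank lines, and an input file small enough to read into memory. The workers, in_path and out_path parameters are illustrative additions, not part of the original class.

```python
from multiprocessing.dummy import Pool  # threads, so nothing is pickled

class FileProcessor:
    def __init__(self, workers=4):
        self.workers = workers

    def do_stuff(self, line):
        # Hypothetical validity check: keep non-blank lines.
        return bool(line.strip())

    def process_file(self, in_path, out_path):
        with open(in_path) as infile:
            lines = infile.readlines()
        with Pool(self.workers) as pool:
            # A bound method can be handed straight to a thread pool.
            verdicts = pool.map(self.do_stuff, lines)
        with open(out_path, "w") as outfile:
            # Write only the lines that passed, in their original order.
            outfile.writelines(
                line for line, ok in zip(lines, verdicts) if ok
            )
```

Calling FileProcessor().process_file('file.txt', 'outfile.txt') would then mirror the original loop, with the checks running concurrently.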
With map you do have to wait until the whole map finishes before you can process the results, whereas imap and imap_unordered hand results back as they become available. Alternatively, you could write the results inside do_stuff instead. Just be sure to open the file in append mode, and you'll likely want to lock the file.
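If the workers write their own results, one possible shape is sketched here. It assumes threads (multiprocessing.dummy.Pool), so a plain threading.Lock is enough; separate processes would need a multiprocessing.Lock shared with the workers instead. The process_lines helper and the (line, out_path) tuple argument are hypothetical conveniences.

```python
import threading
from multiprocessing.dummy import Pool  # thread-backed pool

# One lock shared by all worker threads.
write_lock = threading.Lock()

def do_stuff(args):
    line, out_path = args
    result = "I did something with %s\n" % line.strip()
    # Serialize writes so output from different threads never interleaves.
    with write_lock:
        with open(out_path, "a") as outfile:  # append mode
            outfile.write(result)

def process_lines(lines, out_path, workers=4):
    with Pool(workers) as pool:
        # map blocks until every worker has written its result.
        pool.map(do_stuff, [(line, out_path) for line in lines])
```

The output order depends on which thread finishes first, so this only suits cases where line order does not matter.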