
Performance: fastest way of reading in files with Python

So I have about 400 files ranging from 10 KB to 56 MB in size, with file types .txt/.doc(x)/.pdf/.xml, and I have to read them all. The way I read in the files is basically:

#for txt files
with open("TXT\\" + path, 'r') as content_file:
    content = content_file.read().split(' ')

#for docx files using python-docx
from docx import Document
doc = Document(path)  # open the .docx file
contents = '\n'.join([para.text for para in doc.paragraphs]).encode("ascii","ignore").decode("utf-8").split(' ')

#for pdf files using PyPDF2
import PyPDF2
pdf = PyPDF2.PdfFileReader(open(path, "rb"))  # open the .pdf file
content = ""
for i in range(0, pdf.getNumPages()):
    content += pdf.getPage(i).extractText() + "\n"
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
contents = content.encode("ascii","ignore").decode("utf-8").split(' ')

#for xml files using lxml
from lxml import etree
tree = etree.parse(path)
contents = etree.tostring(tree, encoding='utf8', method='text')
contents = contents.decode("utf-8").split(' ')

But I notice that even reading 30 text files of under 50 KB each and doing operations on them takes 41 seconds, while reading a single 56 MB text file takes me only 9 seconds. So I'm guessing that it's the file I/O that's slowing me down rather than my program.
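
To sanity-check that guess, I could time the raw read separately from the per-file work with something roughly like this (list_of_txt_paths is just whatever list of .txt filenames I'm looping over, and split() stands in for the real processing):

import time

io_time = cpu_time = 0.0
for path in list_of_txt_paths:          # hypothetical list of .txt filenames
    t0 = time.time()
    with open("TXT\\" + path, 'r') as f:
        raw = f.read()                  # raw file I/O
    t1 = time.time()
    words = raw.split(' ')              # stand-in for the real per-file work
    t2 = time.time()
    io_time += t1 - t0
    cpu_time += t2 - t1
print("I/O: %.2f s, processing: %.2f s" % (io_time, cpu_time))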

Any idea how to speed this process up? Maybe break each file type out into one of 4 different threads? But how would you go about doing that, since they would all be sharing the same list, and that single list will be written to a file when they are done?

If you're blocked on file I/O, as you suspect, there's probably not much you can do.

But parallelizing across different threads might help if you have great bandwidth but terrible latency, especially if you're dealing with, say, a networked filesystem or a multi-platter logical drive. So it can't hurt to try.

But there's no reason to do it per file type; just use a single pool to handle all your files. For example, using the futures module:*

import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = executor.map(process_file, list_of_filenames)

A ThreadPoolExecutor is slightly smarter than a basic thread pool, because it lets you build composable futures, but here you don't need any of that, so I'm just using it as a basic thread pool because Python doesn't have one of those.**

The constructor creates 4 threads, and all the queues and anything else needed to manage putting tasks on those threads and getting results back.

Then, the map method just goes through each filename in list_of_filenames, creates a task out of calling process_file on that filename, submits it to the pool, and then waits for all of the tasks to finish.
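
Here, process_file is whatever per-file function you already have. A minimal hypothetical version, dispatching on extension and reusing your own txt/xml logic (the docx and pdf branches would work the same way), might look like:

import os
from lxml import etree

def process_file(path):
    # Hypothetical worker: return the list of words for a single file.
    ext = os.path.splitext(path)[1].lower()
    if ext == '.txt':
        with open(path, 'r') as f:
            return f.read().split(' ')
    elif ext == '.xml':
        tree = etree.parse(path)
        text = etree.tostring(tree, encoding='utf8', method='text')
        return text.decode("utf-8").split(' ')
    # ... handle .docx and .pdf here the same way as in your question ...
    return []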

In other words, this is the same as writing:

results = [process_file(filename) for filename in list_of_filenames]

… except that it uses four threads to process the files in parallel.
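
And because map yields the results in the same order as the input filenames, the threads never have to share a list at all; you can collect everything in the main thread and write it out once at the end. A rough sketch (the output filename is made up):

import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(process_file, list_of_filenames))

# All workers are done by this point; write the combined results from the main thread.
with open("all_words.txt", "w") as out:
    for words in results:
        out.write(' '.join(words) + '\n')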

There are some nice examples in the docs if this isn't clear enough.


* If you're using Python 2.x, you'll need to install a backport before you can use this. Or you can use multiprocessing.dummy.Pool instead, as noted below.

** Actually, it does, in multiprocessing.dummy.Pool, but that's not very clearly documented.
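
For what it's worth, the equivalent with multiprocessing.dummy.Pool (which gives you a thread pool behind the multiprocessing.Pool interface) would look roughly like this:

from multiprocessing.dummy import Pool  # threads, despite living under multiprocessing

pool = Pool(4)                                        # 4 worker threads
results = pool.map(process_file, list_of_filenames)   # blocks until every file is done
pool.close()
pool.join()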
