Python: How do I go through multiple lines in a file at once?

I have created a class that loops through a file and, after checking whether a line is valid, writes that line to another file. Checking each line is a lengthy process, which makes the whole thing very slow. I need to add either threading or multiprocessing to the process_file function, but I don't know which library is best suited for speeding this function up or how to implement it.

class FileProcessor:
    def process_file(self):
        with open('file.txt', 'r') as f:
            with open('outfile.txt', 'w') as output:
                for line in f:
                    # There's some string manipulation code here...
                    valid = self.do_stuff(line)
                    # If valid, write line to outfile.txt

    def do_stuff(self, line):
        # Does stuff...
        pass

Extra information: the code goes through a proxy list, checking whether each proxy is online. This is a lengthy, time-consuming process.

Thank you for any insight or help!

The code goes through a proxy list checking whether it is online

It sounds like what takes a long time is connecting to the internet, meaning your task is IO bound and thus threads can help speed it up. Multiple processes are always applicable but can be harder to use.

This seems like a job for multiprocessing.Pool.imap_unordered.

import multiprocessing

def process_file(filename):
    with open(filename) as fd, open("output.txt", "w") as out:
        # The pool context manager makes sure workers are cleaned up.
        with multiprocessing.Pool(4) as pool:
            # fd is itself an iterable of lines, so it can be passed directly.
            for r in pool.imap_unordered(do_stuff, fd):
                out.write(r)

def do_stuff(item):
    return "I did something with %s\n" % item

if __name__ == "__main__":  # required on platforms that spawn workers
    process_file(__file__)

You can also use multiprocessing.dummy.Pool instead if you want to use threads (which might be preferable in this case since you are I/O bound).

Essentially you are passing an iterable to imap_unordered (or imap if order matters) and farming out portions of it to other processes (or threads if using dummy). You can tune the chunksize of the map to help with efficiency.
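A minimal sketch of the thread-backed variant with chunksize tuning. check_line and the sample lines are placeholders standing in for the real proxy check:

```python
from multiprocessing.dummy import Pool  # thread-backed Pool, same API

def check_line(line):
    # Placeholder for the real per-line work; a proxy checker would
    # make a network request here instead.
    return line.upper()

lines = ["proxy1\n", "proxy2\n", "proxy3\n"]

with Pool(4) as pool:
    # chunksize batches items per task dispatch; larger chunks reduce
    # scheduling overhead when there are many cheap items.
    results = list(pool.imap_unordered(check_line, lines, chunksize=2))

print(sorted(results))
```

Because imap_unordered yields results as they finish, sorting (or switching to imap) is needed whenever the original order matters.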

If you want to encapsulate this into a class, you'll need to use multiprocessing.dummy . (Otherwise it can't pickle the instance method.)
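A hedged sketch of what that class-based version could look like; FileProcessor here is a hypothetical stand-in for the asker's class, with do_stuff simplified to a trivial check:

```python
from multiprocessing.dummy import Pool  # threads: nothing is pickled

class FileProcessor:
    """Hypothetical sketch of the asker's class using a thread pool."""

    def __init__(self, workers=4):
        self.workers = workers

    def do_stuff(self, line):
        # Stand-in for the real validation (e.g. a proxy check).
        return line.strip().isdigit()

    def process_lines(self, lines):
        with Pool(self.workers) as pool:
            # Bound methods work here because threads share memory,
            # so the method never has to cross a process boundary.
            return list(pool.map(self.do_stuff, lines))

fp = FileProcessor()
print(fp.process_lines(["123\n", "abc\n"]))  # [True, False]
```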

You do have to wait until the map finishes before you can process the results, although you could write the results in do_stuff instead -- just be sure to open the file in append mode, and you'll likely want to lock the file.
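A sketch of that write-from-the-worker approach using threads, with a threading.Lock guarding the append. check_and_write and the sample lines are illustrative placeholders:

```python
import threading
from multiprocessing.dummy import Pool

write_lock = threading.Lock()

def check_and_write(line):
    # Stand-in for the slow validation step.
    if line.strip():
        # Serialize writes so concurrent workers don't interleave output.
        with write_lock:
            with open("output.txt", "a") as out:  # append mode
                out.write(line)

lines = ["alpha\n", "\n", "beta\n"]
open("output.txt", "w").close()  # truncate before the run

with Pool(4) as pool:
    pool.map(check_and_write, lines)
```

Note that with worker processes (rather than threads) a plain threading.Lock would not be shared; this pattern is simplest with multiprocessing.dummy.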
