Reading A Big File With Python

I'm trying to read the files in a directory that currently holds 10 text files. The number of files grows over time, and the total size is now around 400 MB.

File contents are in the format:

student_name:student_ID:date_of_join:anotherfield1:anotherfield2

I need to search every line for a student ID entered by the user and, in case of a match, print out the whole line. Here's what I've tried.

import os

findvalue = "student_id"  # user-supplied alphanumeric ID to search for
directory = "./RecordFolder"
for filename in os.listdir(directory):
    with open(os.path.join(directory, filename)) as f:
        for line in f:  # iterate lazily, one line at a time
            if findvalue in line:
                print(line, end="")  # the line already ends with a newline

This works, but it takes a lot of time. How can I reduce the run time?

When text files become too slow, you need to start looking at databases. One of the main purposes of a database is to handle IO from persistent storage intelligently.

Depending on the needs of your application, SQLite may be a good fit. I suspect this is what you want, given that you don't seem to have a gargantuan data set. From there, it's just a matter of making database API calls and letting SQLite handle the lookups; it does so much better than you! With an index on the student ID column, a lookup no longer has to scan every line.
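To make the SQLite suggestion concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name records, the index name idx_student_id, and the helpers build_index and find_lines are made up for illustration. It also assumes the ID is always the second colon-separated field and that you want an exact match on it, which is stricter than the substring test in the question.

```python
import os
import sqlite3


def iter_rows(directory):
    """Yield (student_id, full_line) for every well-formed line in directory."""
    for filename in os.listdir(directory):
        with open(os.path.join(directory, filename)) as f:
            for line in f:
                line = line.rstrip("\n")
                fields = line.split(":")
                if len(fields) >= 2:  # skip blank or malformed lines
                    yield fields[1], line  # assumption: ID is the second field


def build_index(directory, db_path):
    """One-time load of all records into SQLite, indexed by student ID."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS records (student_id TEXT, line TEXT)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_student_id ON records (student_id)")
    conn.executemany("INSERT INTO records VALUES (?, ?)", iter_rows(directory))
    conn.commit()
    return conn


def find_lines(conn, student_id):
    """Return the full lines whose ID field equals student_id."""
    cur = conn.execute(
        "SELECT line FROM records WHERE student_id = ?", (student_id,)
    )
    return [row[0] for row in cur]
```

You would call build_index("./RecordFolder", "records.db") once (running it again would insert the same rows a second time), then call find_lines(conn, findvalue) for each search. The load still has to read all the files, but each later lookup goes through the index instead of rescanning 400 MB of text.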

If (for some strange reason) you really don't want to use a database, then consider further breaking up your data into a tree, if at all possible. For example, you could have a file for each letter of the alphabet in which you put student data. This should cut down on looping time since you're reducing the number of students per file. This is a quick hack, but I think you'll lose less hair if you go with a database.
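If you do go the file-splitting route, something like the following sketch could work. It buckets records by the first character of the student ID rather than the name, since the ID is what you search on; that choice, the "other.txt" fallback, and the function names bucket_name, split_into_buckets and find_in_bucket are all assumptions for illustration. It also assumes bucket_dir lives outside the source directory.

```python
import os


def bucket_name(student_id):
    """Map an ID to a bucket file name based on its first character."""
    first = student_id[:1].lower()
    return (first if first.isalnum() else "other") + ".txt"


def split_into_buckets(directory, bucket_dir):
    """Copy every record into one file per leading ID character."""
    os.makedirs(bucket_dir, exist_ok=True)
    handles = {}  # bucket file name -> open file handle
    try:
        for filename in os.listdir(directory):
            with open(os.path.join(directory, filename)) as f:
                for line in f:
                    fields = line.split(":")
                    if len(fields) < 2:  # skip blank or malformed lines
                        continue
                    name = bucket_name(fields[1])
                    if name not in handles:
                        path = os.path.join(bucket_dir, name)
                        handles[name] = open(path, "a")
                    # Make sure each record ends with exactly one newline.
                    handles[name].write(line.rstrip("\n") + "\n")
    finally:
        for handle in handles.values():
            handle.close()


def find_in_bucket(bucket_dir, student_id):
    """Scan only the one bucket that could contain student_id."""
    path = os.path.join(bucket_dir, bucket_name(student_id))
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return [
            line.rstrip("\n")
            for line in f
            if line.split(":")[1:2] == [student_id]
        ]
```

Each search then reads one bucket instead of the whole data set. How much that saves depends on how evenly the IDs spread across leading characters, and the buckets have to be kept up to date as new files arrive.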

IO is notoriously slow compared to computation, and given that you are dealing with large files, it's probably best to deal with the files line by line, as you already do. I don't see an obvious easy way to speed this up in Python.

Depending on how frequent your "hits" (i.e., findvalue in line) will be, you may decide to write them to a file so as not to be slowed down by console output. If relatively few items are found, though, it won't make much of a difference.
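Writing the hits to a file instead of the console might look roughly like this. The function name search_to_file and the out_path parameter are made up for illustration, and the sketch assumes out_path is outside the directory being searched so the output file is not scanned itself.

```python
import os


def search_to_file(directory, findvalue, out_path):
    """Write every line containing findvalue to out_path; return the hit count."""
    hits = 0
    with open(out_path, "w") as out:
        for filename in os.listdir(directory):
            with open(os.path.join(directory, filename)) as f:
                for line in f:
                    if findvalue in line:
                        # Normalise the ending so a final line without a
                        # newline does not run into the next hit.
                        out.write(line.rstrip("\n") + "\n")
                        hits += 1
    return hits
```

This keeps the same substring test as the question and only changes where the matches go, so it helps only when there are enough matches for console output to matter.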

I think for Python there's nothing obvious and major you can do. You could always explore other tools (such as grep or databases ...) as alternative approaches.

PS: There's no need for the else: pass; an if without an else is fine.
