Searching in files and folders is extremely slow in Python

Question

I'll try to explain what I want to achieve with my code:

I open a CSV file.
I pick up every element of the first row and search for this string in every file in every subdirectory, starting from rootdir.

With the design shown below, it is extremely slow, even with 2 directories and one file in each directory. It takes approximately 1 second for each entry in the main file, and I've got 400000 records in that file...

import csv
import os

rootdir = r'C:\Users\ST\Desktop\Sample'
f = open(r'C:\Users\ST\Desktop\inputIds.csv')
f.readline()  # skip the header line
snipscsv_f = csv.reader(f, delimiter='\t')  # delimiter must be a single character; a tab is assumed here
for row in snipscsv_f:
    print 'processing another ID'
    for subdir, dir, files in os.walk(rootdir):
        print 'processing another folder'
        for file in files:
            print 'processing another file'
            if 'csv' in file:  # I want only csv files to be processed
                ft = open(os.path.join(subdir, file))
                for ftrow in ft:
                    if row[0] in ftrow:
                        print row[0]
                ft.close()

Answer 1

I know you have a large CSV file, but it is still MUCH quicker to read it all into memory once and compare against that, rather than performing the os.walk for every entry.

Also, I'm not sure that Python is the best tool for this. You may find shell scripts (on Windows, PowerShell is the only decent tool) much faster for this kind of task. Anyway, you added Python tags so...

import csv
import fnmatch
import os

# load the csv with entries
with open('file_with_entries.csv','r') as f:
    readr = csv.reader(f)
    data = []
    for row in readr:
        data.extend(row)

# find csv files
rootdir = os.getcwd() # could be anywhere
matches = []
for root, dirs, files in os.walk(rootdir):
    for filename in fnmatch.filter(files, '*.csv'):
        matches.append(os.path.join(root, filename))

# find occurrences of each entry in each file
for eachcsv in matches:
    with open(eachcsv, 'r') as f:
        text = f.read()
        for entry in data:
            if entry in text:
                print("found %s in %s" % (entry,eachcsv))

I'm not sure how critical it is that you only read the first row of the entries file, but it would be reasonably easy to amend the code to do just that.
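For what it's worth, here is a rough sketch of one such amendment. It assumes that what you actually need is the first field of each row (which is what row[0] picks in the question's code) and that the entries file has one header line to skip. The function names load_ids, find_csv_files and search are made up for illustration, and the delimiter is a parameter because I can't tell from the question what yours is.

```python
import csv
import fnmatch
import os


def load_ids(path, delimiter=','):
    """Return the first field of every row after the header, without duplicates."""
    ids = set()
    with open(path, 'r') as f:
        reader = csv.reader(f, delimiter=delimiter)
        next(reader, None)  # assumption: one header line, like f.readline() in the question
        for row in reader:
            # skip blank rows and empty fields: '' is a substring of every text
            if row and row[0]:
                ids.add(row[0])
    return ids


def find_csv_files(rootdir):
    """Walk the tree once and return the paths of all *.csv files."""
    matches = []
    for root, dirs, files in os.walk(rootdir):
        for filename in fnmatch.filter(files, '*.csv'):
            matches.append(os.path.join(root, filename))
    return matches


def search(ids, paths):
    """Return sorted (entry, path) pairs where entry is a substring of the file's text."""
    found = []
    for path in paths:
        with open(path, 'r') as f:
            text = f.read()  # each file is read exactly once
        for entry in ids:
            if entry in text:
                found.append((entry, path))
    return sorted(found)
```

You would call it along the lines of search(load_ids('file_with_entries.csv'), find_csv_files(os.getcwd())) and print the pairs it returns. The directory walk and the file reads now happen once in total instead of once per entry, but note that the inner loop still does one substring test per entry per file, so with 400000 entries it will not be instant.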

solution1
1 ACCEPTED 2016-04-12 16:11:38
