
Searching in files and folders is extremely slow in Python

I'll try to explain what I want to achieve with my code:

  1. I open a CSV file.
  2. I pick up every element of the first row and search for that string in every file in every subdirectory, starting from rootdir.

With the design shown below, it is extremely slow, even with 2 directories and one file in each directory. It takes approximately 1 second for each entry in the main file, and I've got 400,000 records in that file...

import csv
import os

rootdir = r'C:\Users\ST\Desktop\Sample'
f = open(r'C:\Users\ST\Desktop\inputIds.csv')
f.readline()
snipscsv_f = csv.reader(f, delimiter='\t')
for row in snipscsv_f:
    print 'processing another ID'
    for subdir, dir, files in os.walk(rootdir):
        print 'processing another folder'
        for file in files:
            print 'processing another file'
            if 'csv' in file: #i want only csv files to be processed
                ft = open(os.path.join(subdir, file))
                for ftrow in ft:
                    if row[0] in ftrow:
                        print row[0]
                ft.close()

I know you have a large CSV file, but it is still much quicker to read it all into memory once and compare against it than to repeat the os.walk for every entry.

Also, I'm not sure Python is the best tool for this. You may find shell scripts (on Windows, PowerShell is the only decent tool) much faster for this kind of task. Anyway, you added the Python tag, so...

import csv
import fnmatch
import os

# load the csv with entries
with open('file_with_entries.csv','r') as f:
    readr = csv.reader(f)
    data = []
    for row in readr:
        data.extend(row)

# find csv files
rootdir = os.getcwd() # could be anywhere
matches = []
for root, dirs, files in os.walk(rootdir):
    for filename in fnmatch.filter(files, '*.csv'):
        matches.append(os.path.join(root, filename))

# find occurrences of each entry in each file
for eachcsv in matches:
    with open(eachcsv, 'r') as f:
        text = f.read()
        for entry in data:
            if entry in text:
                print("found %s in %s" % (entry,eachcsv))
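With 400,000 entries, the inner loop above still runs one substring search per entry for every file. If the IDs always occur as complete fields rather than as arbitrary substrings (an assumption; the question does not say so), a set lookup may be considerably cheaper. The following is only a sketch for Python 3 under that assumption. The name find_ids is hypothetical, and matching whole fields is a change from the substring matching used above.

```python
import csv


def find_ids(lines, ids, delimiter=','):
    """Return the IDs from `ids` (a set) that occur as a whole field.

    `lines` is any iterable of text lines, e.g. an open file object.
    """
    found = set()
    for row in csv.reader(lines, delimiter=delimiter):
        # Each field of the row is looked up in the set, so the cost per
        # row depends on the row length, not on the number of IDs.
        found.update(ids.intersection(row))
    return found
```

To use it with the code above, build the set once with set(data), open each path in matches, and pass the file object as lines; the returned set holds the entries found in that file.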

I'm not sure how critical it is that you only read the first row of the entries file; it would be reasonably easy to amend the code to do just that.
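One possible amendment, as a rough sketch: the loop in the question uses row[0] and discards the first line, so this version assumes the wanted values are the first field of each row after a header line. The function name load_first_column and its parameters are hypothetical, and the delimiter should be set to whatever the entries file actually uses.

```python
import csv


def load_first_column(lines, delimiter=',', skip_header=True):
    """Collect the first field of every non-empty row into a set.

    `lines` is any iterable of text lines, e.g. an open file object.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    if skip_header:
        next(reader, None)  # discard the first row, if there is one
    # A set removes duplicate IDs, so each one is searched for only once.
    return {row[0] for row in reader if row and row[0]}
```

The result can take the place of data in the code above: open the entries file and pass the file object to load_first_column.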
