Most efficient way to randomly sample directory in Python

I have a directory with millions of items in it on a fairly slow disk. I want to sample 100 of those items randomly, and I want to do it using a glob as well.

One way to do it is to get a glob of every file in the directory, then sample that:

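import glob
import random

# Lists and sorts every match before sampling, which is the slow part.
files = sorted(glob.glob('*.xml'))
# Sample the paths themselves rather than indices into the list.
random_files = random.sample(files, 100)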

But this is really slow because I have to build up the big list of millions of files, which has to do a lot of disk crawling.

Is there a faster way to do this that doesn't hit the disk as much? It doesn't have to be a perfectly distributed sample or even do exactly 100 items, provided it's fast.

I'm thinking that:

  • Maybe we can use the inodes to be faster?
  • Maybe we can select items without knowing the entirety of what's on disk?
  • Maybe there's some shortcut that can make this faster.

Use os.listdir instead of glob. It's roughly twice as fast in the timing test below.

import glob
import os
import random
import time

# Time a full listing with glob.glob, then sample 100 of the paths found.
t = time.time()
files = sorted(glob.glob('tmp/*'))
file_count = len(files)
random_files = random.sample(files, 100)
t = time.time() - t
print("glob.glob: %.4fs, %d files found" % (t, file_count))

# Same measurement with os.listdir, which returns bare names and does no pattern matching.
t = time.time()
files = sorted(os.listdir('tmp/'))
file_count = len(files)
random_files = random.sample(files, 100)
t = time.time() - t
print("os.listdir: %.4fs, %d files found" % (t, file_count))

Output

glob.glob: 0.6782s, 124729 files found
os.listdir: 0.3183s, 124778 files found
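
Since the question asks for a glob, here is a minimal sketch of how the os.listdir approach could keep the pattern filter by using fnmatch on the returned names. The function name sample_listdir and its parameters are made up for illustration. One difference to be aware of: fnmatch does not treat leading dots specially, so '*' also matches hidden files, which glob skips.

```python
import fnmatch
import os
import random


def sample_listdir(directory, pattern, k):
    """Return up to k random names in `directory` that match the glob `pattern`.

    Hypothetical helper, not part of the original answer.
    """
    # os.listdir returns bare names (no directory prefix) in arbitrary order.
    names = fnmatch.filter(os.listdir(directory), pattern)
    # random.sample raises ValueError if k exceeds the population, so cap it.
    return random.sample(names, min(k, len(names)))
```

For example, sample_listdir('tmp', '*.xml', 100) would return at most 100 matching names. Sorting is skipped because it does not affect a random sample.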

Note: if the file names follow a pattern that lets you generate them randomly, that would be the way to go, because you would not need to list the directory at all. Renaming the files into a format suitable for random sampling would also work.
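
As a rough illustration of the name-generation idea, assume the files are numbered consecutively with no gaps, e.g. item_0000000.xml through item_0999999.xml. That naming scheme, the template string, and the function name sample_by_name are all assumptions made up for this sketch; nothing in the question says the files are named this way.

```python
import os
import random


def sample_by_name(directory, total, k, template="item_{:07d}.xml"):
    """Build k random paths without listing the directory.

    Assumes (hypothetically) that files are numbered 0..total-1 with no gaps
    and are named according to `template`.
    """
    # Sampling from a range does not materialize all the numbers in memory.
    indices = random.sample(range(total), k)
    return [os.path.join(directory, template.format(i)) for i in indices]
```

If the numbering can have gaps, each generated path would need an existence check (for example with os.path.exists), drawing replacements for any misses.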

Maybe we can use the inodes to be faster?

Not inodes, but directory entries. You don't want to call stat() on each file.

Maybe we can select items without knowing the entirety of what's on disk?

Yep, that's the plan: open the directory, read the directory entries, sample 100 out of the million, and only then touch the files themselves.

In C, this would be done with opendir()/readdir() calls.

In Python, the equivalent is os.scandir, which is part of the standard library as of Python 3.5. On older versions, get the backport from https://github.com/benhoyt/scandir
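
The plan above could be sketched roughly as follows, using os.scandir with reservoir sampling so the full list of names is never built or sorted. Reservoir sampling is a technique chosen here for illustration, not something the answer prescribes, and the function name sample_scandir is made up. The sketch assumes Python 3.6 or later, where os.scandir can be used as a context manager, and uses fnmatch to apply the glob pattern to each name.

```python
import fnmatch
import os
import random


def sample_scandir(directory, pattern, k):
    """Reservoir-sample up to k names matching `pattern` in a single pass.

    Hypothetical helper, not part of the original answer.
    """
    reservoir = []
    seen = 0  # number of matching entries encountered so far
    with os.scandir(directory) as entries:
        for entry in entries:
            # entry.name comes from the directory entry itself; no stat() call is made.
            if not fnmatch.fnmatch(entry.name, pattern):
                continue
            seen += 1
            if len(reservoir) < k:
                # Fill the reservoir with the first k matches.
                reservoir.append(entry.name)
            else:
                # Keep the new name with probability k / seen, replacing a random slot.
                j = random.randrange(seen)
                if j < k:
                    reservoir[j] = entry.name
    return reservoir
```

This still reads every directory entry once, but it holds only k names in memory. Since the question does not require a uniform sample, breaking out of the loop early would read fewer entries, at the cost of sampling only from whatever the OS returns first.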

UPDATE

Link to the Open Group docs for opendir()/readdir(): http://pubs.opengroup.org/onlinepubs/009695399/functions/opendir.html
