
How to efficiently test if files with a matching filename (regex or wildcard) exist in a directory?

I am searching for an efficient way to test whether files exist whose names match a certain pattern.

Examples using wildcards:

  • ????.*
  • ???????.*
  • *.png
  • *.jpg

Examples using regular expressions:

  • [012]{4}.*
  • [012]{7}.*

The problem is that the directory I have to test contains up to 500,000 files. The only way I know to perform such tests is to use the methods of the File class:

String[] list()
String[] list(FilenameFilter filter)
File[] listFiles()
File[] listFiles(FileFilter filter)
File[] listFiles(FilenameFilter filter)

The problem is that they are all implemented in basically the same way: first they call list() to get all available files, and then they apply the filter to the result.

Imagine what happens when we apply this to a folder containing 500,000 files...

Is there any alternative in Java for retrieving the filename of the first matching file in a directory without having to enumerate all of them?

If JNI is the only option, is there a library that can do this and that comes with pre-compiled binaries for the six major platforms (Linux, Windows and OS X, each in 32 and 64 bit)?
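To illustrate the behaviour described above, here is a minimal sketch that counts how often a FilenameFilter is consulted. The class name FilterCallCount and the temporary directory are made up for the demonstration; on the JDKs I am aware of, the filter is called once per directory entry, even though only one name can match.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class, not part of the JDK.
class FilterCallCount {

    // Creates `total` empty files in a fresh temp directory and returns how many
    // times File.list(FilenameFilter) invoked the filter.
    static int countFilterCalls(int total) throws IOException {
        File dir = Files.createTempDirectory("filter-demo").toFile();
        for (int i = 0; i < total; i++) {
            new File(dir, i + ".foo").createNewFile();
        }

        AtomicInteger calls = new AtomicInteger();
        // The filter accepts exactly one name, but it cannot tell list() to stop early.
        dir.list((parent, name) -> {
            calls.incrementAndGet();
            return name.equals("0.foo");
        });

        // Clean up the temporary files again.
        File[] leftovers = dir.listFiles();
        if (leftovers != null) {
            for (File f : leftovers) {
                f.delete();
            }
        }
        dir.delete();
        return calls.get();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("filter calls: " + countFilterCalls(1000));
    }
}
```

So the cost of a filtered listing grows with the size of the directory, not with the position of the first match.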

I think that you are confused. As far as I know, no current OS supports pattern listing/searching in its filesystem interface. All utilities that support patterns do so by listing the directory (e.g. by using readdir() on POSIX systems) and then performing string matching.

Therefore, there is no generic low-level way to do that more efficiently in Java or any other language. That said, you should investigate at least the following approaches:

  • Make sure that you only retrieve the file names and that you do not probe the file nodes themselves for additional metadata (e.g. their size), as that would cause additional operations for each file.

  • Retrieve the file list once and cache it, perhaps in association with a filesystem event notification interface for updates (e.g. JNotify or the Java 7 WatchService interface).
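A rough sketch of the two points above, assuming a hypothetical helper class CachedNameIndex (not an existing library class): it lists the names once with File.list(), which returns plain strings without touching per-file metadata, and answers later pattern queries from memory. The refresh() method would have to be called by whatever change notification you use; wiring up JNotify or WatchService is left out here.

```java
import java.io.File;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper: caches the file names of one directory.
class CachedNameIndex {
    private final File dir;
    private List<String> names = Collections.<String>emptyList();

    CachedNameIndex(File dir) {
        this.dir = dir;
        refresh();
    }

    // Re-reads the directory. Only names are fetched; no isFile()/length() calls.
    void refresh() {
        String[] listed = dir.list();
        names = (listed == null)
                ? Collections.<String>emptyList()
                : Arrays.asList(listed);
    }

    // Returns the first cached name that fully matches the pattern, if any.
    Optional<String> firstMatch(Pattern pattern) {
        for (String name : names) {
            if (pattern.matcher(name).matches()) {
                return Optional.of(name);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        CachedNameIndex index = new CachedNameIndex(new File("data"));
        // Pattern taken from the question; repeated queries reuse the cached list.
        System.out.println(index.firstMatch(Pattern.compile("[012]{4}.*")));
    }
}
```

The trade-off is the usual one for caches: queries are cheap, but the answer is only as fresh as the last refresh().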

EDIT:

I had a look at my Java implementation. The only obvious drawback in the methods of the File class is that listing a directory does not stop once a match is found. That would only matter, however, if you only perform the search once - otherwise it would still be far more efficient to cache the full directory list.

If you can use a relatively recent Java version, you might want to have a look at the Java NIO classes (1, 2), which do not seem to have the same weakness.
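As a sketch of what that could look like with Java 7 NIO: Files.newDirectoryStream accepts a glob and hands out entries through an iterator, so the caller can stop at the first match instead of building the full array. The class and method names (FirstMatchFinder, findFirst) are made up for this example, and how much work the underlying OS call saves is something I have not measured.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Iterator;
import java.util.Optional;

// Hypothetical helper, not part of the JDK.
class FirstMatchFinder {

    // Returns the first entry of `dir` whose file name matches the glob, if any.
    // Iteration ends as soon as one entry is found; the stream is always closed.
    static Optional<Path> findFirst(Path dir, String glob) throws IOException {
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            Iterator<Path> it = stream.iterator();
            return it.hasNext() ? Optional.of(it.next()) : Optional.<Path>empty();
        }
    }

    public static void main(String[] args) throws IOException {
        // Wildcard taken from the question: four characters, a dot, then anything.
        System.out.println(findFirst(Paths.get("data"), "????.*"));
    }
}
```

For the regular-expression patterns, the overload that takes a DirectoryStream.Filter can be used in the same way, with the filter applying a compiled Pattern to entry.getFileName().toString().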

This takes about 1 minute on my machine (which is fairly old):

import java.io.*;
import java.util.*;
import java.util.regex.*;
public class Main {
    static void match(File dir, Pattern pattern, List<File> matching) {
        File[] files = dir.listFiles();
        if(files==null) {
            System.out.println(dir + " is strange!");
            return;
        }
        for (File file : files) {
            if (file.isDirectory()) {
                match(file, pattern, matching); // recurse into subdirectories
            } else if (file.isFile()) {
                Matcher matcher = pattern.matcher(file.getName());
                if (matcher.matches()) {
                    matching.add(file);
                    //System.out.println(file + "************");
                }
            }
        }
    }
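    static void makeFiles(File dir,int n) throws IOException {
        dir.mkdirs(); // make sure the target directory exists before writing
        for(int i=0;i<n;i++) {
            File file=new File(dir,i+".foo");
            // try-with-resources closes the writer even if write() throws
            try (FileWriter fw=new FileWriter(file)) {
                fw.write(1);
            }
        }
    }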
    public static void main(String[] args) throws IOException {
        File dir = new File("data");
        final int n=500000;
        //makeFiles(dir,n);
        long t0=System.currentTimeMillis();
        Pattern pattern = Pattern.compile(".*\\.foo");
        List<File> matching = new ArrayList<File>();
        match(dir, pattern, matching);
        long t1=System.currentTimeMillis();
        System.out.println("found: "+matching.size());
        System.out.println("elapsed time: "+(t1-t0)/1000.);
        System.out.println("files/second: "+n/((t1-t0)/1000.));
    }
}

I think you are putting the proverbial cart before the horse.

  1. As Knuth said, premature optimization is the root of all evil. Have you tried using the FileFilter method and found that it is too slow for the application?

  2. Why do you have so many files in one folder? Perhaps the more beneficial approach would be to split those files up in some manner instead of having them all in one folder.
