
How to efficiently test if files with a matching filename (regex or wildcard) exist in a directory?

I am looking for an efficient way to test whether any files exist whose names match a certain pattern.

Examples using wildcards:

  • ????.*
  • ???????.*
  • *.png
  • *.jpg

Examples using regular expressions:

  • [012]{4}.*
  • [012]{7}.*

The problem is that the directory I have to test contains up to 500,000 files. The only way I know to perform such tests is to use the methods of the File class:

String[] list()
String[] list(FilenameFilter filter)
File[] listFiles()
File[] listFiles(FileFilter filter)
File[] listFiles(FilenameFilter filter)

The problem is that they are all implemented in essentially the same way: they first call list() to get all available files and then apply the filter to the result.
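To make the concern concrete, here is a minimal sketch of the list-then-filter behavior described above. It is an approximation written for illustration, not the JDK source; the class and method names (ListThenFilter, listThenFilter) are made up.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of "list everything, then filter".
final class ListThenFilter {
    static String[] listThenFilter(File dir, FilenameFilter filter) {
        // Reads every entry name in the directory before any filtering happens.
        String[] all = dir.list();
        if (all == null || filter == null) {
            return all; // null means "not a directory" or an I/O error
        }
        List<String> kept = new ArrayList<>();
        for (String name : all) {
            if (filter.accept(dir, name)) {
                kept.add(name);
            }
        }
        return kept.toArray(new String[0]);
    }
}
```

With this shape, the cost is proportional to the total number of entries even when only one match is needed.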

Imagine what happens if we apply this to a folder containing 500,000 files...

Is there any alternative in Java for retrieving the filename of the first matching file in a directory without having to enumerate all of them?

If JNI is the only option, is there a library that can do this and that comes with pre-compiled binaries for the six major platforms (Linux, Windows and OS X, each in 32 and 64 bit)?

I think that you are confused. As far as I know, no current OS supports pattern listing/searching in its filesystem interface. All utilities that support patterns do so by listing the directory (e.g. by using readdir() on POSIX systems) and then performing string matching.

Therefore, there is no generic low-level way to do that more efficiently in Java or any other language. That said, you should investigate at least the following approaches:

  • making sure that you only retrieve the file names and that you do not probe the file nodes themselves for additional metadata (e.g. their size), as that would cause additional operations for each file.

  • retrieving the file list once and caching it, perhaps in association with a filesystem event notification interface for updates (e.g. JNotify or the Java 7 WatchService interface).
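The two suggestions above could be combined roughly as follows. This is only a sketch under stated assumptions: NameCache and its methods are hypothetical names, it stores entry names only (so subdirectory names are included, because checking the entry type would mean probing each file), and the caller is expected to call refresh() before querying. Event delivery latency of WatchService differs between platforms.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper: caches the entry names of one directory (not thread-safe).
final class NameCache implements AutoCloseable {
    private final Path dir;
    private final Set<String> names = new HashSet<>();
    private final WatchService watcher;

    NameCache(Path dir) throws IOException {
        this.dir = dir;
        this.watcher = dir.getFileSystem().newWatchService();
        // Register before the initial listing so changes made meanwhile are not lost.
        dir.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_DELETE);
        reload();
    }

    // Full listing: names only, no per-file metadata lookups.
    private void reload() throws IOException {
        names.clear();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                names.add(entry.getFileName().toString());
            }
        }
    }

    // Applies all pending create/delete events without blocking.
    void refresh() throws IOException {
        boolean overflow = false;
        WatchKey key;
        while ((key = watcher.poll()) != null) {
            for (WatchEvent<?> event : key.pollEvents()) {
                if (event.kind() == StandardWatchEventKinds.OVERFLOW) {
                    overflow = true; // events were dropped; the cache cannot be trusted
                } else if (event.kind() == StandardWatchEventKinds.ENTRY_CREATE) {
                    names.add(event.context().toString());
                } else {
                    names.remove(event.context().toString());
                }
            }
            key.reset();
        }
        if (overflow) {
            reload();
        }
    }

    // Pure string matching against the cached names; no filesystem access.
    Optional<String> firstMatch(Pattern pattern) {
        for (String name : names) {
            if (pattern.matcher(name).matches()) {
                return Optional.of(name);
            }
        }
        return Optional.empty();
    }

    @Override
    public void close() throws IOException {
        watcher.close();
    }
}
```

The full listing is paid once in the constructor; later queries such as firstMatch(Pattern.compile("[012]{4}.*")) only touch memory, which is why this pays off when the search is repeated.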

EDIT:

I had a look at my Java implementation. The only obvious drawback in the methods of the File class is that listing a directory does not stop once a match is found. That only matters, however, if you perform the search just once; otherwise it would still be far more efficient to cache the full directory list.

If you can use a relatively recent Java version, you might want to have a look at the Java NIO classes (1, 2), which do not seem to have the same weakness.
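As a rough illustration of the NIO approach, the sketch below uses Files.newDirectoryStream (Java 7+), which iterates over the directory lazily, so the loop can stop at the first match. The class and method names (FirstMatch, firstByGlob, firstByRegex) are made up for this example. Note that if nothing matches, the whole directory is still read.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper: returns the name of the first matching entry, if any.
final class FirstMatch {
    // Wildcard variant: the glob (e.g. "*.png" or "????.*") is matched
    // against each entry's file name.
    static Optional<String> firstByGlob(Path dir, String glob) throws IOException {
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path entry : stream) {
                // Returning here closes the stream and stops reading the directory.
                return Optional.of(entry.getFileName().toString());
            }
        }
        return Optional.empty();
    }

    // Regex variant: the filter only looks at the name, not at file attributes.
    static Optional<String> firstByRegex(Path dir, Pattern regex) throws IOException {
        DirectoryStream.Filter<Path> filter =
                entry -> regex.matcher(entry.getFileName().toString()).matches();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, filter)) {
            Iterator<Path> it = stream.iterator();
            return it.hasNext()
                    ? Optional.of(it.next().getFileName().toString())
                    : Optional.empty();
        }
    }
}
```

A call such as FirstMatch.firstByGlob(Paths.get("data"), "*.png") answers the "does any such file exist" question through isPresent() without building an array of all entries.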

This takes about 1 minute on my machine (which is sorta old):

import java.io.*;
import java.util.*;
import java.util.regex.*;
public class Main {
    static void match(File dir, Pattern pattern, List<File> matching) {
        File[] files = dir.listFiles();
        if(files==null) {
            System.out.println(dir + " is strange!");
            return;
        }
        for (File file : files)
            if (file.isDirectory()) match(file, pattern, matching);
            else if (file.isFile()) {
                Matcher matcher = pattern.matcher(file.getName());
                if (matcher.matches()) {
                    matching.add(file);
                    //System.out.println(file + "************");
                }
            }
    }
    static void makeFiles(File dir,int n) throws IOException {
        for(int i=0;i<n;i++) {
            File file=new File(dir,i+".foo");
            FileWriter fw=new FileWriter(file);
            fw.write(1);
            fw.close();
        }
    }
    public static void main(String[] args) throws IOException {
        File dir = new File("data");
        final int n=500000;
        //makeFiles(dir,n);
        long t0=System.currentTimeMillis();
        Pattern pattern = Pattern.compile(".*\\.foo");
        List<File> matching = new LinkedList<File>();
        match(dir, pattern, matching);
        long t1=System.currentTimeMillis();
        System.out.println("found: "+matching.size());
        System.out.println("elapsed time: "+(t1-t0)/1000.);
        System.out.println("files/second: "+n/((t1-t0)/1000.));
    }
}

I think you are putting the proverbial cart before the horse.

  1. As Knuth said, premature optimization is the root of all evil. Have you tried using the FileFilter method and found that it is too slow for your application?

  2. Why do you have so many files in one folder? Perhaps the more beneficial approach would be to split those files up in some manner instead of having them all in one folder.
