Finding a list of files faster in a directory with subdirectories

Question

I have a list of files (27,000 of them). The goal is to search for each of these files in a directory structure (which has multiple levels of subdirectories) and print the ones that are missing. I have code with a recursive function that checks whether a file is present. The code seems to work, but it is very slow in this scenario, where the number of files to search for is very large. Is there any way to improve the performance of this code?

The code snippet is below:

public static boolean walk(String path, String fileName) throws Exception {

    File root = new File(path);
    File[] list = root.listFiles();

    if (list == null)
        return false;

    for (File f : list) {
        if (f.isDirectory()) {
            walk(f.getAbsolutePath(), fileName);
        } else {
            if (f.getAbsoluteFile().getName().equalsIgnoreCase(fileName)) {
                presentFiles.add(f.getAbsoluteFile().getName());
                throw new Exception("hi");
            }
        }
    }
    return false;
}



public static void main(String[] args) {

    int i = 0;

    for (String fileName : attrSet) { // attrSet is a HashSet of all the file names being searched for
        try {
            boolean isFileFound = walk(source, fileName);
        } catch (Exception e) {
            System.out.println(e.getMessage() + i++);
        }
    }

    attrSet.removeAll(presentFiles); //presentFiles is HashSet of all files present in the directory

    for (String fileNm : attrSet) {
        System.out.println("FileName : " + fileNm);
    }

}

Answer 1

As already mentioned in a comment, turn the process around:

Put the file names in the list into a hash set.
Recursively traverse the directory structure once, removing every file you find from the hash set as you go.
The hash set now contains only the missing files.

This should take about as long as the current code needs to test a single file (if we don't take disk caching into account). The speedup is therefore almost a factor of 27000.
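The three steps above could look roughly like the following. This is only a minimal sketch, not the asker's code: the class name MissingFileFinder and the method findMissing are made up for illustration, and names are lower-cased on the assumption that matching should stay case-insensitive, as with equalsIgnoreCase in the question. It uses the standard Files.walkFileTree instead of a hand-written recursion.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Collection;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class, named here for illustration only.
class MissingFileFinder {

    /** Returns the wanted file names that do not occur anywhere below root. */
    static Set<String> findMissing(Path root, Collection<String> wanted) throws IOException {
        // Key: lower-cased name (assumption: case-insensitive matching, as in the question).
        // Value: the name exactly as it was given, so it can be reported unchanged.
        Map<String, String> missing = new HashMap<>();
        for (String name : wanted) {
            missing.put(name.toLowerCase(Locale.ROOT), name);
        }

        // Walk the tree exactly once; each file costs one hash lookup.
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                missing.remove(file.getFileName().toString().toLowerCase(Locale.ROOT));
                // Stop early once every wanted file has been seen.
                return missing.isEmpty() ? FileVisitResult.TERMINATE : FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                // Skip unreadable entries instead of aborting the whole walk.
                return FileVisitResult.CONTINUE;
            }
        });

        // Whatever was never removed is missing; sorted for stable output.
        return new TreeSet<>(missing.values());
    }
}
```

With the names from the question, the call would be something like MissingFileFinder.findMissing(Paths.get(source), attrSet), followed by printing each returned name. The directory tree is read once regardless of how many names are in the set, which is where the speedup comes from.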

Solution 1: accepted 2016-08-31 05:18:40
