
Efficiently Traverse Directory Tree with opendir(), readdir() and closedir()

The C routines opendir(), readdir() and closedir() provide a way for me to traverse a directory structure. However, the dirent structures returned by readdir() do not seem to give me a useful way to obtain the DIR pointers I would need to recurse into the directory's subdirectories.

Of course, they give me the names of the files, so I could either append each name to the directory path and then stat() and opendir() it, or I could change the current working directory of the process via chdir() and roll it back via chdir("..").

The problem with the first approach is that if the directory path is long enough, the cost of passing a string containing it to opendir() will outweigh the cost of opening the directory. To put it a bit more theoretically, the complexity could grow beyond linear time (in the total character count of the (relative) filenames in the directory tree).

The second approach has a problem too. Since each process has a single current working directory, all but one thread will have to block in a multithreaded application. Also, I don't know whether the current working directory is merely a convenience (i.e., the relative path is simply appended to it prior to a filesystem query). If it is, this approach will be inefficient too.

I am open to alternatives to these functions. So how can one traverse a UNIX directory tree efficiently (in time linear in the total character count of the file names under it)?

Have you tried ftw(), a.k.a. File Tree Walk?

Snippet from man 3 ftw:

int ftw(const char *dir, int (*fn)(const char *file, const struct stat *sb, int flag), int nopenfd);

ftw() walks through the directory tree starting from the indicated directory dir. For each found entry in the tree, it calls fn() with the full pathname of the entry, a pointer to the stat(2) structure for the entry and an int flag.
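As a rough sketch of how that callback interface might be used, the following counts regular files and directories under a root. The names count_tree and count_entry and the two counters are made up for illustration; the nopenfd value of 16 is an arbitrary choice. Because ftw() has no user-data argument, the callback communicates through file-scope variables here, which also means this sketch is not thread-safe.

```c
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical counters: ftw() passes no user data to the callback,
 * so results are collected in file-scope state. */
static long n_files;
static long n_dirs;

/* Called by ftw() once per entry found in the tree. */
static int count_entry(const char *path, const struct stat *sb, int flag)
{
    (void)path;
    (void)sb;
    if (flag == FTW_D)
        n_dirs++;           /* a directory (the root included) */
    else if (flag == FTW_F)
        n_files++;          /* a regular file */
    return 0;               /* a non-zero return would stop the walk */
}

/* Walk the tree at root; returns 0 on success, -1 on error. */
int count_tree(const char *root, long *files, long *dirs)
{
    n_files = 0;
    n_dirs = 0;
    /* 16 = upper bound on descriptors ftw() may keep open at once */
    if (ftw(root, count_entry, 16) == -1) {
        perror("ftw");
        return -1;
    }
    *files = n_files;
    *dirs = n_dirs;
    return 0;
}
```

Calling count_tree(".", &files, &dirs) would then fill in both totals for the current directory. Note that ftw() builds the full pathname for every entry itself, so it does not avoid the string handling the question worries about; it only hides it.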

You seem to be missing one basic point: directory traversal involves reading data from the disk. Even if that data is in the cache, you end up going through a fair amount of code to get it from the cache into your process. Paths are also generally pretty short; anything more than a couple hundred bytes is unusual. Together these mean that you can quite reasonably build up strings for all the paths you need without any real problem. The time spent building the strings is still minor compared to the time spent reading data from the disk. That means you can normally ignore the time spent on string manipulation and work exclusively on optimizing disk usage.

My own experience has been that for most directory traversal a breadth-first search is usually preferable: as you're traversing the current directory, put the full paths to all sub-directories in something like a priority queue. When you're finished traversing the current directory, pull the first item from the queue and traverse it, continuing until the queue is empty. This generally improves cache locality, so it reduces the amount of time spent reading the disk. Depending on the system (disk speed vs. CPU speed, total memory available, etc.) it's nearly always at least as fast as a depth-first traversal, and can easily be up to twice as fast (or so).
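A minimal sketch of that breadth-first scheme, assuming a plain FIFO linked list stands in for the queue (no priorities are used) and that printing each path is all the work needed per entry. The names bfs_walk, make_node and struct node are hypothetical, and lstat() is used so that symbolic links are not followed.

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical FIFO node holding one pending directory path. */
struct node {
    struct node *next;
    char path[];                    /* flexible array member */
};

/* Build a node for "dir/name", or for dir alone when name is NULL. */
static struct node *make_node(const char *dir, const char *name)
{
    size_t len = strlen(dir) + (name ? strlen(name) + 1 : 0) + 1;
    struct node *n = malloc(sizeof *n + len);
    if (n == NULL)
        return NULL;
    n->next = NULL;
    if (name)
        snprintf(n->path, len, "%s/%s", dir, name);
    else
        snprintf(n->path, len, "%s", dir);
    return n;
}

/* Breadth-first walk: prints every entry below root and returns how many
 * were seen, or -1 if memory runs out. Unreadable directories are skipped. */
long bfs_walk(const char *root)
{
    long seen = 0;
    struct node *head = make_node(root, NULL), *tail = head;

    if (head == NULL)
        return -1;
    while (head != NULL) {
        DIR *d = opendir(head->path);
        struct dirent *ent;
        struct node *done;

        while (d != NULL && (ent = readdir(d)) != NULL) {
            struct stat st;
            struct node *n;

            if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
                continue;
            n = make_node(head->path, ent->d_name);
            if (n == NULL) {        /* out of memory: drop the queue */
                closedir(d);
                while (head != NULL) {
                    done = head;
                    head = head->next;
                    free(done);
                }
                return -1;
            }
            printf("%s\n", n->path);
            seen++;
            if (lstat(n->path, &st) == 0 && S_ISDIR(st.st_mode)) {
                tail->next = n;     /* sub-directory: visit it later */
                tail = n;
            } else {
                free(n);
            }
        }
        if (d != NULL)
            closedir(d);
        done = head;                /* pop the finished directory */
        head = head->next;
        free(done);
    }
    return seen;
}
```

Only one directory stream is open at a time, at the cost of keeping the full path of every pending sub-directory in memory.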

The way to use opendir / readdir / closedir is to make the function recursive! Have a look at the snippet here on Dreamincode.net.

Hope this helps.

EDIT: Thanks R.Sahu, the link has expired; however, I found it via the Wayback archive and took the liberty of adding it to a gist. Please remember to check the license accordingly and attribute the original author for the source! :)

Probably overkill for your application, but here's a library designed to traverse a directory tree with hundreds of millions of files.

https://github.com/hpc/libcircle

Instead of opendir(), you can use a combination of openat(), dirfd() and fdopendir() and construct a recursive function to walk a directory tree:

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <dirent.h>

void
dir_recurse (DIR *parent, int level)
{
    struct dirent *ent;
    DIR *child;
    int fd;

    while ((ent = readdir(parent)) != NULL) {
        if ((strcmp(ent->d_name, ".") == 0) ||
            (strcmp(ent->d_name, "..") == 0)) {
            continue;
        }
        if (ent->d_type == DT_DIR) {
            printf("%*s%s/\n", level, "", ent->d_name);
            fd = openat(dirfd(parent), ent->d_name, O_RDONLY | O_DIRECTORY);
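            if (fd != -1) {
                child = fdopendir(fd);
                if (child != NULL) {
                    dir_recurse(child, level + 1);
                    closedir(child);    /* also closes fd */
                } else {
                    perror("fdopendir");
                    close(fd);          /* fd is still ours on failure */
                }
            } else {
                perror("openat");
            }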
        } else {
            printf("%*s%s\n", level, "", ent->d_name);
        }
    }
}

int
main (int argc, char *argv[])
{
    DIR *root;

    root = opendir(".");
    if (root == NULL) {
        perror("opendir");
        return 1;
    }
    dir_recurse(root, 0);
    closedir(root);

    return 0;
}

Here readdir() is still used to get the next directory entry. If the next entry is a directory, then we find the parent directory fd with dirfd() and pass this, along with the child directory name, to openat(). The resulting fd refers to the child directory. This is passed to fdopendir(), which returns a DIR * pointer for the child directory, which can then be passed to our dir_recurse(), where it will again be valid for use with readdir() calls.

This program recurses over the whole directory tree rooted at . (the current directory). Entries are printed, indented by 1 space per directory level. Directories are printed with a trailing /.

On ideone.
