
How to find a loop in the file system?

How do I find a loop in the file system on Linux? I am indexing all files to make search fast (O(1)). I am implementing this in C, using the library functions in dir.h. I can scan the entire file system, but the scan goes into a loop if there is a loop in the file system (for example, a loop mount). How do I find the loop? I have seen the updatedb command report when there is a loop in the file system, but I don't understand the logic. Can anyone please help me find a solution for this?

The general way to prevent re-scanning nodes in a graph is to mark nodes as you pass them, and then ignore nodes that are marked. This isn't terribly practical if you don't want to modify the graph you are scanning over, so you need a way to mark the nodes externally. The easiest way I can think of to do this under Linux would be to store the device/inode for each directory you visit. Then when you look at a directory, first check that you haven't already seen any directory with the same device/inode. This not only handles cycles, but also trees that merge back into each other.

To get the device/inode number, take a look at the stat/fstat functions and the st_dev and st_ino members of the stat struct.

For storing the data, you're probably wanting to look at a hash table or a binary tree.

Btw, you don't need to search for a loop in the file system.

You are indexing the whole disk, so you don't need to follow symbolic links, since each file must be accessible in the normal way (without symlinks). You just need to check the mount points: if some disk is mounted more than once, ignore all but the first of its mount points.
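On Linux the mount list can be read from /proc/mounts with glibc's getmntent interface; a minimal sketch (the function name is made up, and this is Linux/glibc-specific):

```c
#include <assert.h>
#include <mntent.h>
#include <stdio.h>

/* Count the mount entries listed in /proc/mounts.
 * Returns the number of entries, or -1 if the file can't be opened. */
int count_mounts(void) {
    FILE *fp = setmntent("/proc/mounts", "r");
    if (!fp)
        return -1;
    int n = 0;
    struct mntent *ent;
    while ((ent = getmntent(fp)) != NULL)
        n++;   /* ent->mnt_fsname / ent->mnt_dir identify each mount */
    endmntent(fp);
    return n;
}
```

To apply the advice above, you would collect `mnt_fsname` for each entry before indexing and skip any mount point whose source device you have already seen.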

I found this interesting comment here about finding loops in a DAG:

Steinar H. Gunderson wrote:

On Thu, 26 Feb 2004 00:28:32 +0100, Orlondow wrote:

...also reproduced in the Cormen-Leiserson-Rivest, IIC. Which is easiest to find.

Yes, I actually have Cormen et al, but it never struck me to look up "strongly connected components" when I wanted cycle detection. Thanks, I'll have a look at it. :-)

To find a cycle in a directed graph (you don't care which cycle) as long as one exists, you don't need to go overboard with SCCs. Plain old depth-first search, DFS (in the same chapter of CLRS), is enough.

So, roughly speaking, while you are traversing the directory tree, create a DAG which represents the structure of the tree, with the data at each node referring to the inode of the file. Then you just check to make sure you don't visit a node more than once.

Perhaps I'm being a bit dim here, but aren't these the two ways you could create a cycle:

  • by creating a symbolic link
  • by mounting something twice

To deal with these, you can get a list of the mounts before you start indexing and ignore all but the first mount of the same fs, and you can ignore symbolic links as you come across them in the indexing process.
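Recognizing a symbolic link without following it takes lstat(2), which (unlike stat) reports on the link itself; a small sketch, with an invented helper name:

```c
#include <assert.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if path is itself a symbolic link, 0 if not, -1 on error.
 * lstat(), unlike stat(), does not follow the final link. */
int is_symlink(const char *path) {
    struct stat st;
    if (lstat(path, &st) != 0)
        return -1;
    return S_ISLNK(st.st_mode) ? 1 : 0;
}
```

An indexer that calls `is_symlink()` on each entry and skips the hits can never be led into a symlink cycle.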

Simple way: just do a depth-first tree-walk of the directory tree, keeping a stack of the nodes above you as you go. At each node you visit, if that node is already in the stack, you have a cycle.

 /* here's a stack of the (device, inode) pairs of the nodes above us */
 struct node { dev_t dev; ino_t ino; };
 struct node stack[1000];

 void walk(struct node n, int level) {
     for (int i = 0; i < level; i++)
         if (stack[i].dev == n.dev && stack[i].ino == n.ino)
             return;              /* n is already above us: a cycle */
     stack[level] = n;
     /* for each subnode x of n: walk(x, level + 1) */
 }

As others have said, there is no such thing as a loop in a file system if you realize that the path is part of a file name, unless it's a cyclical symbolic link.

For instance, if you bootstrap some distro (let's say Debian) to a loop device, or even to a directory, and do this ON a Debian machine, you have now duplicated a lot of things.

For instance, let's say you are running Debian Lenny, and bootstrap a minimal copy of it to /lenny.

/lenny/usr/* is going to be the same as /usr/*. There is no 'cheap' way of avoiding this.

Since you're already calling stat() on each node (I'm assuming you're using ftw()/ftw64()), you may as well:

  • Have ftw()'s callback insert the name of the node into an array whose structure members can store a hash of the file that is unlikely to collide. MD5 isn't going to cut it for this.
  • Update a hash table based on that digest and the file name (not the path).
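The point about already having the stat data can be seen in the POSIX nftw() interface, which hands the callback a ready-made struct stat; a minimal sketch (the callback body and function names are placeholders, not the hashing scheme described above):

```c
#define _XOPEN_SOURCE 500   /* needed for nftw() */
#include <assert.h>
#include <ftw.h>
#include <stdio.h>

static int file_count;

/* nftw() passes each node's stat buffer to the callback, so no extra
 * stat() call is needed per node; sb->st_dev / sb->st_ino are free. */
static int visit(const char *path, const struct stat *sb,
                 int typeflag, struct FTW *ftwbuf) {
    (void)path; (void)sb; (void)ftwbuf;
    if (typeflag == FTW_F)
        file_count++;       /* a regular file: index/hash it here */
    return 0;               /* keep walking */
}

/* Count regular files under root; returns -1 on failure. */
int count_files(const char *root) {
    file_count = 0;
    if (nftw(root, visit, 16, FTW_PHYS) != 0)  /* FTW_PHYS: don't follow symlinks */
        return -1;
    return file_count;
}
```

The FTW_PHYS flag also gives the symlink-skipping behaviour discussed earlier for free.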

That won't speed up your scan, but it will significantly cut down on the time it takes to search.

If you use threads correctly and set affinity, the hashing and indexing can happen on one core while the other is I/O bound (when multiple cores are available).

However, 'just' checking for duplicate mounts is not going to be a cure; plus, I'm sure your program would want to return the locations of all files named 'foo', even if there are four identical copies of it.

This is more typically referred to as a "cycle," so you want to implement "cycle detection." There are many ways to do this; I don't know if this is homework or not, but one simple method, not necessarily the most optimal, is pointer chasing.
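"Pointer chasing" here presumably means Floyd's tortoise-and-hare algorithm; a sketch on a toy linked list (the node type is made up for the example, and a directory tree would need an analogous "next" step):

```c
#include <assert.h>
#include <stddef.h>

struct node { struct node *next; };

/* Floyd's tortoise-and-hare: advance one pointer by one step and the
 * other by two; they meet if and only if the chain contains a cycle.
 * Uses O(1) extra space, unlike the visited-set approaches above. */
int has_cycle(struct node *head) {
    struct node *slow = head, *fast = head;
    while (fast && fast->next) {
        slow = slow->next;
        fast = fast->next->next;
        if (slow == fast)
            return 1;
    }
    return 0;
}
```

The trade-off: no memory per visited node, but it only detects that a cycle exists along one chain, which is why the device/inode set is the more practical fit for a filesystem indexer.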
