
Speeding up GNU find on several folders

On a 64-bit Linux (CentOS) server, I am running a GNU find command on several folders, each of which contains a similar subfolder structure. The structure is:

/my/group/folder/project_123/project_123-12345678/*/*file_pattern_at_this_level*
/my/group/folder/project_234/project_234-23456789/*/*file_pattern_at_this_level*

The /*/ component indicates that each project folder contains a number of subfolders with varying names.

I have tried adding the final asterisk and then limiting the find command to a fixed depth with -mindepth N and -maxdepth N:

find $folder1 $folder2 $folder3 -mindepth 1 -maxdepth 1 -name "*file_pattern*"

But the tests run on a server node that has other jobs running, so it is difficult to get a fair performance comparison. The main obstacle is that some level of caching takes place after the first command, so whichever variant runs first looks slow and the equivalent variant that runs second looks faster.

This is a multicore node, so what else could I try to make this type of command faster?

"Actually commands like find and grep are almost always IO-bound: the disk is the bottleneck, not the CPU. In such cases, if you run several instances in parallel, they will compete for I/O bandwidth and cache, and so they will be slower." - https://unix.stackexchange.com/a/111409

Don't worry about "finding" the files; worry about what you need to do with them. For that, you can parallelize with "parallel" or "xargs".
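As a rough sketch of that idea: a single find produces the file list, and xargs fans the per-file work out across several workers. The temporary demo tree, the worker count of 4, and the use of wc -c as the "work" are all placeholders chosen for illustration; substitute your real folders and command.

```shell
#!/bin/sh
# Hypothetical demo tree standing in for /my/group/folder/project_*/...
demo=$(mktemp -d)
for p in project_123/project_123-12345678 project_234/project_234-23456789; do
    for sub in a b; do
        mkdir -p "$demo/$p/$sub"
        echo "data" > "$demo/$p/$sub/x_file_pattern_y.txt"
    done
done

# A single find walks the tree and emits NUL-separated names (safe for
# unusual file names). The target files sit two levels below each
# starting directory, hence -mindepth 2 -maxdepth 2.
# xargs runs up to 4 workers at a time (-P 4), one file per invocation (-n 1).
# `wc -c` is only a placeholder for the real per-file processing.
find "$demo"/project_*/project_*-* -mindepth 2 -maxdepth 2 \
     -name '*file_pattern*' -print0 |
    xargs -0 -n 1 -P 4 wc -c > "$demo/results.txt"

# Count how many files were processed (one result line per file).
processed=$(grep -c . "$demo/results.txt")
echo "processed $processed files"
```

The directory walk stays sequential here, so it does not compete with itself for disk I/O; only the processing step is parallel.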

If you still want to pursue that, you could try using "parallel" together with find, passing it a list of directories. That will cause parallel to spawn a bunch of find processes to work through the "queue" (the -j option sets how many jobs run simultaneously). In this scenario you will need to redirect standard output to a file so that you can review the output later, depending on your use.
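A minimal sketch of the one-find-per-directory approach follows. It uses xargs -P as a stand-in for GNU parallel, since xargs is more commonly installed; the demo tree, the limit of 2 simultaneous jobs, and the output file names are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical demo tree standing in for the real project folders.
demo=$(mktemp -d)
for p in project_123/project_123-12345678 project_234/project_234-23456789; do
    mkdir -p "$demo/$p/sub_a" "$demo/$p/sub_b"
    touch "$demo/$p/sub_a/x_file_pattern_y" "$demo/$p/sub_b/other"
done

out=$(mktemp -d)   # each find writes its results to its own file here

# Feed the directory list to xargs, which starts one find per directory
# and keeps at most 2 running at once (-P 2).
# With GNU parallel the core of this should be roughly:
#   parallel -j 2 find {} -mindepth 2 -maxdepth 2 -name '*file_pattern*' ::: dirs...
printf '%s\0' "$demo"/project_*/project_*-* |
    xargs -0 -P 2 -I{} sh -c '
        find "$1" -mindepth 2 -maxdepth 2 -name "*file_pattern*" \
            > "$2/$(basename "$1").list"
    ' sh {} "$out"

# Merge the per-directory results for later review.
cat "$out"/*.list > "$out/all.txt"
matches=$(grep -c . "$out/all.txt")
echo "found $matches matching files"
```

Writing each job's output to a separate file avoids interleaved lines from concurrent processes. Whether this is faster than a single find depends on the storage, as the quoted answer points out, so it is worth timing both on the real data.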

