
What is the fastest / easiest way to count a large number of files in a directory (in Linux)?

I had a directory with a large number of files. Every time I tried to access the list of files within it, I either could not or there was a significant delay. I tried using the ls command from the command line on Linux, and the web interface from my hosting provider did not help either.

The problem is that when I just run ls, it takes a significant amount of time to even start displaying anything. Thus, ls | wc -l would not help either.

After some research I came up with this code (in this example it counts the number of new emails on some server):

from os import walk
print(sum(len(files) for (root, dirs, files) in walk('/home/myname/Maildir/new')))

The above code is written in Python. I used Python's interactive command line and it worked pretty fast (returned the result instantly).

I am interested in the answer to the following question: is it possible to count files in a directory (without subdirectories) faster? What is the fastest way to do that?

ls does a stat(2) call for every file. Other tools, like find(1) and shell wildcard expansion, may avoid this call and just do readdir. One shell command combination that might work is find dir -maxdepth 1 | wc -l, but it will gladly list the directory itself and miscount any filename with a newline in it.
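One way to sidestep the newline pitfall is to count directory entries inside a language runtime instead of counting lines of text output. A minimal Python sketch (the function name is my own, not from the answer above):

```python
import os

def count_entries(path):
    """Count directory entries (files, dirs, links, dotfiles alike).

    Each entry is one list element regardless of what characters
    its name contains, so names with embedded newlines are counted
    correctly -- unlike piping listing output to `wc -l`.
    """
    return len(os.listdir(path))
```

Note this counts every entry, including subdirectories; filtering to regular files requires checking the entry type.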

From Python, the straightforward way to get just these names is os.listdir(directory). Unlike os.walk and os.path.walk, it does not need to recurse, check file types, or make further Python function calls.

Addendum: It seems ls doesn't always stat. At least on my GNU system, it can do only a getdents call when further information (such as which names are directories) is not requested. getdents is the underlying system call used to implement readdir in GNU/Linux.
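On Python 3.5+, os.scandir exposes the same per-entry type information that getdents returns, so regular files can often be counted without a stat(2) call per entry. A sketch under that assumption (the function name is illustrative):

```python
import os

def count_regular_files(path):
    """Count regular files in a directory (non-recursive).

    DirEntry.is_file() can usually answer from the d_type field
    that getdents already returned, avoiding a per-entry stat(2);
    it only falls back to stat when the filesystem reports the
    type as unknown.
    """
    count = 0
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_file(follow_symlinks=False):
                count += 1
    return count
```

This keeps the single-pass behavior of os.listdir while also filtering out subdirectories.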

Addendum 2: One reason for a delay before ls outputs results is that it sorts and tabulates. ls -U1 may avoid this.

Total number of files in the given directory:

find . -maxdepth 1 -type f | wc -l

Total number of files in the given directory and all subdirectories under it:

find . -type f | wc -l

For more details, drop into a terminal and do man find.

This should be pretty fast in Python:

from os import listdir
from os.path import isfile, join
directory = '/home/myname/Maildir/new'
print(sum(1 for entry in listdir(directory) if isfile(join(directory, entry))))

I think ls spends most of its time before displaying the first line because it has to sort the entries, so ls -U should display the first line much faster (though it may not be much better overall).

The fastest way would be to avoid all the overhead of interpreted languages and write some code that directly addresses your problem. Doing so is difficult to do in a portable way, but pretty straightforward. At the moment I'm on an OS X box, but converting the following to Linux should be extremely straightforward. (I opted to ignore hidden files and only count regular files... modify as necessary or add command-line switches to get the functionality you want.)

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

int
main( int argc, char **argv )
{
    DIR *d;
    struct dirent *f;
    int count = 0;
    char *path = argv[ 1 ];

    if( path == NULL ) {
        fprintf( stderr, "usage: %s path\n", argv[ 0 ] );
        exit( EXIT_FAILURE );
    }
    d = opendir( path );
    if( d == NULL ) { perror( path );exit( EXIT_FAILURE ); }
    while( ( f = readdir( d ) ) != NULL ) {
        /* skip hidden files; count only regular files
           (d_type may be DT_UNKNOWN on some filesystems) */
        if( f->d_name[ 0 ] != '.'  &&  f->d_type == DT_REG )
            count += 1;
    }
    closedir( d );
    printf( "%d\n", count );
    return EXIT_SUCCESS;
}

My use case is a Linux SBC (Banana Pi) counting files in a directory on a FAT32 USB stick. In a shell, doing

ls -U {dir} | wc -l

takes 6.4 secs with 32k files in there (32k = max files/dir on FAT32). From Python, doing

import os, time
t = time.time(); print(len(os.listdir(d))); print(time.time() - t)

takes only 0.874 secs(!). I can't see anything else in Python being quicker than that.
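A measurement like this can be reproduced a bit more carefully with time.perf_counter, which is a monotonic clock better suited for timing than time.time. A sketch, with hypothetical helper names and a generated test directory standing in for the USB stick:

```python
import os
import time

def make_empty_files(path, n):
    """Create n empty files named f0..f(n-1) in path (test fixture)."""
    for i in range(n):
        open(os.path.join(path, f"f{i}"), "w").close()

def time_listdir_count(path):
    """Return (entry_count, elapsed_seconds) for one os.listdir pass."""
    start = time.perf_counter()
    count = len(os.listdir(path))
    return count, time.perf_counter() - start
```

For example, creating files in a temporary directory and calling time_listdir_count on it gives both the count and the wall-clock cost of a single listing pass.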

I'm not sure about speed, but if you want to use just shell builtins, this should work:

#!/bin/sh
COUNT=0
for file in /path/to/directory/*
do
    COUNT=$((COUNT+1))
done
echo $COUNT

A shorter way of counting files in a directory in bash:

files=(*); echo ${#files[@]}

I generated 10_000 empty files in tmpfs; it takes 0.03s on my machine to count them. Running ls | wc -l was just slightly slower (I flushed the cache before and in between, just in case).

