
How problematic is it to read many small files from one directory?

I have to read many (up to 5 million) small (9 KB) files. At the moment they are all in one directory. I fear the lookups will take quadratic time, or even n^2 log n. Is that right? Is this significant (will the lookup take more time than the actual reading)? Does the asymptotic behavior of the running time change when the files are cached by the OS?

I use C++ streams for reading the files. At the moment I'm using Windows 7 with NTFS, but I will later run the program on a Linux cluster (not sure which file system).
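For context, a minimal sketch of reading one of these small files fully into memory with a C++ stream could look like the following; the function name read_small_file is my own illustration, not something from the original post:

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Read one small (~9 KB) file completely into a string via a C++ stream.
std::string read_small_file(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::ostringstream buffer;
    buffer << in.rdbuf();   // single read of the whole file; cheap for ~9 KB
    return buffer.str();
}
```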

It might not be that bad: if you enumerate the files and process each filename as you encounter it, your OS is quite likely to still have the directory entry in its disk cache. And for practical purposes, a disk cache lookup is O(1).
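A minimal sketch of this enumerate-and-read-as-you-go pattern, assuming C++17's std::filesystem is available (the original post does not name a specific enumeration API), could look like this; the directory name "data" is a placeholder:

```cpp
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <sstream>

namespace fs = std::filesystem;

int main()
{
    const fs::path dir = "data";   // placeholder for the directory of small files
    std::size_t total_bytes = 0;

    // Walk the directory once and read each file as soon as it is encountered,
    // while its directory entry is still warm in the OS disk cache.
    for (const auto& entry : fs::directory_iterator(dir)) {
        if (!entry.is_regular_file())
            continue;

        std::ifstream in(entry.path(), std::ios::binary);
        std::ostringstream buffer;
        buffer << in.rdbuf();                // ~9 KB per file, read in one go
        total_bytes += buffer.str().size();  // stand-in for the real processing
    }

    std::cout << "read " << total_bytes << " bytes\n";
}
```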

What will kill you is a mechanical HDD. You'll have 5 million disk seeks, each of which takes roughly 1/100th of a second. That is 5,000,000 × 0.01 s = 50,000 seconds (about 14 hours), more than half a day. This is a task that screams for an SSD.
