Why does loading filenames from a directory take so long?

Question

I need to load 1460 files into a list, from a folder with 163,360 files.

I use the following Python code to do this:

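import os
import glob

# Raw string avoids having to escape the backslashes in a Windows path
Directory = r'C:\Users\Nicolai\Desktop\sealev\dkss_all'
stationName = '20002'
# Join the pattern to Directory so glob searches that folder, not the current working directory
filenames = glob.glob(os.path.join(Directory, "dkss." + stationName + "*"))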

This has been running fine so far, but today when I booted my machine and ran the code, it got stuck on the last line. I tried rebooting, which didn't help. In the end I just let it run, went to lunch, and when I came back it had finished. It took 45 minutes. Now when I run it, it takes less than a second. What is going on? Is this a cache thing? How can I avoid having to wait 45 minutes again? Any explanations would be much appreciated.

Answer 1

Yes, it is a caching thing. Your hard disk is a slow peripheral, and reading 163,360 filenames from it can take some time. Your operating system caches that kind of information for you. Python has to wait for that information to be loaded before it can filter out the matching filenames.

You won't have to wait that long again until your operating system decides to use the memory holding the cached directory information for something else, or until you restart the computer. Since you rebooted your computer, the information was no longer cached.
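
One way to check whether caching explains the difference is to time two listings of the same directory back to back. The following is only a minimal sketch: the helper name time_listing is made up for illustration, and '.' is a placeholder for the large directory. If the first call is much slower than the second, the directory entries were most likely not cached before the first call.

```python
import os
import time


def time_listing(directory):
    """Return (elapsed_seconds, entry_count) for one full listing of directory.

    time_listing is a hypothetical helper, not part of any library.
    """
    start = time.perf_counter()
    names = os.listdir(directory)  # reads every directory entry
    elapsed = time.perf_counter() - start
    return elapsed, len(names)


if __name__ == "__main__":
    # '.' is a placeholder; point this at the directory with the many files.
    first, count = time_listing(".")
    second, _ = time_listing(".")
    print("entries: %d" % count)
    print("first listing:  %.4f s" % first)
    print("second listing: %.4f s" % second)
```

Note that the first number only reflects a cold cache if nothing else has listed the directory since boot.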

Answer 2

Presuming that ls on that same directory is just as slow, you can't reduce the total time needed for the directory listing operation. Filesystems are sometimes slow (which is why, yes, the operating system does cache directory entries).

However, there actually is something you can do in your Python code: you can operate on filenames as they come in, rather than waiting for the entire result to finish before the rest of your code even starts. Unfortunately, before Python 3.5 this functionality is not present in the standard library, meaning you need to call C functions.

See Ben Hoyt's scandir module for an implementation of this. See also this StackOverflow question, describing the problem.

Using scandir might look something like the following:

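from scandir import scandir  # Ben Hoyt's third-party scandir module
stationName = '20002'  # station ID from the question
prefix = 'dkss.%s.' % stationName
# Entries are yielded one at a time, so work can start before the listing finishes
for direntry in scandir(path='.'):
    if direntry.name.startswith(prefix):
        pass  # do whatever work you want with this file here.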
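
On Python 3.5 and later, the same idea is available in the standard library as os.scandir, so no third-party module is needed. The sketch below assumes Python 3.6+ (for the with-statement form); the generator name iter_matching is made up for illustration. It streams matching names as the directory is read. This does not shorten a cold-cache listing, it only lets the rest of the code start earlier.

```python
import os


def iter_matching(directory, prefix):
    """Lazily yield names in directory that start with prefix.

    iter_matching is a hypothetical helper, not part of any library.
    """
    # os.scandir returns an iterator of DirEntry objects; the with-block
    # closes the underlying directory handle when iteration stops.
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.name.startswith(prefix):
                yield entry.name


if __name__ == "__main__":
    station_name = "20002"  # station ID from the question
    # '.' is a placeholder for the directory holding the dkss files.
    for name in iter_matching(".", "dkss." + station_name):
        print(name)  # each name is available as soon as it is read
```

Wrapping the call in list(...) gives the same kind of list that glob.glob builds, but with bare filenames rather than paths.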

2 answers

Answer 1
3 2015-03-11 12:05:55

Answer 2
2 accepted 2015-03-11 12:22:46
