
Looking for an efficient way to check for file existence on Windows with files on a SAN

I have a large set of files located across a series of directories on a Windows 2003 server. There are upwards of a million files in each directory. The Windows server uses iSCSI to connect to an EqualLogic SAN.

I have an application that needs to determine if a set of files exists - the application needs to check for the existence of up to a million files per directory.

I have tried a variety of techniques / scripting languages including Perl, VBScript, and DOS batch files, and I cannot obtain more than about 250 file checks per second. At that rate, checking 800,000 files takes about 53 minutes (800,000 / 250 = 3,200 seconds). I tried multithreading a Perl program to check for multiple files at a time, but this did not help.

I have also tried to list all of the files in the directory using dir, ls, and find (via cygwin), and it takes many minutes before any file names are output at all. This isn't a great approach anyway, because there are more files than I actually need to check for.

Is there a way I can force Windows to do a "read ahead" on the directory, and get the files into a cache?

Is there a better way to approach this kind of problem?

I would probably avoid any interpreted language such as VBScript et al. for precisely the reasons you've specified - they are just not going to work as well in a scenario where performance is an issue.

Now, as a formal caveat to my suggestion: I'm assuming that, over the time such an application would be expected to run, the set of prospective files (the search target) remains relatively stable, so that the risk of a false positive presence check caused by file set changes occurring after the scan started is minimal.

It's not elegant, but I would at least suggest exploring a Win32 (not .NET) console-type app that recursively scans the directory tree into a memory-mapped file, then searches that file for your required pattern. That limits the disk access to just the effort required to accumulate the results, and then runs the searching against the presumably (much) faster memory-backed file. Now, I may be underestimating the size and/or complexity of your fileset contents, but that's what I would offer as a starting point.
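A rough sketch of the scan-once-then-search-in-memory idea, written in Python rather than Win32 C purely for brevity. It swaps the memory-mapped file for an ordinary in-memory set, so it illustrates the shape of the approach rather than the exact design suggested above. The names `build_index` and `missing_from` are hypothetical.

```python
import os


def build_index(root):
    """Walk the tree under root once and return a set of relative file paths.

    Paths go through os.path.normcase so lookups are case-insensitive
    on Windows (and unchanged on POSIX systems).
    """
    index = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root)
        for name in filenames:
            rel = name if rel_dir == os.curdir else os.path.join(rel_dir, name)
            index.add(os.path.normcase(rel))
    return index


def missing_from(index, wanted):
    """Return the wanted relative paths that are absent from the index."""
    return [
        p for p in wanted
        if os.path.normcase(os.path.normpath(p)) not in index
    ]
```

Once the index is built, each lookup is an in-memory set membership test, so the disk (or SAN) is only touched during the single walk. As with the caveat above, the result is only as fresh as the moment the walk ran.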

I recommend a Win32 app over a .NET app to avoid the overhead of the framework runtime, but the obvious caveats about a non-managed app apply.

Hope that's helpful, or at least stirs the pot for you a bit. Good luck.

When you check each file individually you're limited by the latency of the request and response. It's doubtful you can find a way to speed that up unless you use asynchronous requests and run many simultaneously, but that approach will put a strain on the file system.
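For anyone who wants to measure whether overlapping the requests helps on their own storage, here is a minimal sketch using a thread pool in Python. The function name `check_many` and the worker count of 32 are arbitrary assumptions, and nothing here guarantees a speedup; the question already reports that a multithreaded Perl attempt did not help.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def check_many(paths, workers=32):
    """Check existence of each path using a pool of threads.

    Returns a dict mapping path -> bool. The worker count is an
    arbitrary starting point, not a tuned value.
    """
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the inputs.
        results = pool.map(os.path.exists, paths)
        return dict(zip(paths, results))
```

Each worker still pays the full round-trip latency per file; the pool only lets several round trips be in flight at once, which is also where the extra load on the file system comes from.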

While getting a full directory listing seems like overkill, it's likely to be the fastest method unless your search list is much smaller (say 100 times smaller) than the full directory.

Each individual check requires the operating system to read through the directory until it finds (or fails to find) the file you're asking for. In other words, each check reads on average more than half of the contents of the directory, so reading the complete directory once will almost certainly be much more efficient.

However, you shouldn't do this by spawning out to another program. Use FindFirstFile/FindNextFile or a .NET equivalent. You can check each file against your list as you find it - you might want to organize your list first, putting it in a B-tree or something similar.
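The single-pass check described above can be sketched in Python, whose `os.scandir` is documented as using FindFirstFileW/FindNextFileW on Windows. This is an illustration under assumptions, not the answer's own code: a hash-based dict stands in for the B-tree, and `find_present` is a hypothetical name.

```python
import os


def find_present(directory, wanted_names):
    """Scan directory once and return the wanted file names found in it."""
    # Map a case-normalized key back to the caller's original spelling.
    remaining = {os.path.normcase(n): n for n in wanted_names}
    found = set()
    with os.scandir(directory) as entries:
        for entry in entries:
            key = os.path.normcase(entry.name)
            if key in remaining and entry.is_file():
                found.add(remaining.pop(key))
                if not remaining:
                    break  # every wanted name has been seen; stop early
    return found
```

The directory is enumerated in-process, with no external program spawned, and each entry costs one dictionary lookup. Any name in `wanted_names` that is not in the returned set does not exist as a file in that directory at the time of the scan.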

You might want to try GetFileInformationByHandleEx with the FileIdBothDirectoryInfo option instead of FindFirstFile/FindNextFile to see which is faster.
