Fastest way to delete files that are not in a data table?

I need to write C# code that selects a list of file names from a data table and deletes every file in a folder that is not in this list.

One possibility would be to have both lists ordered by name and then loop through the table results. For each result, I would loop through the files and delete them until I find a file that matches the current result or is alphabetically greater, then move to the next result without resetting the current file index.

I haven't tried to implement this, but it seems to me that it would be O(n), since each list would be looped through just once (ignoring the cost of sorting both lists). The only thing I'm not sure about is whether I can be 100% sure that the file system and the database engine sort in exactly the same way (for example, whether both consider "_" smaller than "-"). If not, the algorithm above wouldn't work at all. (By the way, this is a Jet Engine database.)
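The merge walk described above can be sketched as follows. To sidestep the sort-order worry, this sketch sorts both lists in memory with the same comparer instead of trusting the order returned by the file system or by Jet. The class and method names are hypothetical, and the case-insensitive ordinal comparison is an assumption that suits Windows-style file names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SortedMergeSketch
{
    // Returns the names in folderNames that do not occur in tableNames.
    // Both inputs are sorted here with one comparer, so the result does not
    // depend on how the file system or the database orders strings.
    public static List<string> FindOrphans(
        IEnumerable<string> folderNames, IEnumerable<string> tableNames)
    {
        // Assumption: file names are compared without regard to case.
        StringComparer comparer = StringComparer.OrdinalIgnoreCase;
        string[] files = folderNames.OrderBy(n => n, comparer).ToArray();
        string[] keep = tableNames.OrderBy(n => n, comparer).ToArray();

        var orphans = new List<string>();
        int k = 0; // index into keep; it only moves forward
        foreach (string file in files)
        {
            // Skip table entries that sort before the current file.
            while (k < keep.Length && comparer.Compare(keep[k], file) < 0)
            {
                k++;
            }
            // No equal table entry at this position: the file is an orphan.
            if (k == keep.Length || comparer.Compare(keep[k], file) != 0)
            {
                orphans.Add(file);
            }
        }
        return orphans;
    }
}
```

The caller would then pass each returned name to File.Delete. The walk itself is linear, but the two in-memory sorts make the whole thing O(n log n).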

But since this is probably not an uncommon problem, you might already know a better solution. I tried searching the web but couldn't find anything. Perhaps a more effective solution would be to put each list into a HashSet and find their difference.

  1. Get the folder contents into folderFiles (IEnumerable<FileInfo>).
  2. Get the full paths of the files you want to keep into filesToKeep (IEnumerable<string>).
  3. Get a list of the "not in list" files.
  4. Delete these files.

Code sample:

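// Requires: using System.Collections.Generic; using System.IO; using System.Linq;
IEnumerable<FileInfo> folderFiles = new List<FileInfo>(); // Fill with the files in the folder.
IEnumerable<string> filesToKeep = new List<string>();     // Fill with the full paths to keep.

// Except yields the full paths that are in the folder but not in filesToKeep.
foreach (string fileToDelete in folderFiles.Select(fi => fi.FullName).Except(filesToKeep))
{
    File.Delete(fileToDelete);
}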
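The two "fill" steps could look something like the sketch below. It assumes the file names sit in a single string column of a DataTable and that the table stores bare file names rather than full paths; the class, the method names, and the column name used in the usage note are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.IO;
using System.Linq;

public static class ListFillingSketch
{
    // All files directly inside folderPath (no subfolders).
    public static IEnumerable<FileInfo> GetFolderFiles(string folderPath)
    {
        return new DirectoryInfo(folderPath).EnumerateFiles();
    }

    // Full paths built from one column of the table, so that they can be
    // compared with FileInfo.FullName. Null cells are skipped.
    public static List<string> GetFilesToKeep(
        DataTable table, string columnName, string folderPath)
    {
        string root = Path.GetFullPath(folderPath);
        return table.Rows.Cast<DataRow>()
            .Where(row => !row.IsNull(columnName))
            .Select(row => Path.Combine(root, Convert.ToString(row[columnName])))
            .ToList();
    }
}
```

With a hypothetical column called "FileName", the lists would be filled with GetFolderFiles(folder) and GetFilesToKeep(table, "FileName", folder). Note that Except compares strings ordinally and case-sensitively by default; passing StringComparer.OrdinalIgnoreCase as its second argument may be closer to how a Windows folder treats names.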

Here is my suggestion. It assumes that filesInDatabase contains the list of files that are in the database and that pathOfDirectory contains the path of the directory holding the files to compare. Directory.EnumerateFiles returns paths that include pathOfDirectory, so the entries in filesInDatabase must be in the same form.

// Delete every file in the directory whose path is not listed in filesInDatabase.
foreach (var fileToDelete in Directory.EnumerateFiles(pathOfDirectory).Where(item => !filesInDatabase.Contains(item)))
{
    File.Delete(fileToDelete);
}

EDIT:

This requires a using System.Linq; directive, because it uses LINQ.

I think hashing is the way to go, but you don't really need two HashSets. Only one HashSet is needed, to store the standardized file names from the data table; the other container can be any collection type.
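A minimal sketch of the single-HashSet approach is shown below. "Standardized" is interpreted here as trimmed and compared without regard to case, which is an assumption, as is the idea that the table stores bare file names. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class HashSetCleanupSketch
{
    // Deletes every file directly inside folderPath whose name is not in
    // namesToKeep, and returns the paths that were deleted.
    public static List<string> DeleteOrphans(
        string folderPath, IEnumerable<string> namesToKeep)
    {
        // Assumption: names are bare file names, compared case-insensitively.
        var keep = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (string name in namesToKeep)
        {
            keep.Add(name.Trim());
        }

        var deleted = new List<string>();
        // GetFiles returns a complete array, so the folder is not being
        // enumerated while files are deleted from it.
        foreach (string path in Directory.GetFiles(folderPath))
        {
            if (!keep.Contains(Path.GetFileName(path)))
            {
                File.Delete(path);
                deleted.Add(path);
            }
        }
        return deleted;
    }
}
```

Building the set is linear in the number of table rows and each lookup is O(1) on average, so the whole pass is roughly linear in the number of rows plus the number of files, with no sorting involved.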

First off, .NET allows you to specify the culture used for sorting, but I'm not all that familiar with the mechanism, so I'll leave it to a web search to give pointers on the subject.
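As a small illustration of choosing the sort rule explicitly, the sketch below sorts with StringComparer.Ordinal, which compares raw UTF-16 code units and ignores culture entirely. The class and method names are hypothetical.

```csharp
using System;

public static class OrderingSketch
{
    // Returns a copy of names sorted by UTF-16 code unit value.
    public static string[] SortOrdinal(string[] names)
    {
        var copy = (string[])names.Clone();
        Array.Sort(copy, StringComparer.Ordinal);
        return copy;
    }
}
```

Under ordinal rules "-" (U+002D) sorts before "_" (U+005F), and uppercase ASCII letters sort before lowercase ones. A culture-aware comparer such as StringComparer.CurrentCulture may order the same names differently, which is why both lists should be sorted by the same comparer in your own code rather than by the file system and the database separately.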

Second, to avoid the whole culture mess, you can use a different algorithm built on an idea similar to radix sort (only without the sort). Its time complexity is O(n * length_longest_file_name). File name lengths are limited (as far as I know, almost no file system allows a file name longer than 256 characters), so I'm assuming that n is dramatically larger than the file name lengths. If n is smaller than the maximum file name length, just use an O(n^2) method and avoid the work (iterating over lists this small is near instant anyway). Note: this method does not require sorting.

The idea is to create an array of the symbols that can be used as file name chars (about 60-70 chars, if this is a case-sensitive search), and a second array with a flag for each char in the first array. Now loop over the character positions of the file names in the list from the DB (from 1 to length_longest_file_name). In each iteration i, go over the i-th char of each file name in the DB list and set the flag for every char you see. Once the flags for that position are set, go over the second list and delete every file for which the i-th char of its name is not flagged.

The implementation might be complex, and the overhead of the two arrays might make it slower when n is small, but you can optimize it (for instance, avoid iterating over files whose names are shorter than the current i by removing them from both lists).
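A rough sketch of the per-position flags is given below. It swaps the two parallel arrays for one HashSet<char> per position, which is a simplification of my own, and all names are hypothetical. One caveat worth stating loudly: the flags can only rule a name out. With "ab" and "cd" in the table, the name "ad" passes every position without being in the table, so a file that is not ruled out still needs an exact lookup before it is treated as safe to keep.

```csharp
using System.Collections.Generic;

public sealed class PositionalFlags
{
    // flags[i] holds every character seen at position i of a table name.
    private readonly List<HashSet<char>> flags = new List<HashSet<char>>();

    public PositionalFlags(IEnumerable<string> tableNames)
    {
        foreach (string name in tableNames)
        {
            for (int i = 0; i < name.Length; i++)
            {
                // Grow the list the first time position i is reached.
                if (i == flags.Count)
                {
                    flags.Add(new HashSet<char>());
                }
                flags[i].Add(name[i]);
            }
        }
    }

    // False means the name cannot be in the table, so the file can be
    // deleted. True only means "not ruled out"; it is not proof of membership.
    public bool MightBeInTable(string fileName)
    {
        // Longer than every table name: cannot match any of them.
        if (fileName.Length > flags.Count)
        {
            return false;
        }
        for (int i = 0; i < fileName.Length; i++)
        {
            if (!flags[i].Contains(fileName[i]))
            {
                return false;
            }
        }
        return true;
    }
}
```

In practice this works as a cheap pre-filter: files for which MightBeInTable returns false are deleted straight away, and the remaining ones are checked against the table names exactly.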

Hope this helps.

I have another idea that might be faster.

var filesToDelete = new List<string>(Directory.GetFiles(directoryPath));
foreach (var databaseFile in databaseFileList)
{
    filesToDelete.Remove(databaseFile);
}
foreach (var fileToDelete in filesToDelete)
{
    File.Delete(fileToDelete);
}

Explanation: First, get all the files contained in the directory. Then remove from that list every file that is in the database. Finally, delete all the files that remain in the list filesToDelete. Directory.GetFiles returns paths that include directoryPath, so the entries in databaseFileList must be in the same form for Remove to find them.
