
Returning Dictionary<FileHash, string[]> from Linq Query

Thanks in advance for any assistance. I'm not even sure if this is possible, but I'm trying to find duplicate files by computing a hash for each file and then listing the files that share each hash.

This is what I have so far:

Dictionary<FileHash, string[]> FindDuplicateFiles(string searchFolder)
{
    Directory.GetFiles(searchFolder, "*.*")
        .Select(f => new
        {
            FileName = f,
            FileHash = Encoding.UTF8.GetString(
                new SHA1Managed().ComputeHash(
                    new FileStream(f, FileMode.OpenOrCreate, FileAccess.Read)))
        })
        .GroupBy(f => f.FileHash)
        .Select(g => new
        {
            FileHash = g.Key,
            Files = g.Select(z => z.FileName).ToList()
        })
        .GroupBy(f => f.FileHash)
        .Select(g => new {FileHash = g.Key, Files = g.Select(z => z.Files).ToArray()});
}

It compiles fine, but I'm curious whether there is a way to manipulate the results so that the method returns a Dictionary.

Any suggestions, alternatives, or critiques would be greatly appreciated.

Create an extension method on IEnumerable<_> called toDictionary that converts a sequence of key-value pairs to a dictionary. It might throw an exception on duplicate keys.
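A minimal sketch of what such an extension method could look like. The class name PairSequenceExtensions and the method name ToDictionaryFromPairs are hypothetical, chosen here so they don't clash with the built-in Enumerable.ToDictionary; the duplicate-key exception comes from Dictionary.Add.

```csharp
using System;
using System.Collections.Generic;

public static class PairSequenceExtensions
{
    // Hypothetical helper: collects key-value pairs into a dictionary.
    public static Dictionary<TKey, TValue> ToDictionaryFromPairs<TKey, TValue>(
        this IEnumerable<KeyValuePair<TKey, TValue>> pairs)
    {
        var result = new Dictionary<TKey, TValue>();
        foreach (KeyValuePair<TKey, TValue> pair in pairs)
        {
            // Dictionary.Add throws ArgumentException if the key is already present.
            result.Add(pair.Key, pair.Value);
        }
        return result;
    }
}
```

If silently keeping the last value for a repeated key is preferable, replacing the Add call with the indexer (result[pair.Key] = pair.Value) would do that instead of throwing.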

Why do you need the second GroupBy?

You can use Enumerable.ToDictionary to collect a LINQ query into a dictionary:

var sha1 = new SHA1Managed();

Dictionary<string, string[]> result =
    Directory
        .EnumerateFiles(searchFolder)
        .GroupBy(file => Convert.ToBase64String(sha1.ComputeHash(...)))
        .ToDictionary(g => g.Key, g => g.ToArray());

Some remarks:

  • Don't assume that a random byte sequence (such as a SHA-1 hash) is a valid UTF-8 string.
  • Consider using Directory.EnumerateFiles instead of Directory.GetFiles.
  • Don't forget to close the FileStream after computing the SHA-1 hash.
  • As far as I know, a SHA1Managed instance can be reused, so you don't need to create a new one for each file.
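Putting the four remarks together, the whole method might look roughly like the sketch below. Several things here are assumptions rather than part of the answer: the class name DuplicateFinder and helper name HashFile are made up, the key is a plain string because the question never shows the FileHash type, SHA1.Create() stands in for new SHA1Managed(), and the Where filter (keeping only hashes shared by more than one file) is one reading of "duplicate files".

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class DuplicateFinder
{
    // Hashes one file and closes its stream before returning.
    private static string HashFile(HashAlgorithm algorithm, string path)
    {
        using (FileStream stream = File.OpenRead(path))
        {
            return Convert.ToBase64String(algorithm.ComputeHash(stream));
        }
    }

    public static Dictionary<string, string[]> FindDuplicateFiles(string searchFolder)
    {
        // One hash object is reused sequentially; it should not be shared across threads.
        using (SHA1 sha1 = SHA1.Create())
        {
            return Directory
                .EnumerateFiles(searchFolder)
                .GroupBy(file => HashFile(sha1, file))
                .Where(group => group.Count() > 1) // assumption: only report real duplicates
                .ToDictionary(group => group.Key, group => group.ToArray());
        }
    }
}
```

ToDictionary runs the query eagerly, so every file has been hashed before the SHA1 object is disposed.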

There's already an extension method which will do this. Just stick this at the end of your existing query:

.ToDictionary(x => x.FileHash, x => x.Files);

However: using Encoding.UTF8.GetString to convert arbitrary binary data into a string is a really bad idea. Use Convert.ToBase64String instead. The hash is not a UTF-8 encoded string, so don't treat it as one.
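To illustrate why this matters for grouping, here is a small demonstration (the class name HashEncodingDemo and the sample bytes are made up for illustration). Bytes that are not valid UTF-8 are replaced during decoding, so two different inputs can turn into the same string, while Base64 keeps them distinct.

```csharp
using System;
using System.Text;

public static class HashEncodingDemo
{
    public static void Main()
    {
        byte[] first = { 0xFF };
        byte[] second = { 0xFE };

        // Neither byte is valid UTF-8, so each one decodes to the replacement character U+FFFD.
        string firstAsUtf8 = Encoding.UTF8.GetString(first);
        string secondAsUtf8 = Encoding.UTF8.GetString(second);
        Console.WriteLine(firstAsUtf8 == secondAsUtf8);    // True: two different inputs collide

        // Base64 is reversible, so different bytes always give different strings.
        Console.WriteLine(Convert.ToBase64String(first));  // /w==
        Console.WriteLine(Convert.ToBase64String(second)); // /g==
    }
}
```

With real hashes the effect is the same in kind: files with different hashes could end up under one key if the hash bytes are decoded as UTF-8.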

You're also grouping by hash twice, which I suspect isn't really what you want to do.

Alternatively, remove the previous GroupBy calls and use a Lookup instead:

var query = Directory.GetFiles(searchFolder, "*.*")
                     .Select(f => new {
                         FileName = f,
                         FileHash = Convert.ToBase64String(
                             new SHA1Managed().ComputeHash(...))
                        })
                     .ToLookup(x => x.FileHash, x => x.FileName);

That will give you a Lookup<string, string>, which is basically the files grouped by hash.

One further thing to note: I suspect you'll be leaving file streams open with this method. I suggest you write a small separate method to compute the hash of a file based on its name, making sure you close the stream (with a using statement in the normal way). This will also end up making your query simpler, something along the lines of:

var query = Directory.GetFiles(searchFolder)
                     .ToLookup(x => ComputeHash(x));

It's hard to simplify it much further than that :)
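For completeness, a sketch of what the ComputeHash helper used above might look like. The answer only names the method; the body, the enclosing class name FileHashing, the GroupFilesByHash wrapper, and the use of SHA1.Create() in place of SHA1Managed are all assumptions.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class FileHashing
{
    // Returns the Base64-encoded SHA-1 hash of the file with the given name.
    // Both using statements guarantee the hash object and the stream are disposed.
    public static string ComputeHash(string fileName)
    {
        using (SHA1 sha1 = SHA1.Create())
        using (FileStream stream = File.OpenRead(fileName))
        {
            return Convert.ToBase64String(sha1.ComputeHash(stream));
        }
    }

    // Hypothetical wrapper around the query shown above.
    public static ILookup<string, string> GroupFilesByHash(string searchFolder)
    {
        return Directory.GetFiles(searchFolder)
                        .ToLookup(fileName => ComputeHash(fileName));
    }
}
```

The lookup contains every file, including those with a unique hash; enumerating it with something like lookup.Where(g => g.Count() > 1) would leave only the groups that are actual duplicates.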
