Get a file's SHA256 hash code and checksum

Previously I asked a question about combining SHA1+MD5, but I have since learned that calculating SHA1 and then MD5 of a large file is not much faster than SHA256. In my case, a 4.6 GB file takes about 10 minutes with the default SHA256 implementation in C# (Mono) on a Linux system.

public static string GetChecksum(string file)
{
    using (FileStream stream = File.OpenRead(file))
    {
        var sha = new SHA256Managed();
        byte[] checksum = sha.ComputeHash(stream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}

Then I read this topic and changed my code according to what it said:

public static string GetChecksumBuffered(Stream stream)
{
    using (var bufferedStream = new BufferedStream(stream, 1024 * 32))
    {
        var sha = new SHA256Managed();
        byte[] checksum = sha.ComputeHash(bufferedStream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}

But it has little effect: it still takes about 9 minutes.

Then I tested the same file with the sha256sum command in Linux. It takes about 28 seconds, and both the code above and the Linux command give the same result!

Someone advised me to read about the differences between a hash code and a checksum, and I found this topic, which explains them.

My questions are:

  1. What causes such a large difference in time between the above code and Linux sha256sum?

  2. What does the above code do? (I mean, is it a hash code calculation or a checksum calculation? If you search for how to get the hash code of a file and the checksum of a file in C#, both searches lead to the above code.)

  3. Is there any motivated attack against sha256sum even though SHA256 is collision resistant?

  4. How can I make my C# implementation as fast as sha256sum?

public string SHA256CheckSum(string filePath)
{
    using (SHA256 SHA256 = SHA256Managed.Create())
    {
        using (FileStream fileStream = File.OpenRead(filePath))
            return Convert.ToBase64String(SHA256.ComputeHash(fileStream));
    }
}
  1. My best guess is that there's some additional buffering in the Mono implementation of the File.Read operation. Having recently looked into checksums on a large file, I'd expect roughly 6 seconds per GB on a decently specced Windows machine if all is running smoothly.

    Oddly, more than one benchmark has reported that SHA-512 is noticeably quicker than SHA-256 (see 3 below). One other possibility is that the problem is not in allocating the data, but in disposing of the bytes once read. You may be able to use TransformBlock (and TransformFinalBlock) on a single array rather than reading the stream in one big gulp. I have no idea if this will work, but it bears investigating.

  2. The difference between a hash code and a checksum is (nearly) semantics. Both calculate a shorter 'magic' number that is fairly unique to the input data, though with 4.6 GB of input and 64 bytes of output, 'fairly' is somewhat limited.

    • A checksum is not secure: with a bit of work you can figure out the input from enough outputs, work backwards from output to input, and do all sorts of insecure things.
    • A cryptographic hash takes longer to calculate, but changing just one bit of the input radically changes the output, and for a good hash (e.g. SHA-512) there's no known way of getting from the output back to the input.
  3. MD5 is broken: two different inputs that produce the same hash (a collision) can be fabricated on an ordinary PC. SHA-256 is (probably) still secure, but might not be in a few years' time; if your project has a lifespan measured in decades, assume you'll need to change it. SHA-512 has no known attacks and probably won't for quite a while, and since it's quicker than SHA-256 I'd recommend it anyway. Benchmarks show it takes about 3 times longer to calculate SHA-512 than MD5, so if your speed issue can be dealt with, it's the way to go.

  4. No idea, beyond those mentioned above. You're doing it right.
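The TransformBlock idea from point 1 can be sketched as follows. This is a minimal sketch, not a measured fix: the class name ChunkedHasher, the method name Sha256Hex and the 1 MB default buffer are assumptions made for illustration, and whether it narrows the gap to sha256sum on Mono would need to be benchmarked.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class ChunkedHasher
{
    // Hypothetical helper: hashes a stream chunk by chunk through one reusable buffer.
    public static string Sha256Hex(Stream stream, int bufferSize = 1024 * 1024)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] buffer = new byte[bufferSize];
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Feed the chunk into the running hash state.
                // A null output buffer means the input is not copied anywhere.
                sha.TransformBlock(buffer, 0, read, null, 0);
            }

            // An empty final block finishes the computation and fills sha.Hash.
            sha.TransformFinalBlock(buffer, 0, 0);
            return BitConverter.ToString(sha.Hash).Replace("-", string.Empty);
        }
    }
}
```

Calling ChunkedHasher.Sha256Hex(File.OpenRead(path)) should give the same uppercase hex string as the GetChecksum method in the question, since only the read pattern changes, not the algorithm.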

For a bit of light reading, see Crypto.SE: SHA512 is faster than SHA256?

Edit in response to a question in the comments

The purpose of a checksum is to let you check whether a file has changed between the time you originally wrote it and the time you come to use it. It does this by producing a small value (512 bits in the case of SHA512) where every bit of the original file contributes at least something to the output value. The purpose of a hash code is the same, with the addition that it is really, really difficult for anyone else to get the same output value by making carefully managed changes to the file.

The premise is that if the checksums are the same at the start and when you check, then the files are the same, and if they're different the file has certainly changed. What you are doing above is feeding the file, in its entirety, through an algorithm that rolls, folds and spindles the bits it reads to produce the small value.
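To make the distinction concrete, here is a small sketch that contrasts a toy checksum with SHA-256. The additive checksum is an assumption chosen for illustration only (real checksums such as CRC32 are stronger), and the class and method names are hypothetical.

```csharp
using System;
using System.Security.Cryptography;

public static class ChecksumVsHash
{
    // Toy checksum for illustration only: the sum of all bytes, wrapping at 2^32.
    public static uint AdditiveChecksum(byte[] data)
    {
        uint sum = 0;
        foreach (byte b in data)
        {
            unchecked { sum += b; }
        }
        return sum;
    }

    // Cryptographic hash of the same data, as an uppercase hex string.
    public static string Sha256Hex(byte[] data)
    {
        using (SHA256 sha = SHA256.Create())
        {
            return BitConverter.ToString(sha.ComputeHash(data)).Replace("-", string.Empty);
        }
    }
}
```

Hashing the ASCII bytes of "pay 100 to bob" and "pay 001 to bob" gives the same additive checksum, because the two messages contain the same bytes in a different order, while the two SHA-256 values differ. That is the "carefully managed change" a checksum cannot catch and a cryptographic hash is designed to resist.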

As an example: in the application I'm currently writing, I need to know if parts of a file of any size have changed. I split the file into 16K blocks, take the SHA-512 hash of each block, and store it in a separate database on another drive. When I come to see if the file has changed, I recompute the hash for each block and compare it to the original. Since I'm using SHA-512, the chances of a changed file having the same hash are unimaginably small, so I can be confident of detecting changes in 100s of GB of data whilst only storing a few MB of hashes in my database. I'm copying the file at the same time as taking the hash, and the process is entirely disk-bound; it takes about 5 minutes to transfer a file to a USB drive, of which 10 seconds is probably related to hashing.
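The per-block scheme described above might look roughly like this. It is a sketch under assumptions, not the application's actual code: the names BlockHasher, HashBlocks and ChangedBlocks are hypothetical, the hashes are kept in memory rather than in a database, and the file copy is left out.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

public static class BlockHasher
{
    public const int BlockSize = 16 * 1024; // 16K blocks, as in the description above

    // Returns one SHA-512 hash per block; the last block may be shorter.
    public static List<byte[]> HashBlocks(Stream stream)
    {
        var hashes = new List<byte[]>();
        byte[] block = new byte[BlockSize];
        using (SHA512 sha = SHA512.Create())
        {
            int filled;
            while ((filled = FillBlock(stream, block)) > 0)
            {
                hashes.Add(sha.ComputeHash(block, 0, filled));
            }
        }
        return hashes;
    }

    // Stream.Read may return fewer bytes than requested, so loop until full or EOF.
    private static int FillBlock(Stream stream, byte[] block)
    {
        int total = 0;
        while (total < block.Length)
        {
            int read = stream.Read(block, total, block.Length - total);
            if (read == 0) break;
            total += read;
        }
        return total;
    }

    // Indices of blocks whose hash differs, or that exist on only one side.
    public static List<int> ChangedBlocks(List<byte[]> oldHashes, List<byte[]> newHashes)
    {
        var changed = new List<int>();
        int count = Math.Max(oldHashes.Count, newHashes.Count);
        for (int i = 0; i < count; i++)
        {
            bool same = i < oldHashes.Count && i < newHashes.Count
                && Convert.ToBase64String(oldHashes[i]) == Convert.ToBase64String(newHashes[i]);
            if (!same) changed.Add(i);
        }
        return changed;
    }
}
```

Each SHA-512 hash is 64 bytes per 16 KB block, so the stored hashes come to about 1/256 of the data size.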

Lack of disk space to store hashes is a problem I can't solve in a post. Buy a USB stick?

Way late to the party, but seeing as none of the answers mentioned it, I wanted to point out:

SHA256Managed is an implementation of the System.Security.Cryptography.HashAlgorithm class, and all of the functionality related to the read operations is handled in the inherited code.

HashAlgorithm.ComputeHash(Stream) uses a fixed 4096-byte buffer to read data from a stream. As a result, you're not really going to see much difference using a BufferedStream for this call.

HashAlgorithm.ComputeHash(byte[]) operates on the entire byte array, but it resets the internal state after every call, so it can't be used to incrementally hash a buffered stream.

Your best bet would be to use a third-party implementation that's optimized for your use case.
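If a third-party library is not an option, one built-in way around the state reset is the IncrementalHash class, which keeps its state across AppendData calls. The sketch below assumes a runtime that ships it (.NET Core, or .NET Framework 4.7.1 and later; availability on a given Mono version would need checking). The class name IncrementalSha256 and the 1 MB buffer are illustrative, and no speed-up over ComputeHash(Stream) is claimed without measuring.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class IncrementalSha256
{
    // Hypothetical helper: reads with a caller-chosen buffer instead of the fixed 4096 bytes.
    public static byte[] Hash(Stream stream, int bufferSize = 1024 * 1024)
    {
        using (IncrementalHash hasher = IncrementalHash.CreateHash(HashAlgorithmName.SHA256))
        {
            byte[] buffer = new byte[bufferSize];
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                // State accumulates across calls, unlike ComputeHash(byte[]).
                hasher.AppendData(buffer, 0, read);
            }
            return hasher.GetHashAndReset();
        }
    }
}
```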

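// Needs: using System.IO; using System.Security.Cryptography; using System.Text;
// The method name is illustrative; the original snippet showed only the body.
public static string GetSha256Hex(string filePath)
{
    using (SHA256 sha256 = SHA256.Create())
    using (FileStream fileStream = File.OpenRead(filePath))
    {
        // Two lowercase hex digits per byte, the same format sha256sum prints.
        var result = new StringBuilder();
        foreach (byte b in sha256.ComputeHash(fileStream))
        {
            result.Append(b.ToString("x2"));
        }

        return result.ToString();
    }
}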

For reference: https://www.c-sharpcorner.com/article/how-to-convert-a-byte-array-to-a-string/
