在 BINARY 文件（900mb - 4.5gb）中搜索 byte[] 并获取偏移量的最快方法。 C#

Question

基本上我想要一种更快且更有效的方法来搜索二进制文件中的字节数组并获取偏移量。 byte[] 可以包含 5 - 50 个字节。 它将读取 1 次搜索我有一个 function 不能正常工作并且非常慢：

        static long ReadOneSrch(BinaryReader reader, byte[] bytes)
    {
        int b;
        long i = 0;
        while ((b = reader.BaseStream.ReadByte()) != -1)
        {
            if (b == bytes[i++])
            {
                if (i == bytes.Length)
                    return reader.BaseStream.Position - bytes.Length;
            }
            else
                i = b == bytes[0] ? 1 : 0;
        }

        return -1;
    }

Answer 1

这是我在 stream 上使用 Boyer-Moore-Horspool 的实现。 核心 BMH 实现基本上是从Boyer-Moore-Horspool Algorithm for All Matches (Find Byte array inside Byte array)复制而来的。

该方法反复将 stream 读取到缓冲区中，并对缓冲区应用 BMH 算法，直到我们得到匹配。 为了也找到跨越两个这样的读取的匹配，我们总是将最后一个 pattern.Length 字节从前一个读取传输到缓冲区的头部（通过评估可能的匹配开始已经排除了一些努力，可以更聪明地完成 - 但如果图案不太长，您几乎不会注意到差异）。

    /// <summary>
    /// Finds the first occurrence of <paramref name="pattern"/> in a stream
    /// </summary>
    /// <param name="s">The input stream</param>
    /// <param name="pattern">The pattern</param>
    /// <returns>The index of the first occurrence, or -1 if the pattern has not been found</returns>
    public static long IndexOf(Stream s, byte[] pattern)
    {
        // Prepare the bad character array is done once in a separate step
        var badCharacters = MakeBadCharArray(pattern);

        // We now repeatedly read the stream into a buffer and apply the Boyer-Moore-Horspool algorithm on the buffer until we get a match
        var buffer = new byte[Math.Max(2 * pattern.Length, 4096)];
        long offset = 0; // keep track of the offset in the input stream
        while (true)
        {
            int dataLength;
            if (offset == 0)
            {
                // the first time we fill the whole buffer
                dataLength = s.Read(buffer, 0, buffer.Length);
            }
            else
            {
                // Later, copy the last pattern.Length bytes from the previous buffer to the start and fill up from the stream
                // This is important so we can also find matches which are partly in the old buffer
                Array.Copy(buffer, buffer.Length - pattern.Length, buffer, 0, pattern.Length);
                dataLength = s.Read(buffer, pattern.Length, buffer.Length - pattern.Length) + pattern.Length;
            }

            var index = IndexOf(buffer, dataLength, pattern, badCharacters);
            if (index >= 0)
                return offset + index; // found!
            if (dataLength < buffer.Length)
                break;
            offset += dataLength - pattern.Length;
        }

        return -1;
    }

    // --- Boyer-Moore-Horspool algorithm ---
    // (Slightly modified code from
    // https://stackoverflow.com/questions/16252518/boyer-moore-horspool-algorithm-for-all-matches-find-byte-array-inside-byte-arra)
    // Prepare the bad character array is done once in a separate step:
    private static int[] MakeBadCharArray(byte[] pattern)
    {
        var badCharacters = new int[256];

        for (long i = 0; i < 256; ++i)
            badCharacters[i] = pattern.Length;

        for (var i = 0; i < pattern.Length - 1; ++i)
            badCharacters[pattern[i]] = pattern.Length - 1 - i;

        return badCharacters;
    }

    // Core of the BMH algorithm
    private static int IndexOf(byte[] value, int valueLength, byte[] pattern, int[] badCharacters)
    {
        int index = 0;

        while (index <= valueLength - pattern.Length)
        {
            for (var i = pattern.Length - 1; value[index + i] == pattern[i]; --i)
            {
                if (i == 0)
                    return index;
            }

            index += badCharacters[value[index + pattern.Length - 1]];
        }

        return -1;
    }

在 BINARY 文件（900mb - 4.5gb）中搜索 byte[] 并获取偏移量的最快方法。 C#

问题描述

1 个解决方案

解决方案1
1 已采纳 2020-05-22 04:40:21

在 BINARY 文件（900mb - 4.5gb）中搜索 byte[] 并获取偏移量的最快方法。 C#

问题描述

1 个解决方案

解决方案1 1 已采纳 2020-05-22 04:40:21

解决方案1
1 已采纳 2020-05-22 04:40:21