The fastest method of searching a BINARY file (900 MB - 4.5 GB) for byte[] and grabbing the offset. C#
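Basically I want a faster and more efficient way to search a binary file for a byte array and get its offset. The byte[] can contain 5 to 50 bytes, and only one search is run per read. I have a function that does not work properly and is very slow: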
static long ReadOneSrch(BinaryReader reader, byte[] bytes)
{
    int b;
    long i = 0;
    while ((b = reader.BaseStream.ReadByte()) != -1)
    {
        if (b == bytes[i++])
        {
            if (i == bytes.Length)
                return reader.BaseStream.Position - bytes.Length;
        }
        else
            i = b == bytes[0] ? 1 : 0;
    }
    return -1;
}
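One likely reason the function above misses matches: when a mismatch follows a partial match, `i` is reset to 0 or 1 without re-examining the bytes already consumed. The sketch below reproduces this on an in-memory stream. The class name `NaiveScanDemo` and the method `MissedMatch` are made up for illustration; the loop is copied from the function above, with a `Stream` in place of the `BinaryReader` and an `int` counter.

```csharp
using System.IO;

// Hypothetical demo class, not part of the original post.
public static class NaiveScanDemo
{
    // Same loop as ReadOneSrch above, taking the Stream directly.
    public static long ReadOneSrch(Stream stream, byte[] bytes)
    {
        int b;
        int i = 0;
        while ((b = stream.ReadByte()) != -1)
        {
            if (b == bytes[i++])
            {
                if (i == bytes.Length)
                    return stream.Position - bytes.Length;
            }
            else
                i = b == bytes[0] ? 1 : 0; // forgets the earlier partial match
        }
        return -1;
    }

    // The pattern {1, 1, 2} starts at offset 1 of {1, 1, 1, 2}, but the scan misses it:
    // after the third byte mismatches, i becomes 1, so the final 2 is compared with bytes[1].
    public static long MissedMatch() =>
        ReadOneSrch(new MemoryStream(new byte[] { 1, 1, 1, 2 }), new byte[] { 1, 1, 2 });
}
```

`NaiveScanDemo.MissedMatch()` returns -1 even though the pattern is present at offset 1. A correct search has to fall back to the longest prefix that still matches, or compare whole windows as the answer below does.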
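Here is my implementation of Boyer-Moore-Horspool on a stream. The core BMH implementation is essentially copied from Boyer-Moore-Horspool Algorithm for All Matches (Find Byte array inside Byte array).
The method repeatedly reads the stream into a buffer and applies the BMH algorithm to the buffer until we get a match. To also find matches that span two such reads, we always carry the last pattern.Length bytes of the previous read over to the head of the buffer (this could be done more cleverly by working out which possible match starts have already been ruled out, but if the pattern is not too long you will hardly notice the difference).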
/// <summary>
/// Finds the first occurrence of <paramref name="pattern"/> in a stream
/// </summary>
/// <param name="s">The input stream</param>
/// <param name="pattern">The pattern</param>
/// <returns>The index of the first occurrence, or -1 if the pattern has not been found</returns>
public static long IndexOf(Stream s, byte[] pattern)
{
    // The bad character array is prepared once, in a separate step
    var badCharacters = MakeBadCharArray(pattern);
    // We now repeatedly read the stream into a buffer and apply the Boyer-Moore-Horspool algorithm on the buffer until we get a match
    // Note: this assumes Read fills the requested range unless the end of the stream is reached
    var buffer = new byte[Math.Max(2 * pattern.Length, 4096)];
    long offset = 0; // keep track of the offset in the input stream
    while (true)
    {
        int dataLength;
        if (offset == 0)
        {
            // The first time, we fill the whole buffer
            dataLength = s.Read(buffer, 0, buffer.Length);
        }
        else
        {
            // Later, copy the last pattern.Length bytes from the previous buffer to the start and fill up from the stream
            // This is important so we can also find matches which are partly in the old buffer
            Array.Copy(buffer, buffer.Length - pattern.Length, buffer, 0, pattern.Length);
            dataLength = s.Read(buffer, pattern.Length, buffer.Length - pattern.Length) + pattern.Length;
        }
        var index = IndexOf(buffer, dataLength, pattern, badCharacters);
        if (index >= 0)
            return offset + index; // found!
        if (dataLength < buffer.Length)
            break; // short read: end of stream, no match
        offset += dataLength - pattern.Length;
    }
    return -1;
}
// --- Boyer-Moore-Horspool algorithm ---
// (Slightly modified code from
// https://stackoverflow.com/questions/16252518/boyer-moore-horspool-algorithm-for-all-matches-find-byte-array-inside-byte-arra)
// The bad character array is prepared once, in a separate step:
private static int[] MakeBadCharArray(byte[] pattern)
{
    var badCharacters = new int[256];
    // Default shift: a byte that does not occur in the pattern lets us skip the whole pattern length
    for (var i = 0; i < 256; ++i)
        badCharacters[i] = pattern.Length;
    // For bytes in the pattern (last position excluded), shift by the distance from their last occurrence to the pattern end
    for (var i = 0; i < pattern.Length - 1; ++i)
        badCharacters[pattern[i]] = pattern.Length - 1 - i;
    return badCharacters;
}
// Core of the BMH algorithm
private static int IndexOf(byte[] value, int valueLength, byte[] pattern, int[] badCharacters)
{
    int index = 0;
    while (index <= valueLength - pattern.Length)
    {
        // Compare the pattern against the current window from right to left
        for (var i = pattern.Length - 1; value[index + i] == pattern[i]; --i)
        {
            if (i == 0)
                return index;
        }
        // Shift the window according to the byte under the pattern's last position
        index += badCharacters[value[index + pattern.Length - 1]];
    }
    return -1;
}
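One caveat worth flagging: `Stream.Read` is allowed to return fewer bytes than requested even before the end of the stream, while the stream-level `IndexOf` above treats a short read as the end of the data and assumes the buffer was completely filled before it copies the tail. A `FileStream` on a local file usually fills the buffer, but for other stream types a small helper that keeps reading until the requested count is reached is the safer choice. The following is a minimal sketch; `StreamReadHelper` and `ReadFully` are names invented here and are not part of the original answer.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the original answer.
public static class StreamReadHelper
{
    // Reads until 'count' bytes have been stored in 'buffer' starting at 'offset',
    // or until the stream ends. Returns the number of bytes actually read.
    public static int ReadFully(Stream s, byte[] buffer, int offset, int count)
    {
        int total = 0;
        while (total < count)
        {
            int n = s.Read(buffer, offset + total, count - total);
            if (n == 0)
                break; // end of stream
            total += n;
        }
        return total;
    }
}
```

Swapping the two `s.Read(...)` calls in the stream-level `IndexOf` for `StreamReadHelper.ReadFully(s, ...)` with the same arguments should make `dataLength < buffer.Length` a reliable end-of-stream test.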