Most efficient way to find pattern in byte array
I have the following code:
var file = //Memory stream with a file in it
var bytes = file.ToArray();
I need to search bytes for the first occurrence (if any) of the specified byte sequence: 0xff, 0xd8. (The purpose of this is to find images embedded in files.)
So if, for example, bytes[6501] contains 0xff and bytes[6502] contains 0xd8, that's a match, and I need either the index of the match returned (6501) or a new array that is a copy of the bytes array without the elements below index 6501.
My current solution is to loop:
for (var index = 0; index < bytes.Length; index++)
{
    if ((new byte[] {0xff, 0xd8}).SequenceEqual(bytes.Skip(index).Take(2)))
    ...
But it's pretty slow when it's handling bigger files.
Is there some more efficient way to handle this?
If this is time-critical code, I found the C# compiler (both Mono's implementation and Microsoft's) to have special logic to optimize simple scan loops.
So from profiling experience, I'd implement a sequence search with a hardcoded first-element search like this:
/// <summary>Looks for the next occurrence of a sequence in a byte array</summary>
/// <param name="array">Array that will be scanned</param>
/// <param name="start">Index in the array at which scanning will begin</param>
/// <param name="sequence">Sequence the array will be scanned for</param>
/// <returns>
///   The index of the next occurrence of the sequence or -1 if not found
/// </returns>
private static int findSequence(byte[] array, int start, byte[] sequence) {
    int end = array.Length - sequence.Length; // past here no match is possible
    byte firstByte = sequence[0]; // cached to tell compiler there's no aliasing
    while (start <= end) {
        // scan for first byte only. compiler-friendly.
        if (array[start] == firstByte) {
            // scan for rest of sequence
            for (int offset = 1;; ++offset) {
                if (offset == sequence.Length) { // full sequence matched?
                    return start;
                } else if (array[start + offset] != sequence[offset]) {
                    break;
                }
            }
        }
        ++start;
    }
    // end of array reached without match
    return -1;
}
Quite a bit longer than other suggestions and prone to off-by-one errors, but if you're scanning a huge chunk of data or doing this for frequent device IO, this setup will avoid feeding the garbage collector and optimize very well.
EDIT 2019-10-03: Fixed issues pointed out by Warren Rox. Thanks!
Tests: https://ideone.com/mmACYj
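As a usage sketch for the question's case, the scan can be combined with Array.Copy to get the tail of the array from the match onward. The function from this answer is repeated here (renamed FindSequence) only so the example compiles on its own; SliceFrom and the sample data are hypothetical names and values made up for illustration.

```csharp
using System;

static class JpegMarkerExample
{
    // Same scan as findSequence above, repeated so this example is self-contained
    internal static int FindSequence(byte[] array, int start, byte[] sequence)
    {
        int end = array.Length - sequence.Length; // past here no match is possible
        byte firstByte = sequence[0];
        while (start <= end)
        {
            if (array[start] == firstByte)
            {
                for (int offset = 1; ; ++offset)
                {
                    if (offset == sequence.Length) return start;
                    if (array[start + offset] != sequence[offset]) break;
                }
            }
            ++start;
        }
        return -1;
    }

    // Hypothetical helper: copies everything from index onward into a new array
    internal static byte[] SliceFrom(byte[] source, int index)
    {
        var result = new byte[source.Length - index];
        Array.Copy(source, index, result, 0, result.Length);
        return result;
    }

    static void Main()
    {
        // Made-up sample data standing in for file.ToArray()
        byte[] bytes = { 0x00, 0xff, 0x10, 0xff, 0xd8, 0xff, 0xe0 };
        byte[] marker = { 0xff, 0xd8 };

        int index = FindSequence(bytes, 0, marker);
        if (index >= 0)
        {
            byte[] image = SliceFrom(bytes, index);
            Console.WriteLine($"Marker at {index}, {image.Length} bytes from there on");
        }
        else
        {
            Console.WriteLine("Marker not found");
        }
    }
}
```

Passing the previous result plus one as start lets the same function find later occurrences without allocating anything per call.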
You want to be using a for loop to check your array. The reason why your code is slow is rather simple.
Decompilation shows why:
public static IEnumerable<TSource> Skip<TSource>(this IEnumerable<TSource> source, int count)
{
    if (source == null)
        throw Error.ArgumentNull("source");
    else
        return Enumerable.SkipIterator<TSource>(source, count);
}
private static IEnumerable<TSource> SkipIterator<TSource>(IEnumerable<TSource> source, int count)
{
    using (IEnumerator<TSource> enumerator = source.GetEnumerator())
    {
        while (count > 0 && enumerator.MoveNext())
            --count;
        if (count <= 0)
        {
            while (enumerator.MoveNext())
                yield return enumerator.Current;
        }
    }
}
On every iteration of your for loop you perform a Skip, which walks the array from the beginning again up to the current index, so the total work grows quadratically with the array length.
Some LINQ operations contain optimizations to use indexers when possible; unfortunately, in the implementation decompiled above, Skip is not one of them.
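To make that cost visible, here is a rough sketch that counts how many elements are enumerated when the question's loop runs over a source that cannot be indexed. The Counted wrapper and CountSteps are hypothetical helpers written for this illustration; the wrapper is there because some newer runtimes may special-case arrays in Skip, and the exact counts depend on the LINQ implementation in use.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SkipCostDemo
{
    static long steps;

    // Hypothetical wrapper: hands out the array's elements one by one and counts each
    static IEnumerable<byte> Counted(byte[] data)
    {
        foreach (var b in data)
        {
            steps++;
            yield return b;
        }
    }

    // Runs the question's loop over the counted source and
    // returns how many elements were enumerated in total
    internal static long CountSteps(int length)
    {
        steps = 0;
        var data = new byte[length]; // all zeros, so the pattern never matches
        var search = new byte[] { 0xff, 0xd8 };
        for (var index = 0; index < data.Length; index++)
        {
            search.SequenceEqual(Counted(data).Skip(index).Take(2));
        }
        return steps;
    }

    static void Main()
    {
        // Doubling the input length should roughly quadruple the count
        Console.WriteLine(CountSteps(1000));
        Console.WriteLine(CountSteps(2000));
    }
}
```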
PS:
If I were you, I'd change your code to something like:
var search = new byte[] {0xff, 0xd8};
var current = new byte[2];
var maxSearchRange = bytes.Length - 1;
for (var index = 0; index < maxSearchRange; index++)
{
    current[0] = bytes[index];
    current[1] = bytes[index + 1];
    if ((search).SequenceEqual(current))
    ...
Is there a downside to a simple linear search?
Returns start index if found, else -1
private const byte First = 0x0ff;
private const byte Second = 0x0d8;
private static int FindImageStart(IList<byte> bytes) {
    for (var index = 0; index < bytes.Count - 1; index++) {
        if (bytes[index] == First && bytes[index + 1] == Second) {
            return index;
        }
    }
    return -1;
}
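If the project targets a runtime where Span is available (an assumption here: .NET Core 2.1 or later, or the System.Memory package), the same linear search can be handed to the library through MemoryExtensions.IndexOf. This is an alternative to the hand-written loop, not part of the answer above; FindPattern and the sample data are hypothetical.

```csharp
using System;

static class SpanSearchExample
{
    // Returns the start index of pattern in bytes, or -1 if it does not occur
    internal static int FindPattern(byte[] bytes, byte[] pattern)
    {
        // Arrays convert implicitly to ReadOnlySpan<byte>; no copy is made
        ReadOnlySpan<byte> haystack = bytes;
        ReadOnlySpan<byte> needle = pattern;
        return haystack.IndexOf(needle);
    }

    static void Main()
    {
        // Made-up sample data standing in for file.ToArray()
        byte[] bytes = { 0x00, 0xff, 0x10, 0xff, 0xd8, 0xff, 0xe0 };
        byte[] marker = { 0xff, 0xd8 };
        Console.WriteLine(FindPattern(bytes, marker));
    }
}
```

Like the loop above, this allocates nothing per call, and it works for patterns of any length rather than just two bytes.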
public int FindSequence(byte[] source, byte[] seq)
{
    var start = -1;
    for (var i = 0; i < source.Length - seq.Length + 1 && start == -1; i++)
    {
        var j = 0;
        for (; j < seq.Length && source[i + j] == seq[j]; j++) {}
        if (j == seq.Length) start = i;
    }
    return start;
}
How about simple..?
byte[] pattern = new byte[] { 1, 2, 3, 4, 5 };
// <= so that a match ending at the very last byte is still checked
for (int index = 0, end = bytes.Length - pattern.Length; index <= end; index++)
{
    // assume a match until one byte differs
    bool found = true;
    for (int j = 0; j < pattern.Length && found; j++)
    {
        found = bytes[index + j] == pattern[j];
    }
    if (found)
        return index;
}
return -1; // no match
Please note I have not coded in C# for a long time, so excuse any syntax errors. Regard this as pseudo-code (that no longer throws an index error) :)