
C# Boyer-Moore algorithm where the needle can contain a null value as a wildcard

I attempted to implement the Boyer-Moore algorithm in C#, with the ability to use null as a wildcard in the needle (pattern).

class BoyerMoore
{
    private readonly int[] _badChar;
    private readonly byte?[] _needle;

    public BoyerMoore(byte?[] needle)
    {
        _needle = needle;
        _badChar = new int[256];

        // Pre-processing for bad character heuristic
        for (int i = 0; i < _badChar.Length; i++)
        {
            _badChar[i] = -1;
        }
        for (int i = 0; i < needle.Length; i++)
        {
            if (needle[i] != null)
                _badChar[needle[i].Value] = i;
        }
    }

    public List<int> Search(byte[] haystack)
    {
        List<int> occurrences = new List<int>();
        int i = 0;
        while (i <= haystack.Length - _needle.Length)
        {
            int j;

            for (j = _needle.Length - 1; j >= 0; j--)
            {
                if (_needle[j] == null) continue;
                if (_needle[j] != haystack[i + j]) break;
            }

            if (j < 0)
            {
                occurrences.Add(i);
                i++;
            }
            else
            {
                i += Math.Max(1, j - _badChar[haystack[i + j]]);
            }
        }
        return occurrences;
    }
}

My code works correctly when the needle does not contain a null, but it misses some results when the needle contains one, such as 0xAA, 0xBB, null, 0xCC.

Am I overlooking something, or is it not possible to implement the Boyer-Moore bad character heuristic with a null wildcard?

I searched on Google, but I could not find any examples, tutorials or explanations that use null values as wildcards, so I am asking here.

The Boyer-Moore bad character rule doesn't work with wildcards. Basically, you have to throw away everything to the left of the rightmost wildcard when applying that rule.

Consider that the _badChar array gives you offsets for moving the search position quickly when the character you see at the current haystack position is not in the needle. If you have a wildcard in your needle, it can match any character, yet by leaving those entries at -1 you are currently saying that such characters are not matched by anything in the needle.
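To make the failure concrete, here is a minimal sketch that rebuilds the table the way the question does and computes the shift taken at the first alignment. The class name `BadCharDemo`, the helper names, and the sample haystack are made up for illustration; they are not part of the original code.

```csharp
using System;

static class BadCharDemo
{
    // Builds the table as in the question: wildcards (null) are simply ignored.
    public static int[] BuildNaiveTable(byte?[] needle)
    {
        var table = new int[256];
        for (int i = 0; i < table.Length; i++)
            table[i] = -1;
        for (int i = 0; i < needle.Length; i++)
            if (needle[i] != null)
                table[needle[i].Value] = i;
        return table;
    }

    // Shift taken by the question's search loop at alignment 0.
    public static int FirstShift(byte?[] needle, byte[] haystack)
    {
        int[] table = BuildNaiveTable(needle);
        int j = needle.Length - 1;
        while (j >= 0 && (needle[j] == null || needle[j] == haystack[j]))
            j--;
        return j < 0 ? 1 : Math.Max(1, j - table[haystack[j]]);
    }

    static void Main()
    {
        var needle = new byte?[] { 0xAA, 0xBB, null, 0xCC };
        // The needle matches at index 1: AA BB DD CC (DD sits under the wildcard).
        byte[] haystack = { 0x00, 0xAA, 0xBB, 0xDD, 0xCC };
        Console.WriteLine(FirstShift(needle, haystack)); // prints 4
    }
}
```

The mismatch happens at needle position 3 against 0xDD, whose table entry is -1, so the search jumps 4 positions and steps right over the match at index 1.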

So, rather than setting the _badChar entry for each value not explicitly present in the needle to -1, you want to set it to the position of the last wildcard in the needle.

But you also have to make sure that the entry for every character that does appear in the needle is no further left than that wildcard position, because the wildcard could match those characters too.

You can do that in the setup, after the loop that fills _badChar with -1:

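// _lastNull is an int field; start at -1 so a needle without wildcards is unaffected.
_lastNull = -1;
for (int i = 0; i < needle.Length; i++)
{
    if (needle[i] != null)
        _badChar[needle[i].Value] = i;   // rightmost position of this byte
    else
        _lastNull = i;                   // rightmost wildcard position
}
// A wildcard can match any byte, so no entry may point left of it.
for (int i = 0; i < _badChar.Length; i++)
{
    if (_badChar[i] < _lastNull)
        _badChar[i] = _lastNull;
}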
But now you are effectively running Boyer-Moore on only the part of the needle to the right of the last wildcard, and doing a linear match on the rest of the needle each time the Boyer-Moore step gives you a candidate match.
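One way to see the cost is to compute the largest shift the adjusted table can ever produce: a mismatch at the last needle position against a byte whose entry was raised to the rightmost wildcard. The helper below is a hypothetical illustration (the names `ShiftBound` and `MaxShift` are not from the original code).

```csharp
using System;

static class ShiftBound
{
    // Upper bound on the bad-character shift once every table entry
    // has been clamped to the rightmost wildcard position.
    public static int MaxShift(byte?[] needle)
    {
        int lastNull = -1;
        for (int i = 0; i < needle.Length; i++)
            if (needle[i] == null)
                lastNull = i;
        return Math.Max(1, needle.Length - 1 - lastNull);
    }

    static void Main()
    {
        Console.WriteLine(MaxShift(new byte?[] { 0xAA, 0xBB, 0xCC, 0xDD })); // prints 4
        Console.WriteLine(MaxShift(new byte?[] { null, 0xAA, 0xBB, 0xCC })); // prints 3
        Console.WriteLine(MaxShift(new byte?[] { 0xAA, 0xBB, null, 0xCC })); // prints 1
    }
}
```

The further right the last wildcard sits, the smaller the bound; for the needle in the question the search never advances by more than one position at a time.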
