Is there a more performant way to remove rare unwanted chars from a string?

Question

EDIT

Apologies if the original unedited question was misleading.

This question is not asking how to remove invalid XML chars from a string; answers to that question would be better directed here.

I'm not asking you to review my code.

What I'm looking for in answers is a function with the signature

string <YourName>(string input, Func<char, bool> check);

that performs similarly to or better than RemoveCharsBufferCopyBlackList. Ideally this function would be more generic and, if possible, simpler to read, but these requirements are secondary.

I recently wrote a function to strip invalid XML chars from a string. In my application the strings can be modestly long and the invalid chars occur rarely. This exercise got me thinking: in what ways can this be done in safe managed C#, and which would offer the best performance for my scenario?

Here is my test program. I've substituted the "valid XML predicate" with one that omits the char 'X'.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Program
{
    static void Main()
    {
        var attempts = new List<Func<string, Func<char, bool>, string>>
            {
                RemoveCharsLinqWhiteList,
                RemoveCharsFindAllWhiteList,
                RemoveCharsBufferCopyBlackList
            };

        const string GoodString = "1234567890abcdefgabcedefg";
        const string BadString = "1234567890abcdefgXabcedefg";
        const int Iterations = 100000;
        var timer = new Stopwatch();

        var testSet = new List<string>(Iterations);
        for (var i = 0; i < Iterations; i++)
        {
            if (i % 1000 == 0)
            {
                testSet.Add(BadString);
            }
            else
            {
                testSet.Add(GoodString);
            }
        }

        foreach (var attempt in attempts)
        {
            //Check function works and JIT
            if (attempt.Invoke(BadString, IsNotUpperX) != GoodString)
            {
                throw new ApplicationException("Broken Function");       
            }

            if (attempt.Invoke(GoodString, IsNotUpperX) != GoodString)
            {
                throw new ApplicationException("Broken Function");       
            }

            timer.Reset();
            timer.Start();
            foreach (var t in testSet)
            {
                attempt.Invoke(t, IsNotUpperX);
            }

            timer.Stop();
            Console.WriteLine(
                "{0} iterations of function \"{1}\" performed in {2}ms",
                Iterations,
                attempt.Method,
                timer.ElapsedMilliseconds);
            Console.WriteLine();
        }

        Console.ReadKey();
    }

    private static bool IsNotUpperX(char value)
    {
        return value != 'X';
    }

    private static string RemoveCharsLinqWhiteList(string input,
                                                      Func<char, bool> check)
    {
        return new string(input.Where(check).ToArray());
    }

    private static string RemoveCharsFindAllWhiteList(string input,
                                                      Func<char, bool> check)
    {
        return new string(Array.FindAll(input.ToCharArray(), check.Invoke));
    }

    private static string RemoveCharsBufferCopyBlackList(string input,
                                                      Func<char, bool> check)
    {
        char[] inputArray = null;
        char[] outputBuffer = null;

        var blackCount = 0;
        var lastb = -1;
        var whitePos = 0;

        for (var b = 0; b < input.Length; b++)
        {
            if (!check.Invoke(input[b]))
            {
                var whites = b - lastb - 1;
                if (whites > 0)
                {
                    if (outputBuffer == null)
                    {
                        outputBuffer = new char[input.Length - blackCount];
                    }

                    if (inputArray == null)
                    {
                        inputArray = input.ToCharArray();
                    }

                    Buffer.BlockCopy(
                                      inputArray,
                                      (lastb + 1) * 2,
                                      outputBuffer,
                                      whitePos * 2,
                                      whites * 2);
                    whitePos += whites; 
                }

                lastb = b;
                blackCount++;
            }
        }

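        if (blackCount == 0)
        {
            return input;
        }

        // The buffers are still null when no run of good chars preceded a bad
        // char (e.g. the string starts with bad chars), so allocate them here.
        if (inputArray == null)
        {
            inputArray = input.ToCharArray();
        }

        if (outputBuffer == null)
        {
            outputBuffer = new char[input.Length - blackCount];
        }

        // Copy the run of good chars that follows the last bad char.
        var remaining = inputArray.Length - 1 - lastb;
        if (remaining > 0)
        {
            Buffer.BlockCopy(
                              inputArray,
                              (lastb + 1) * 2,
                              outputBuffer,
                              whitePos * 2,
                              remaining * 2);
        }

        return new string(outputBuffer, 0, inputArray.Length - blackCount);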
    }        
}

If you run the attempts you'll note that the performance improves as the functions get more specialised. Is there a faster and more generic way to perform this operation? Or, if there is no generic option, is there a way that is just faster?

Please note that I am not actually interested in removing 'X'; in practice the predicate is more complicated.

Answer 1

You certainly don't want to use LINQ to Objects (i.e. enumerators) to do this if you require high performance. Also, don't invoke a delegate per char. Delegate invocations are costly compared to the actual operation you are doing.

RemoveCharsBufferCopyBlackList looks good (except for the delegate call per character).

I recommend that you hard-code the contents of the delegate inline. Play around with different ways to write the condition. You may get better performance by first checking the current char against a range of known good chars (e.g. 0x20-0xFF) and, if it matches, letting it through. This test will almost always pass, so you save the more expensive checks against the individual characters that are invalid in XML.
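A minimal sketch of that suggestion, not a measured or definitive implementation: the check is written inline in the loop, with the 0x20-0xFF range tested first. The class and method names (`XmlCharFilterSketch`, `StripInvalidXmlChars`) are hypothetical, and the rule set is an assumption that mirrors the checks in the code further down in this answer (control chars other than tab, LF and CR, plus U+FFFE and U+FFFF, are removed; surrogates are kept).

```csharp
using System;
using System.Text;

static class XmlCharFilterSketch
{
    // Hypothetical name. Removes chars assumed invalid in XML without a delegate call per char.
    public static string StripInvalidXmlChars(string input)
    {
        if (input == null) throw new ArgumentNullException("input");

        StringBuilder sb = null; // allocated only once the first bad char is found
        int runStart = 0;        // start of the current run of good chars

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];

            // Fast path: in typical text almost every char falls in this range.
            if (c >= 0x20 && c <= 0xFF) continue;

            // Slower checks, reached only for the remaining chars.
            if (c == '\t' || c == '\n' || c == '\r') continue;
            if (c > 0xFF && c < '\uFFFE') continue;

            // Bad char: flush the preceding run of good chars and skip this one.
            if (sb == null) sb = new StringBuilder(input.Length);
            sb.Append(input, runStart, i - runStart);
            runStart = i + 1;
        }

        // Nothing was removed, so return the original instance without allocating.
        if (sb == null) return input;

        sb.Append(input, runStart, input.Length - runStart);
        return sb.ToString();
    }
}
```

Whether the fast path actually helps depends on the input and the JIT, so it is worth timing against RemoveCharsBufferCopyBlackList with the test program from the question.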

Edit: I just remembered I solved this problem a while ago:

    static readonly string invalidXmlChars =
        Enumerable.Range(0, 0x20)
        .Where(i => !(i == '\u000A' || i == '\u000D' || i == '\u0009'))
        .Select(i => (char)i)
        .ConcatToString()
        + "\uFFFE\uFFFF";
    public static string RemoveInvalidXmlChars(string str)
    {
        return RemoveInvalidXmlChars(str, false);
    }
    internal static string RemoveInvalidXmlChars(string str, bool forceRemoveSurrogates)
    {
        if (str == null) throw new ArgumentNullException("str");
        if (!ContainsInvalidXmlChars(str, forceRemoveSurrogates))
            return str;

        str = str.RemoveCharset(invalidXmlChars);
        if (forceRemoveSurrogates)
        {
            for (int i = 0; i < str.Length; i++)
            {
                if (IsSurrogate(str[i]))
                {
                    str = str.Where(c => !IsSurrogate(c)).ConcatToString();
                    break;
                }
            }
        }

        return str;
    }
    static bool IsSurrogate(char c)
    {
        return c >= 0xD800 && c < 0xE000;
    }
    internal static bool ContainsInvalidXmlChars(string str)
    {
        return ContainsInvalidXmlChars(str, false);
    }
    public static bool ContainsInvalidXmlChars(string str, bool forceRemoveSurrogates)
    {
        if (str == null) throw new ArgumentNullException("str");
        for (int i = 0; i < str.Length; i++)
        {
            if (str[i] < 0x20 && !(str[i] == '\u000A' || str[i] == '\u000D' || str[i] == '\u0009'))
                return true;
            if (str[i] >= 0xD800)
            {
                if (forceRemoveSurrogates && str[i] < 0xE000)
                    return true;
                if ((str[i] == '\uFFFE' || str[i] == '\uFFFF'))
                    return true;
            }
        }
        return false;
    }

Notice that RemoveInvalidXmlChars first invokes ContainsInvalidXmlChars to avoid allocating a new string. Most strings do not contain invalid XML chars, so we can be optimistic.
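The code above calls two extension methods, ConcatToString and RemoveCharset, that are not shown in this answer. The following is only a guess at what they might look like, inferred from how they are called; the class name and both bodies are assumptions, not the answerer's actual helpers.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-ins for the helpers the answer does not show.
static class StringExtensionsSketch
{
    // Assumed behaviour: concatenate a sequence of chars into a string.
    public static string ConcatToString(this IEnumerable<char> chars)
    {
        var sb = new StringBuilder();
        foreach (char c in chars)
        {
            sb.Append(c);
        }
        return sb.ToString();
    }

    // Assumed behaviour: return str without any char that occurs in charset.
    public static string RemoveCharset(this string str, string charset)
    {
        // Find the first unwanted char; if there is none, no copy is needed.
        int first = str.IndexOfAny(charset.ToCharArray());
        if (first < 0) return str;

        var sb = new StringBuilder(str.Length);
        sb.Append(str, 0, first); // everything before the first unwanted char
        for (int i = first + 1; i < str.Length; i++)
        {
            if (charset.IndexOf(str[i]) < 0)
            {
                sb.Append(str[i]);
            }
        }
        return sb.ToString();
    }
}
```

With helpers along these lines in scope, the answer's code should compile as written; the slow LINQ path in RemoveInvalidXmlChars only runs when a surrogate is actually present.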
