Is there a higher performance method for removing rare unwanted chars from a string?

EDIT

Apologies if the original unedited question is misleading.

This question is not asking how to remove invalid XML chars from a string; answers to that question would be better directed here.

I'm not asking you to review my code.

What I'm looking for in answers is a function with the signature

string <YourName>(string input, Func<char, bool> check);

that will have performance similar to or better than RemoveCharsBufferCopyBlackList. Ideally this function would be more generic and, if possible, simpler to read, but these requirements are secondary.


I recently wrote a function to strip invalid XML chars from a string. In my application the strings can be modestly long and the invalid chars occur rarely. This exercise got me thinking: in what ways can this be done in safe managed C#, and which would offer the best performance for my scenario?

Here is my test program. I've substituted the "valid XML predicate" with one that omits the char 'X'.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Program
{
    static void Main()
    {
        var attempts = new List<Func<string, Func<char, bool>, string>>
            {
                RemoveCharsLinqWhiteList,
                RemoveCharsFindAllWhiteList,
                RemoveCharsBufferCopyBlackList
            };

        const string GoodString = "1234567890abcdefgabcedefg";
        const string BadString = "1234567890abcdefgXabcedefg";
        const int Iterations = 100000;
        var timer = new Stopwatch();

        var testSet = new List<string>(Iterations);
        for (var i = 0; i < Iterations; i++)
        {
            if (i % 1000 == 0)
            {
                testSet.Add(BadString);
            }
            else
            {
                testSet.Add(GoodString);
            }
        }

        foreach (var attempt in attempts)
        {
            //Check function works and JIT
            if (attempt.Invoke(BadString, IsNotUpperX) != GoodString)
            {
                throw new ApplicationException("Broken Function");       
            }

            if (attempt.Invoke(GoodString, IsNotUpperX) != GoodString)
            {
                throw new ApplicationException("Broken Function");       
            }

            timer.Reset();
            timer.Start();
            foreach (var t in testSet)
            {
                attempt.Invoke(t, IsNotUpperX);
            }

            timer.Stop();
            Console.WriteLine(
                "{0} iterations of function \"{1}\" performed in {2}ms",
                Iterations,
                attempt.Method,
                timer.ElapsedMilliseconds);
            Console.WriteLine();
        }

        Console.ReadKey();
    }

    private static bool IsNotUpperX(char value)
    {
        return value != 'X';
    }

    private static string RemoveCharsLinqWhiteList(string input,
                                                      Func<char, bool> check)
    {
        return new string(input.Where(check).ToArray());
    }

    private static string RemoveCharsFindAllWhiteList(string input,
                                                      Func<char, bool> check)
    {
        return new string(Array.FindAll(input.ToCharArray(), check.Invoke));
    }

    private static string RemoveCharsBufferCopyBlackList(string input,
                                                      Func<char, bool> check)
    {
        char[] inputArray = null;
        char[] outputBuffer = null;

        var blackCount = 0;
        var lastb = -1;
        var whitePos = 0;

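        for (var b = 0; b < input.Length; b++)
        {
            if (!check.Invoke(input[b]))
            {
                // Allocate on the first rejected char, even when no run of
                // kept chars precedes it; otherwise the code after the loop
                // would dereference null arrays (e.g. for input "Xabc").
                if (outputBuffer == null)
                {
                    outputBuffer = new char[input.Length - blackCount];
                }

                if (inputArray == null)
                {
                    inputArray = input.ToCharArray();
                }

                var whites = b - lastb - 1;
                if (whites > 0)
                {
                    // Copy the run of kept chars since the last rejected one.
                    // Buffer.BlockCopy takes byte offsets, hence the * 2.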
                    Buffer.BlockCopy(
                                      inputArray,
                                      (lastb + 1) * 2,
                                      outputBuffer,
                                      whitePos * 2,
                                      whites * 2);
                    whitePos += whites; 
                }

                lastb = b;
                blackCount++;
            }
        }

        if (blackCount == 0)
        {
            return input;
        }

        var remaining = inputArray.Length - 1 - lastb;
        if (remaining > 0)
        {
            Buffer.BlockCopy(
                              inputArray,
                              (lastb + 1) * 2,
                              outputBuffer,
                              whitePos * 2,
                              remaining * 2);

        }

        return new string(outputBuffer, 0, inputArray.Length - blackCount);
    }        
}

If you run the attempts you'll note that the performance improves as the functions get more specialised. Is there a faster and more generic way to perform this operation? Or, if there is no generic option, is there a way that is simply faster?

Please note that I am not actually interested in removing 'X'; in practice the predicate is more complicated.

You certainly don't want to use LINQ to Objects (i.e. enumerators) to do this if you require high performance. Also, don't invoke a delegate per char. Delegate invocations are costly compared to the actual operation you are doing.

RemoveCharsBufferCopyBlackList looks good (except for the delegate call per character).

I recommend that you hard-code the contents of the delegate inline. Play around with different ways to write the condition. You may get better performance by first checking the current char against a range of known good chars (e.g. 0x20-0xFF) and, if it matches, letting it through. This test will almost always pass, so you can skip the more expensive checks against the individual characters that are invalid in XML.
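As a rough illustration of that advice, here is a minimal sketch with the check written directly in the loop, so there is no delegate call per char. The names `InlineXmlFilter` and `StripInvalidXmlChars` are made up for this example. The rule it assumes is: control chars below 0x20 other than tab, LF and CR are removed, as are U+FFFE and U+FFFF, and surrogates are left alone. It has not been benchmarked against the code in the question.

```csharp
using System;
using System.Text;

public static class InlineXmlFilter
{
    // Hypothetical helper: the predicate is hard-coded in the loop body.
    public static string StripInvalidXmlChars(string input)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));

        StringBuilder sb = null; // only allocated once a bad char is seen
        int runStart = 0;        // start of the current run of kept chars

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];

            // Fast path: a single range test that passes for almost
            // every char in typical text.
            if (c >= 0x20 && c < 0xFFFE) continue;

            // Slow path: the few control chars that are allowed.
            if (c == '\t' || c == '\n' || c == '\r') continue;

            // c is rejected: flush the kept run that precedes it.
            if (sb == null) sb = new StringBuilder(input.Length);
            sb.Append(input, runStart, i - runStart);
            runStart = i + 1;
        }

        if (sb == null) return input; // nothing removed, no allocation

        sb.Append(input, runStart, input.Length - runStart);
        return sb.ToString();
    }
}
```

The upper bound of the fast-path range is an assumption you would tune to your own data; the narrower range suggested above (0x20-0xFF) works the same way, with more chars falling through to the slow path.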

Edit: I just remembered that I solved this problem a while ago:

    static readonly string invalidXmlChars =
        Enumerable.Range(0, 0x20)
        .Where(i => !(i == '\u000A' || i == '\u000D' || i == '\u0009'))
        .Select(i => (char)i)
        .ConcatToString()
        + "\uFFFE\uFFFF";
    public static string RemoveInvalidXmlChars(string str)
    {
        return RemoveInvalidXmlChars(str, false);
    }
    internal static string RemoveInvalidXmlChars(string str, bool forceRemoveSurrogates)
    {
        if (str == null) throw new ArgumentNullException("str");
        if (!ContainsInvalidXmlChars(str, forceRemoveSurrogates))
            return str;

        str = str.RemoveCharset(invalidXmlChars);
        if (forceRemoveSurrogates)
        {
            for (int i = 0; i < str.Length; i++)
            {
                if (IsSurrogate(str[i]))
                {
                    str = str.Where(c => !IsSurrogate(c)).ConcatToString();
                    break;
                }
            }
        }

        return str;
    }
    static bool IsSurrogate(char c)
    {
        return c >= 0xD800 && c < 0xE000;
    }
    internal static bool ContainsInvalidXmlChars(string str)
    {
        return ContainsInvalidXmlChars(str, false);
    }
    public static bool ContainsInvalidXmlChars(string str, bool forceRemoveSurrogates)
    {
        if (str == null) throw new ArgumentNullException("str");
        for (int i = 0; i < str.Length; i++)
        {
            if (str[i] < 0x20 && !(str[i] == '\u000A' || str[i] == '\u000D' || str[i] == '\u0009'))
                return true;
            if (str[i] >= 0xD800)
            {
                if (forceRemoveSurrogates && str[i] < 0xE000)
                    return true;
                if ((str[i] == '\uFFFE' || str[i] == '\uFFFF'))
                    return true;
            }
        }
        return false;
    }
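The code above calls two extension methods, `ConcatToString` and `RemoveCharset`, whose definitions are not shown. The following is only a guess at what they might look like, inferred from how they are used; the class name `StringHelperSketch` and both bodies are assumptions, not the original helpers.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class StringHelperSketch
{
    // Assumed behaviour: join a sequence of chars into one string.
    public static string ConcatToString(this IEnumerable<char> chars)
    {
        if (chars == null) throw new ArgumentNullException(nameof(chars));

        var sb = new StringBuilder();
        foreach (char c in chars)
        {
            sb.Append(c);
        }
        return sb.ToString();
    }

    // Assumed behaviour: drop every char of str that occurs in charset.
    public static string RemoveCharset(this string str, string charset)
    {
        if (str == null) throw new ArgumentNullException(nameof(str));
        if (charset == null) throw new ArgumentNullException(nameof(charset));

        var sb = new StringBuilder(str.Length);
        foreach (char c in str)
        {
            // string.IndexOf(char) is an ordinal search; -1 means "not in the set".
            if (charset.IndexOf(c) < 0)
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

With a charset of about thirty chars, the linear `IndexOf` per char is acceptable here because this path only runs for the rare strings that actually contain an invalid char.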

Note that RemoveInvalidXmlChars first invokes ContainsInvalidXmlChars to avoid the string allocation. Most strings do not contain invalid XML chars, so we can be optimistic.
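The same optimistic idea can be applied to the generic signature asked for in the question. This is a sketch only: `RemoveCharsScanFirst` and the class `OptimisticFilter` are hypothetical names, `check` returns true for chars to keep (as `IsNotUpperX` does), and it still pays for one delegate call per char, so it trades some speed for generality. It has not been timed.

```csharp
using System;

public static class OptimisticFilter
{
    // Hypothetical generic variant: scan first, allocate only if needed.
    public static string RemoveCharsScanFirst(string input, Func<char, bool> check)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));
        if (check == null) throw new ArgumentNullException(nameof(check));

        // Optimistic scan: find the first rejected char without allocating.
        int firstBad = 0;
        while (firstBad < input.Length && check(input[firstBad]))
        {
            firstBad++;
        }

        if (firstBad == input.Length)
        {
            return input; // common case: nothing to remove
        }

        // At least one char is dropped, so Length - 1 is an upper bound.
        var buffer = new char[input.Length - 1];
        input.CopyTo(0, buffer, 0, firstBad); // bulk-copy the clean prefix
        int written = firstBad;

        // Filter the remainder one char at a time.
        for (int i = firstBad + 1; i < input.Length; i++)
        {
            char c = input[i];
            if (check(c))
            {
                buffer[written++] = c;
            }
        }

        return new string(buffer, 0, written);
    }
}
```

It can be dropped into the `attempts` list of the test program in the question, for example as `OptimisticFilter.RemoveCharsScanFirst`.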
