C#: Removing common invalid characters from a string: improve this algorithm

Consider the requirement to strip invalid characters from a string. The characters just need to be removed, i.e. replaced with string.Empty.

char[] BAD_CHARS = new char[] { '!', '@', '#', '$', '%', '_' }; //simple example

foreach (char bad in BAD_CHARS)
{
    if (someString.Contains(bad))
      someString = someString.Replace(bad.ToString(), string.Empty);
}

I'd have really liked to do this:

if (BAD_CHARS.Any(bc => someString.Contains(bc)))
    someString.Replace(bc,string.Empty); // bc is out of scope

Question: Do you have any suggestions for refactoring this algorithm, or any simpler, easier to read, more performant, more maintainable algorithms?

I don't know about its readability, but a regular expression can do what you need:

someString = Regex.Replace(someString, @"[!@#$%_]", "");

char[] BAD_CHARS = new char[] { '!', '@', '#', '$', '%', '_' }; //simple example
someString = string.Concat(someString.Split(BAD_CHARS,StringSplitOptions.RemoveEmptyEntries));

This should do the trick (sorry for any small syntax errors, I'm on my phone).

The string class is immutable (although it is a reference type), so all of its methods that appear to modify it, such as Replace, actually return a new string. Calling someString.Replace without assigning the result to anything has no effect in your program. (It seems you have since fixed this problem.)
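To make the immutability point concrete, here is a minimal sketch. The class and method names (ImmutabilityDemo, ReplaceWithoutAssigning, ReplaceAndAssign) are hypothetical and exist only for illustration.

```csharp
public static class ImmutabilityDemo
{
    // Calls Replace but ignores the return value, so the input comes back unchanged.
    public static string ReplaceWithoutAssigning(string s)
    {
        s.Replace("!", string.Empty); // a new string is created and immediately discarded
        return s;
    }

    // Assigns the result back, which is what actually removes the characters.
    public static string ReplaceAndAssign(string s)
    {
        s = s.Replace("!", string.Empty);
        return s;
    }
}
```

With the input "a!b!", the first method returns "a!b!" and the second returns "ab".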

The main issue with your suggested algorithm is that it repeatedly allocates new strings, potentially causing a big performance hit. LINQ doesn't really help things here. (It doesn't make the code significantly shorter and certainly not any more readable, in my opinion.)

Try the following extension method. The key is the use of StringBuilder, pre-sized to the input length, so the result is built in a single buffer rather than in a new string for every replacement.

private static readonly HashSet<char> badChars = 
    new HashSet<char> { '!', '@', '#', '$', '%', '_' };

public static string CleanString(this string str)
{
    var result = new StringBuilder(str.Length);
    for (int i = 0; i < str.Length; i++)
    {
        if (!badChars.Contains(str[i]))
            result.Append(str[i]);
    }
    return result.ToString();
}

This algorithm also makes use of the .NET 3.5 HashSet<T> class to give O(1) lookup time when detecting a bad char. This makes the overall algorithm O(n) rather than the O(nm) of your posted one (m being the number of bad chars); it is also a lot better on memory usage, as explained above.
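Extension methods must live in a static class, so a compilable version of the method above might look like the sketch below. The class name StringCleaningExtensions and the null guard are assumptions added for illustration, not part of the original answer.

```csharp
using System.Collections.Generic;
using System.Text;

public static class StringCleaningExtensions
{
    private static readonly HashSet<char> BadChars =
        new HashSet<char> { '!', '@', '#', '$', '%', '_' };

    public static string CleanString(this string str)
    {
        // Assumption: treat null as "nothing to clean" rather than throwing.
        if (str == null)
            return null;

        // Single pass over the input; each lookup in the set is O(1).
        var result = new StringBuilder(str.Length);
        foreach (char c in str)
        {
            if (!BadChars.Contains(c))
                result.Append(c);
        }
        return result.ToString();
    }
}
```

Usage: "he!!o_w@rld".CleanString() returns "heowrld".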

This one is faster than HashSet<T>. Also, if you have to perform this action often, please consider the foundations for this question I asked here.

private static readonly bool[] BadCharValues;

// Static constructor: its name must match the enclosing class.
static StaticConstructor()
{
    BadCharValues = new bool[char.MaxValue+1];
    char[] badChars = { '!', '@', '#', '$', '%', '_' };
    foreach (char c in badChars)
        BadCharValues[c] = true;
}

public static string CleanString(string str)
{
    var result = new StringBuilder(str.Length);
    for (int i = 0; i < str.Length; i++)
    {
        if (!BadCharValues[str[i]])
            result.Append(str[i]);
    }
    return result.ToString();
}

If you still want to do it in a LINQy way:

public static string CleanUp(this string orig)
{
    var badchars = new HashSet<char>() { '!', '@', '#', '$', '%', '_' };

    return new string(orig.Where(c => !badchars.Contains(c)).ToArray());
}

Extra tip: If you don't want to remember the array of chars that are invalid for file names, you could use Path.GetInvalidFileNameChars(). If you want it for paths, it's Path.GetInvalidPathChars().

private static string RemoveInvalidChars(string str)
{
    return string.Concat(str.Split(Path.GetInvalidFileNameChars(), StringSplitOptions.RemoveEmptyEntries));
}

Why would you have REALLY LIKED to do that? The code is absolutely no simpler; you're just forcing a query extension method into your code.

As an aside, the Contains check seems redundant, both conceptually and from a performance perspective. Contains has to run through the whole string anyway, so you may as well just call Replace(bad.ToString(), string.Empty) for every character and forget about whether or not it's actually present.
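Dropping the Contains check leaves a loop like the following sketch. The class and method names (ReplaceLoop, RemoveBadChars) are hypothetical.

```csharp
public static class ReplaceLoop
{
    private static readonly char[] BadChars = { '!', '@', '#', '$', '%', '_' };

    public static string RemoveBadChars(string input)
    {
        foreach (char bad in BadChars)
        {
            // Replace yields the same content when 'bad' is absent,
            // so no prior Contains check is needed.
            input = input.Replace(bad.ToString(), string.Empty);
        }
        return input;
    }
}
```

This still scans the string once per bad character, so it simplifies the original loop without changing its O(nm) behaviour.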

Of course, a regular expression is always an option, and may be more performant (though less clear) in a situation like this.
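One way the regular-expression option could look is sketched below. It builds the character class from the same char array and caches the Regex in a static field. The names RegexCleaner and RemoveBadChars are hypothetical, and writing each character as a \uXXXX escape is just one safe choice for avoiding metacharacter problems inside the class.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexCleaner
{
    private static readonly char[] BadChars = { '!', '@', '#', '$', '%', '_' };

    // Builds a character class such as [\u0021\u0040...] so that no character
    // in the list can be misread as a regex metacharacter.
    private static readonly Regex BadCharPattern = new Regex(
        "[" + string.Concat(BadChars.Select(c => "\\u" + ((int)c).ToString("X4"))) + "]",
        RegexOptions.Compiled);

    public static string RemoveBadChars(string input)
    {
        return BadCharPattern.Replace(input, string.Empty);
    }
}
```

Whether this actually beats the StringBuilder approaches depends on the input, so it is worth measuring before relying on it.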

Something to consider: if this is for passwords (say), you want to scan for and keep good characters, and assume everything else is bad. It's easier to correctly filter for good things than to try to guess all the bad things.

For each character: if the character is good, keep it (copy it to the output buffer, whatever).

jeff
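The keep-the-good-characters loop described above can be sketched as follows. The definition of "good" here (ASCII letters, digits, and space) is an assumption chosen for illustration, as are the names WhitelistFilter and KeepGoodChars.

```csharp
using System.Text;

public static class WhitelistFilter
{
    public static string KeepGoodChars(string input)
    {
        var result = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            // Assumption for illustration: "good" means ASCII letters, digits, or space.
            bool isGood = (c >= 'a' && c <= 'z')
                       || (c >= 'A' && c <= 'Z')
                       || (c >= '0' && c <= '9')
                       || c == ' ';
            if (isGood)
                result.Append(c); // copy good characters to the output buffer
        }
        return result.ToString();
    }
}
```

For example, WhitelistFilter.KeepGoodChars("Sour!ce Str&*(@ing 42") returns "Source String 42".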

This is pretty clean. It restricts the string to valid characters instead of removing invalid ones. You should probably split it out into constants:

string clean = new string(@"Sour!ce Str&*(@ing".Where(c =>
    @"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ,.".Contains(c)).ToArray());
