
C#: Removing common invalid characters from a string: improve this algorithm

Consider the requirement to strip invalid characters from a string. The characters just need to be removed, that is, replaced with a blank or string.Empty.

char[] BAD_CHARS = new char[] { '!', '@', '#', '$', '%', '_' }; //simple example

foreach (char bad in BAD_CHARS)
{
    if (someString.Contains(bad))
      someString = someString.Replace(bad.ToString(), string.Empty);
}

I'd have really liked to do this:

if (BAD_CHARS.Any(bc => someString.Contains(bc)))
    someString.Replace(bc,string.Empty); // bc is out of scope

Question: Do you have any suggestions on refactoring this algorithm, or any simpler, easier to read, performant, maintainable algorithms?

I don't know how readable it is, but a regular expression can do what you need:

someString = Regex.Replace(someString, @"[!@#$%_]", "");

char[] BAD_CHARS = new char[] { '!', '@', '#', '$', '%', '_' }; //simple example
someString = string.Concat(someString.Split(BAD_CHARS,StringSplitOptions.RemoveEmptyEntries));

This should do the trick (sorry for any minor syntax errors, I'm on my phone).

The string class is immutable (although it is a reference type), so its methods, such as Replace, return a new string rather than modifying the original. Calling someString.Replace without assigning the result to anything has no effect on your program. (It seems you have since fixed this problem.)
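
To make the point about immutability concrete, here is a minimal sketch. The class and method names (ImmutabilityDemo, ReplaceWithoutAssign, ReplaceWithAssign) are made up for illustration and are not from the question.

```csharp
using System;

public static class ImmutabilityDemo
{
    // The result of Replace is discarded, so the caller gets the input back unchanged.
    public static string ReplaceWithoutAssign(string s)
    {
        s.Replace("!", string.Empty); // a new string is created and thrown away
        return s;
    }

    // The result of Replace is assigned back, so the caller gets the cleaned string.
    public static string ReplaceWithAssign(string s)
    {
        s = s.Replace("!", string.Empty);
        return s;
    }
}
```

Only the second method returns a string with the '!' characters removed.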

The main issue with your suggested algorithm is that it repeatedly allocates new strings, one for each Replace call that finds a match, potentially causing a big performance hit. LINQ doesn't really help things here. (It doesn't make the code significantly shorter and certainly not any more readable, in my opinion.)

Try the following extension method. The key is the use of StringBuilder, which builds the result in a single preallocated buffer instead of creating a new intermediate string for each bad character.

private static readonly HashSet<char> badChars = 
    new HashSet<char> { '!', '@', '#', '$', '%', '_' };

public static string CleanString(this string str)
{
    var result = new StringBuilder(str.Length);
    for (int i = 0; i < str.Length; i++)
    {
        if (!badChars.Contains(str[i]))
            result.Append(str[i]);
    }
    return result.ToString();
}

This algorithm also makes use of the .NET 3.5 HashSet<T> class to give O(1) lookup time for detecting a bad char. This makes the overall algorithm O(n) rather than the O(nm) of your posted one (m being the number of bad chars); it is also a lot better with memory usage, as explained above.

This one is faster than the HashSet<T> version, because it replaces the hash lookup with a direct array index. Also, if you have to perform this action often, please consider the foundations for this question I asked here.

private static readonly bool[] BadCharValues;

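// Static constructor: it must carry the name of the containing class
// (assumed here to be StaticConstructor).
static StaticConstructor()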
{
    BadCharValues = new bool[char.MaxValue+1];
    char[] badChars = { '!', '@', '#', '$', '%', '_' };
    foreach (char c in badChars)
        BadCharValues[c] = true;
}

public static string CleanString(string str)
{
    var result = new StringBuilder(str.Length);
    for (int i = 0; i < str.Length; i++)
    {
        if (!BadCharValues[str[i]])
            result.Append(str[i]);
    }
    return result.ToString();
}

If you still want to do it in a LINQy way:

public static string CleanUp(this string orig)
{
    var badchars = new HashSet<char>() { '!', '@', '#', '$', '%', '_' };

    return new string(orig.Where(c => !badchars.Contains(c)).ToArray());
}

Extra tip: If you don't want to remember the array of chars that are invalid in file names, you could use Path.GetInvalidFileNameChars(). If you want the equivalent for paths, it's Path.GetInvalidPathChars().

private static string RemoveInvalidChars(string str)
{
    return string.Concat(str.Split(Path.GetInvalidFileNameChars(), StringSplitOptions.RemoveEmptyEntries));
}

Why would you have REALLY LIKED to do that? The code is absolutely no simpler; you're just forcing a query extension method into your code.

As an aside, the Contains check seems redundant, both conceptually and from a performance perspective. In the worst case Contains has to run through the whole string anyway, so you may as well just call Replace(bad.ToString(), string.Empty) for every character and forget about whether or not it's actually present.
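
As a rough sketch of that simplification, the loop from the question without the Contains check could look like this. The class and method names (ReplaceLoopCleaner, Clean) are hypothetical; the character list is the one from the question.

```csharp
using System;

public static class ReplaceLoopCleaner
{
    private static readonly char[] BadChars = { '!', '@', '#', '$', '%', '_' };

    public static string Clean(string input)
    {
        foreach (char bad in BadChars)
        {
            // Replace leaves the text unchanged when the character is absent,
            // so no Contains check is needed first.
            input = input.Replace(bad.ToString(), string.Empty);
        }
        return input;
    }
}
```

This is still O(nm) like the original, it just drops the extra scan per character.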

Of course, a regular expression is always an option, and may be more performant (if perhaps less clear) in a situation like this.
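
If you go the regular expression route and want to keep the bad characters in one array, one possible sketch is to build the pattern once from that array. The names RegexCleaner, BadCharPattern and Clean are made up for illustration; Regex.Escape is used so that characters such as '#' and '$' are treated literally, and whether RegexOptions.Compiled pays off depends on how often you call it.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexCleaner
{
    private static readonly char[] BadChars = { '!', '@', '#', '$', '%', '_' };

    // Built once: an alternation of the escaped bad characters.
    private static readonly Regex BadCharPattern = new Regex(
        string.Join("|", BadChars.Select(c => Regex.Escape(c.ToString()))),
        RegexOptions.Compiled);

    public static string Clean(string input)
    {
        // Every match (a single bad character) is replaced with nothing.
        return BadCharPattern.Replace(input, string.Empty);
    }
}
```

I have not measured this against the other approaches, so treat any performance expectation as a guess.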

Something to consider: if this is for passwords (say), you want to scan for and keep good characters, and assume everything else is bad. It's easier to correctly filter for good things than to try to guess all the bad things.

For each character: if the character is good, keep it (copy it to the output buffer, whatever).

jeff

This is pretty clean. It restricts the string to valid characters instead of removing invalid ones. You should probably split the literals out into constants:

string clean = new string(@"Sour!ce Str&*(@ing".Where(c =>
    @"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ,.".Contains(c)).ToArray());
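
Splitting that into constants might look roughly like the sketch below. The names WhitelistCleaner, AllowedChars and KeepAllowed are hypothetical, and the HashSet<char> is my own addition so the lookup does not scan the allowed-character string for every input character.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class WhitelistCleaner
{
    // The same allowed characters as in the one-liner above.
    private const string AllowedChars =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ,.";

    private static readonly HashSet<char> Allowed = new HashSet<char>(AllowedChars);

    public static string KeepAllowed(string source)
    {
        // Keep only characters that are on the allow list; drop everything else.
        return new string(source.Where(c => Allowed.Contains(c)).ToArray());
    }
}
```

With this, WhitelistCleaner.KeepAllowed("Sour!ce Str&*(@ing") gives "Source String".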


 