Removing control characters from a UTF-8 string

Question

I found this question, but its solution also removes all the valid UTF-8 characters: it returns a blank string, even though my input contains valid UTF-8 characters as well as control characters. From what I have read about UTF-8, there is no single range for control characters, and each character set has its own control characters.

How can I modify the above solution to remove only the control characters?

Answer 1

This is how I roll:

Regex.Replace(evilWeirdoText, @"[\u0000-\u001F]", string.Empty)

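This strips out the first 32 control characters (U+0000 through U+001F). The next hex value up from \u001F is \u0020, AKA the space. Everything before space is all the line feed and null nonsense. Note that this range does not cover DEL (U+007F) or the C1 controls (U+0080 through U+009F).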

To check the characters for yourself: http://donsnotes.com/tech/charsets/ascii.html
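
As a variation on the regex above, here is a minimal sketch that uses the Unicode category class \p{Cc} rather than a hand-written range, so DEL and the C1 controls are covered too. The names ControlCharRegex and Strip are made up for this illustration and are not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ControlCharRegex
{
    // \p{Cc} matches the Unicode "Control" category:
    // U+0000-U+001F and U+007F-U+009F.
    private static readonly Regex ControlChars = new Regex(@"\p{Cc}");

    // Hypothetical helper: returns the input with every control character removed.
    public static string Strip(string input)
    {
        return input == null ? null : ControlChars.Replace(input, string.Empty);
    }
}
```

For example, ControlCharRegex.Strip("caf\u00E9\u0000\u007F!") returns "café!": the accented letter survives, while NUL and DEL are dropped. Tab, CR and LF are control characters as well, so they are removed too.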

Answer 2

I think the following code will work for you:

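// Requires: using System.Text; (for StringBuilder)
public static string RemoveControlCharacters(string inString)
{
    if (inString == null) return null;
    // Pre-size the builder; the result is never longer than the input.
    StringBuilder newString = new StringBuilder(inString.Length);
    foreach (char ch in inString)
    {
        // char.IsControl is true for U+0000-U+001F and U+007F-U+009F.
        if (!char.IsControl(ch))
        {
            newString.Append(ch);
        }
    }
    return newString.ToString();
}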
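
Since char.IsControl also flags tab, carriage return and line feed, the approach above flattens multi-line text. If that is not wanted, a sketch along these lines keeps those three characters. It assumes you want exactly tab, CR and LF preserved; the names ControlCharFilter and RemoveControlCharactersKeepWhitespace are invented for the example.

```csharp
using System;
using System.Text;

public static class ControlCharFilter
{
    // Hypothetical helper: drops control characters but keeps tab, CR and LF.
    public static string RemoveControlCharactersKeepWhitespace(string input)
    {
        if (input == null) return null;

        var sb = new StringBuilder(input.Length);
        foreach (char ch in input)
        {
            bool keep = !char.IsControl(ch) || ch == '\t' || ch == '\r' || ch == '\n';
            if (keep)
            {
                sb.Append(ch);
            }
        }
        return sb.ToString();
    }
}
```

Characters outside the ASCII range, including the surrogate pairs used for emoji, are not in the control category, so they pass through unchanged.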

Answer 3

If you plan to use the string as a query string, you should consider running it through Uri.EscapeUriString() or Uri.EscapeDataString() before sending it out. Note: you might still need to strip out anything that char.IsControl() flags first.
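
A rough sketch of that combination, assuming the value is a single query-string parameter: strip the control characters first, then percent-encode with Uri.EscapeDataString. The names QueryValue and Encode are hypothetical.

```csharp
using System;
using System.Linq;

public static class QueryValue
{
    // Hypothetical helper: removes control characters, then percent-encodes
    // the result so it can be used as one query-string value.
    public static string Encode(string value)
    {
        if (value == null) return null;

        string cleaned = new string(value.Where(c => !char.IsControl(c)).ToArray());
        return Uri.EscapeDataString(cleaned);
    }
}
```

For example, QueryValue.Encode("a b\u0000&c") returns "a%20b%26c": the NUL is dropped, and the space and ampersand are escaped.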

3 answers

solution1
19 2014-04-02 07:12:40

solution2
19 ACCEPTED 2011-07-23 10:03:12

solution3
0 2013-01-04 22:17:06