简体   繁体   中英

Removing control characters from a UTF-8 string

I found this question but it removes all valid utf-8 characters also (returns me a blank string, while there are valid utf-8 characters plus control characters). As I read about utf-8 , there's not a specific range for control characters and each character set has its own control characters .

How can I modify above solution to only remove control characters ?

This is how I roll:

Regex.Replace(evilWeirdoText, @"[\u0000-\u001F]", string.Empty)

This strips out all the first 31 control characters. The next hex value up from \ is \ AKA the space. Everything before space is all the line feed and null nonsense.

To believe me on the characters: http://donsnotes.com/tech/charsets/ascii.html

I think the following code will work for you:

public static string RemoveControlCharacters(string inString)
{
    if (inString == null) return null;
    StringBuilder newString = new StringBuilder();
    char ch;
    for (int i = 0; i < inString.Length; i++)
    {
        ch = inString[i];
        if (!char.IsControl(ch))
        {
            newString.Append(ch);
        }
    }
    return newString.ToString();
}

If you plan to use the string as a query string, you should consider using the Uri.EscapeUriString() or Uri.EscapeDataString() before sending it out. Note: You might still need to pull out anything from char.IsControl() first?

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM