
Removing hidden characters from within strings

My problem:

I have a .NET application that sends out newsletters via email. When a newsletter is viewed in Outlook, Outlook displays a question mark in place of a hidden character it can't recognize. These hidden characters come from end users who copy and paste the HTML that makes up the newsletter into a form and submit it. C#'s Trim() removes these hidden characters if they occur at the beginning or end of the string. When the newsletter is viewed in Gmail, Gmail does a good job of ignoring them. When I paste these hidden characters into a Word document and turn on the “show paragraph marks and hidden symbols” option, each one appears as one rectangle inside a bigger rectangle. The text of the newsletters can be in any language, so accepting Unicode characters is a must. I've tried looping through the string to detect the character, but the loop doesn't recognize it and passes over it. Asking the end users to paste the HTML into Notepad before submitting it is out of the question.

My question:
How can I detect and eliminate these hidden characters using C#?

You can remove all control characters from your input string with something like this:

string input; // this is your input string
string output = new string(input.Where(c => !char.IsControl(c)).ToArray());

Here is the documentation for the IsControl() method.

Or, if you want to keep letters and digits only, you can use the IsLetter and IsDigit methods:

string output = new string(input.Where(c => char.IsLetter(c) || char.IsDigit(c)).ToArray());

I usually use this regular expression to replace all non-printable characters.

By the way, most people consider tab, line feed and carriage return to be non-printable characters, but for me they are not.

So here is the expression:

string output = Regex.Replace(input, @"[^\u0009\u000A\u000D\u0020-\u007E]", "*");
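  • ^ at the start of the class negates it: match any character that is not one of the following:
  • \u0009 is tab
  • \u000A is line feed
  • \u000D is carriage return
  • \u0020-\u007E means everything from space to ~, that is, every printable ASCII character.

See an ASCII table if you want to make changes. Remember that this replaces every non-ASCII character with "*", including letters from other languages.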
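
To see what this whitelist does to mixed input, here is a small sketch. The class and method names (AsciiWhitelistDemo, Mask) are made up for illustration; the pattern is the one shown above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AsciiWhitelistDemo
{
    // Anything that is not tab, LF, CR, or printable ASCII (space through ~).
    private static readonly Regex NotAllowed =
        new Regex(@"[^\u0009\u000A\u000D\u0020-\u007E]");

    // Replaces every character outside the whitelist with "*".
    public static string Mask(string input) => NotAllowed.Replace(input, "*");

    public static void Main()
    {
        // Contains a tab, a left-to-right mark (U+200E), and an accented letter (U+00E9).
        string sample = "a\tb\u200Ec \u00E9";
        // The tab survives; the hidden mark and the accented letter both become "*".
        Console.WriteLine(Mask(sample) == "a\tb*c *"); // True
    }
}
```

Because the accented letter is masked as well, this approach only fits input that is expected to be plain ASCII.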

To test the expression above, you can build a string yourself like this:

    string input = string.Empty;

    for (int i = 0; i < 255; i++)
    {
        input += (char)(i);
    }

What worked best for me is:

string result = new string(value.Where(c =>  char.IsLetterOrDigit(c) || (c >= ' ' && c <= byte.MaxValue)).ToArray());

Here I keep any character that is a letter or digit, so that non-English letters are not dropped. Anything else is kept only if it lies between space and byte.MaxValue (255), which preserves punctuation while dropping the control characters below space. Note that this range still lets DEL (U+007F) and the C1 control characters (U+0080 to U+009F) through.

Some suggest using IsControl to check whether a character is non-printable, but IsControl does not catch the left-to-right mark, for example.
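
The point about the left-to-right mark can be checked with char.GetUnicodeCategory: U+200E is in the Format category (Cf), not Control (Cc). The sketch below filters on both categories. The helper name StripControlAndFormat is hypothetical, and two caveats apply: it also removes tab, CR and LF (they are control characters), and the Format category includes the zero-width joiner and non-joiner, which some scripts and emoji sequences rely on.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class CategoryFilter
{
    // Hypothetical helper: drops Control (Cc) and Format (Cf) characters, keeps everything else.
    public static string StripControlAndFormat(string input)
    {
        return new string(input.Where(c =>
        {
            UnicodeCategory cat = char.GetUnicodeCategory(c);
            return cat != UnicodeCategory.Control && cat != UnicodeCategory.Format;
        }).ToArray());
    }

    public static void Main()
    {
        Console.WriteLine(char.IsControl('\u200E'));          // False
        Console.WriteLine(char.GetUnicodeCategory('\u200E')); // Format

        // Accented and Chinese letters stay; U+200E and U+0004 are removed.
        string sample = "caf\u00E9\u200E \u4F60\u597D\u0004";
        Console.WriteLine(StripControlAndFormat(sample) == "caf\u00E9 \u4F60\u597D"); // True
    }
}
```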

new string(input.Where(c => !char.IsControl(c)).ToArray());

IsControl misses some invisible characters, such as the left-to-right mark (LRM, U+200E), which is a format character rather than a control character and commonly sneaks into a string during copy and paste. If you are sure that your string should contain only letters and digits, you can use IsLetterOrDigit:

new string(input.Where(c => char.IsLetterOrDigit(c)).ToArray())

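If your string also has special characters such as punctuation, you can keep ASCII only. Note that this drops all non-ASCII text and still keeps the ASCII control characters: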

new string(input.Where(c => c < 128).ToArray())

You can do this:

var hChars = new char[] {...};
var result = new string(yourString.Where(c => !hChars.Contains(c)).ToArray());
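
As a rough sketch of the same idea with a concrete list: the four characters below are only an illustrative guess at common copy-and-paste leftovers, not the list the answer above had in mind, and the names KnownHiddenChars and Strip are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KnownHiddenChars
{
    // Illustrative list only; adjust it to the characters you actually find in your input.
    private static readonly HashSet<char> Hidden = new HashSet<char>
    {
        '\u200B', // ZERO WIDTH SPACE
        '\u200E', // LEFT-TO-RIGHT MARK
        '\u200F', // RIGHT-TO-LEFT MARK
        '\uFEFF'  // ZERO WIDTH NO-BREAK SPACE (byte order mark)
    };

    // Removes every character that appears in the Hidden set.
    public static string Strip(string input) =>
        new string(input.Where(c => !Hidden.Contains(c)).ToArray());

    public static void Main()
    {
        Console.WriteLine(Strip("\uFEFFname\u200E.pdf")); // name.pdf
    }
}
```

A HashSet keeps the lookup cheap if the list grows, and everything not on the list, including text in any language, passes through untouched.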

If you know what these characters are, you can use string.Replace:

newString = oldString.Replace("?", "");

where "?" represents the character you want to strip out.

The drawback with this approach is that you need to make this call repeatedly if there are multiple characters that you want to remove.

It has been a while, but this hasn't been answered yet.

How do you include the HTML content in the sending code? If you are reading it from a file, check the file encoding. If the file is saved as UTF-8 with signature, also called a byte order mark or BOM (the name varies slightly between editors), this may cause the weird character at the beginning of the mail.
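
A minimal sketch of that situation, assuming the HTML arrives as raw UTF-8 bytes: Encoding.UTF8.GetString leaves the signature in the decoded string as a leading U+FEFF, which can then be removed. The helper name StripLeadingBom is hypothetical.

```csharp
using System;
using System.Text;

public static class BomDemo
{
    // Hypothetical helper: removes one leading U+FEFF if present.
    public static string StripLeadingBom(string text) =>
        text.Length > 0 && text[0] == '\uFEFF' ? text.Substring(1) : text;

    public static void Main()
    {
        // UTF-8 signature (EF BB BF) followed by "<p>".
        byte[] bytes = { 0xEF, 0xBB, 0xBF, (byte)'<', (byte)'p', (byte)'>' };
        string decoded = Encoding.UTF8.GetString(bytes);

        Console.WriteLine(decoded.Length);                  // 4
        Console.WriteLine(StripLeadingBom(decoded).Length); // 3
    }
}
```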

I used this quick and dirty oneliner to clean some input from LTR/RTL marks left over by the broken Windows 10 calculator app. It's probably a far cry from perfect but good enough for a quick fix:

string cleaned = new string(input.Where(c => !char.IsControl(c) && (char.IsLetterOrDigit(c) || char.IsPunctuation(c) || char.IsSeparator(c) || char.IsSymbol(c) || char.IsWhiteSpace(c))).ToArray());

TLDR Answer

Use this regex to match the characters you want to remove:

[\p{Cc}\p{Cn}\p{Cs}]

Like this:

var regex = new Regex(@"[\p{Cc}\p{Cn}\p{Cs}]");

TLDR Explanation

  • \p{Cc} : Matches control characters.
  • \p{Cn} : Matches unassigned characters.
  • \p{Cs} : Matches surrogate code units. In .NET this includes both halves of a valid surrogate pair, so leave it out if the text may contain emoji or other characters outside the Basic Multilingual Plane.

The uppercase form \P{...} is the negation: \P{Cc} matches any character that is not a control character.

Working Demo

In this demo, I use this regex to search the string "Hello, World!" followed by (char)4, which is the END OF TRANSMISSION control character (U+0004).

using System;
using System.Text.RegularExpressions;

public class Test {
    public static void Main() {
        // Matches any control, unassigned, or surrogate character.
        var regex = new Regex(@"[\p{Cc}\p{Cn}\p{Cs}]");
        var matches = regex.Matches("Hello, World!" + (char)4);
        Console.WriteLine("Results: " + matches.Count);
        foreach (Match match in matches) {
            // Print the code point, since the character itself is invisible.
            Console.WriteLine("Result: U+" + ((int)match.Value[0]).ToString("X4"));
        }
    }
}

The output from the above code:

Results: 1
Result: U+0004

Alternatives

  • \p{C} : Matches every character in the Unicode "Other" group: control, format, surrogate, private-use, and unassigned.
  • \p{Cc} : Matches control characters only.
  • [\p{Cc}\p{Cn}] : Matches control or unassigned characters.
  • [\p{Cc}\p{Cn}\p{Cs}] : Matches control, unassigned, or surrogate characters.
  • [\p{Cc}\p{Cn}\p{Cs}\p{Cf}] : Matches control, unassigned, surrogate, or format characters (format characters include the left-to-right mark).

Source and Explanation

Take a look at the Unicode character properties that can be tested within a regex. You should be able to use similar regexes in Microsoft .NET, JavaScript, Java, PHP, Ruby, Perl, Golang, and even Adobe, though the supported categories and required flags vary by engine; Python's built-in re module, for example, does not support \p{...}. Knowing Unicode character classes is very transferable knowledge, so I recommend using it!

I ran into this error with the AWS S3 SDK: "Target resource path[name -3.30.2022 -15.27.00.pdf] has bidirectional characters, which are not supported by System.Uri and thus cannot be handled by the .NET SDK"

The filename in my case contained the Unicode character LEFT-TO-RIGHT MARK (U+200E) between the dots. These characters were not visible in HTML or in Notepad++. When the text was pasted into the Visual Studio 2019 editor, the Unicode characters became visible and I was able to solve the issue.

U+200E LEFT-TO-RIGHT MARK
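
When no editor at hand shows the characters, they can also be located in code. The sketch below lists every character whose Unicode category is Control, Format, PrivateUse or OtherNotAssigned, with its index and code point. Surrogates are deliberately not reported so that emoji are not flagged. The names HiddenCharReport and Describe are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class HiddenCharReport
{
    // Hypothetical helper: reports index, code point, and category of suspicious characters.
    public static List<string> Describe(string input)
    {
        var findings = new List<string>();
        for (int i = 0; i < input.Length; i++)
        {
            UnicodeCategory cat = char.GetUnicodeCategory(input[i]);
            bool suspicious = cat == UnicodeCategory.Control
                           || cat == UnicodeCategory.Format
                           || cat == UnicodeCategory.PrivateUse
                           || cat == UnicodeCategory.OtherNotAssigned;
            if (suspicious)
            {
                findings.Add($"index {i}: U+{(int)input[i]:X4} ({cat})");
            }
        }
        return findings;
    }

    public static void Main()
    {
        foreach (string line in Describe("name\u200E.pdf"))
        {
            Console.WriteLine(line); // index 4: U+200E (Format)
        }
    }
}
```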

The problem was solved by removing all control and other non-printable characters from the filename using the following code.

var input = Regex.Replace(s, @"\p{C}+", string.Empty);

Credit Source: https://stackoverflow.com/a/40568888/1165173
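
One caveat with \p{C}: it also covers tab, carriage return and line feed, and in .NET it matches each half of a surrogate pair, so emoji and other characters outside the Basic Multilingual Plane are removed too. A narrower sketch, assuming you only want to drop control and format characters while keeping tabs and line breaks, can use .NET character class subtraction. The names NarrowCleaner and Clean are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NarrowCleaner
{
    // Control (Cc) and format (Cf) characters, minus tab, CR and LF.
    private static readonly Regex Unwanted =
        new Regex(@"[\p{Cc}\p{Cf}-[\t\r\n]]");

    public static string Clean(string input) => Unwanted.Replace(input, string.Empty);

    public static void Main()
    {
        // Line break, left-to-right mark, U+0004, and an emoji (a surrogate pair in UTF-16).
        string sample = "Line 1\r\nna\u200Eme\u0004 \U0001F600";

        // The mark and U+0004 are removed; the line break and the emoji are kept.
        Console.WriteLine(Clean(sample) == "Line 1\r\nname \U0001F600"); // True
    }
}
```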

string output = new string(input.Where(c => !char.IsControl(c)).ToArray());

This solves the problem for control characters. I had a non-printable substitute character (ASCII 26) in a string that was causing my app to break, and this line of code removed it.

