Removing hidden characters from within strings

My problem:

I have a .NET application that sends out newsletters via email. When the newsletters are viewed in Outlook, Outlook displays a question mark in place of a hidden character it can't recognize. These hidden characters come from end users who copy and paste the HTML that makes up the newsletters into a form and submit it. C#'s Trim() removes these hidden characters if they occur at the beginning or end of the string. When the newsletter is viewed in Gmail, Gmail does a good job of ignoring them. When I paste these hidden characters into a Word document and turn on the “show paragraph marks and hidden symbols” option, the symbols appear as one rectangle inside a bigger rectangle. The text that makes up the newsletters can be in any language, so accepting Unicode characters is a must. I've tried looping through the string to detect the character, but the loop doesn't recognize it and passes over it. Asking the end user to paste the HTML into Notepad before submitting it is out of the question.

My question:
How can I detect and eliminate these hidden characters using C#?
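
Before removing anything, it can help to find out which characters are actually hiding in the pasted text. The following is a minimal sketch, not part of the original question; the class and method names (HiddenCharInspector, Describe) are made up for illustration. It lists every character that is neither printable ASCII nor a letter or digit, together with its code point and Unicode category.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class HiddenCharInspector
{
    // Returns one entry per character that is neither printable ASCII
    // nor a letter/digit, e.g. "index 5: U+200E (Format)".
    public static List<string> Describe(string text)
    {
        var findings = new List<string>();
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            bool printableAscii = c >= ' ' && c <= '~';
            if (printableAscii || char.IsLetterOrDigit(c))
            {
                continue;
            }
            UnicodeCategory category = char.GetUnicodeCategory(c);
            findings.Add($"index {i}: U+{(int)c:X4} ({category})");
        }
        return findings;
    }
}
```

Calling HiddenCharInspector.Describe on a submitted newsletter and logging the result shows which category the offending character belongs to, which in turn tells you which of the filters in the answers is likely to catch it.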

You can remove all control characters from your input string with something like this:

// requires: using System.Linq;
string input = "..."; // this is your input string
string output = new string(input.Where(c => !char.IsControl(c)).ToArray());

Here is the documentation for the IsControl() method.

Or, if you want to keep letters and digits only, you can use the IsLetter and IsDigit functions:

string output = new string(input.Where(c => char.IsLetter(c) || char.IsDigit(c)).ToArray());

I usually use this regular expression to replace all non-printable characters.

By the way, most people think that tab, line feed and carriage return are non-printable characters, but for me they are not.

So here is the expression:

string output = Regex.Replace(input, @"[^\u0009\u000A\u000D\u0020-\u007E]", "*");
  • ^ at the start of the class means: match any character that is NOT one of the following
  • \u0009 is tab
  • \u000A is line feed
  • \u000D is carriage return
  • \u0020-\u007E means everything from space to ~, that is, the printable ASCII characters

See an ASCII table if you want to make changes. Remember that this also replaces every non-ASCII character.

To test the above, you can create a string yourself like this:

    string input = string.Empty;

    for (int i = 0; i < 255; i++)
    {
        input += (char)(i);
    }

What worked best for me is:

string result = new string(value.Where(c =>  char.IsLetterOrDigit(c) || (c >= ' ' && c <= byte.MaxValue)).ToArray());

Here I first check whether the character is a letter or digit, so that non-English letters are kept. Otherwise I check whether it lies between Space and byte.MaxValue, which drops the control characters below Space while keeping punctuation.

Some suggest using IsControl to check whether a character is non-printable, but IsControl does not flag the left-to-right mark, for example.

new string(input.Where(c => !char.IsControl(c)).ToArray());

IsControl misses some invisible characters such as the left-to-right mark (LRM), a character that commonly sneaks into a string during copy and paste. If you are sure that your string contains only letters and digits, you can use IsLetterOrDigit:

new string(input.Where(c => char.IsLetterOrDigit(c)).ToArray())

If your string also contains special characters, you can keep only the ASCII range (note that this drops every non-ASCII character too):

new string(input.Where(c => c < 128).ToArray())
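
The reason IsControl passes over the left-to-right mark is that U+200E belongs to the Unicode category Format (Cf), while IsControl only reports category Control (Cc). A possible middle ground between the filters above, sketched here under the assumption that tab, line feed and carriage return should survive, is to filter on the category directly. The names CategoryFilter and RemoveControlAndFormat are illustrative, not from the answers.

```csharp
using System.Globalization;
using System.Linq;

public static class CategoryFilter
{
    // Keeps tab, line feed and carriage return; drops every other character
    // whose Unicode category is Control (Cc) or Format (Cf).
    // U+200E and U+200F are in category Format.
    public static string RemoveControlAndFormat(string input)
    {
        return new string(input.Where(c =>
        {
            if (c == '\t' || c == '\n' || c == '\r')
            {
                return true;
            }
            UnicodeCategory category = char.GetUnicodeCategory(c);
            return category != UnicodeCategory.Control
                && category != UnicodeCategory.Format;
        }).ToArray());
    }
}
```

Unlike the c < 128 filter, this keeps letters from any language. One caveat: the Format category also contains the zero-width joiner and non-joiner, which some scripts and emoji sequences rely on, so you may want to exempt those as well.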

You can do this:

var hChars = new char[] {...};
var result = new string(yourString.Where(c => !hChars.Contains(c)).ToArray());
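
The answer leaves the contents of the array elided. As a hedged illustration only, here is what the same idea could look like with a few invisible characters that often arrive via copy and paste; the list is my assumption and not the answerer's, and the names KnownHiddenChars and Strip are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class KnownHiddenChars
{
    // Illustrative list only; extend it with whatever characters you
    // actually find in your input.
    private static readonly HashSet<char> Hidden = new HashSet<char>
    {
        '\u200B', // ZERO WIDTH SPACE
        '\u200E', // LEFT-TO-RIGHT MARK
        '\u200F', // RIGHT-TO-LEFT MARK
        '\uFEFF', // ZERO WIDTH NO-BREAK SPACE / byte order mark
    };

    public static string Strip(string input)
    {
        return new string(input.Where(c => !Hidden.Contains(c)).ToArray());
    }
}
```

A HashSet keeps the lookup cheap as the list grows; for a handful of characters a plain array with Contains, as in the answer, works just as well.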

If you know what these characters are, you can use string.Replace:

newString = oldString.Replace("?", "");

where "?" represents the character you want to strip out.

The drawback of this approach is that you need to make this call repeatedly if there are multiple characters that you want to remove.

It has been a while, but this hasn't been answered yet.

How do you include the HTML content in the sending code? If you are reading it from a file, check the file encoding. If you are using UTF-8 with signature, that is, with a byte order mark (the name varies slightly between editors), this may cause the weird character at the beginning of the mail.
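
If the byte order mark turns out to be the culprit, it can be dropped when the content is loaded. This is a minimal sketch under that assumption; BomHelper and its method names are invented for illustration. It covers both a string that already starts with U+FEFF and raw UTF-8 bytes that start with the signature EF BB BF.

```csharp
using System.Text;

public static class BomHelper
{
    // Removes a leading U+FEFF if the text starts with one.
    public static string StripLeadingBom(string text)
    {
        return text.Length > 0 && text[0] == '\uFEFF' ? text.Substring(1) : text;
    }

    // Decodes UTF-8 bytes, skipping the three-byte signature EF BB BF when present.
    public static string DecodeUtf8(byte[] bytes)
    {
        bool hasBom = bytes.Length >= 3
            && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
        int offset = hasBom ? 3 : 0;
        return Encoding.UTF8.GetString(bytes, offset, bytes.Length - offset);
    }
}
```

Note that this only handles a mark at the very start of the content; a U+FEFF pasted into the middle of the HTML needs one of the filtering approaches from the other answers.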

I used this quick and dirty one-liner to clean input of the LTR/RTL marks left over by the broken Windows 10 calculator app. It's probably a far cry from perfect, but good enough for a quick fix:

string cleaned = new string(input.Where(c => !char.IsControl(c) && (char.IsLetterOrDigit(c) || char.IsPunctuation(c) || char.IsSeparator(c) || char.IsSymbol(c) || char.IsWhiteSpace(c))).ToArray());

TLDR Answer

Use these Unicode categories in a regex...

\p{Cc}\p{Cn}\p{Cs}

Like this...

var regex = new Regex(@"[\p{Cc}\p{Cn}\p{Cs}]");

TLDR Explanation

  • \p{Cc} : Matches control characters.
  • \p{Cn} : Matches unassigned characters.
  • \p{Cs} : Matches surrogate code units.

Put together in one character class, the regex matches every character you do not want, so replacing the matches with an empty string removes them. The negated class [^\p{Cc}\p{Cn}\p{Cs}] matches everything else. Note that the uppercase form \P{..} negates a single category, and a class such as [\P{Cc}\P{Cn}\P{Cs}] is a union that matches every character, so it cannot be used as a filter. Also note that .NET regexes work on UTF-16 code units, so removing \p{Cs} also removes characters outside the Basic Multilingual Plane, such as emoji.

Working Demo

In this demo, I use this regex to search the string "Hello, World!" followed by an invisible character. That character at the end is (char)4, the END OF TRANSMISSION control character.

using System;
using System.Text.RegularExpressions;

public class Test {
    public static void Main() {
        // Matches control, unassigned and surrogate characters.
        var regex = new Regex(@"[\p{Cc}\p{Cn}\p{Cs}]");
        var matches = regex.Matches("Hello, World!" + (char)4);
        Console.WriteLine("Results: " + matches.Count);
        foreach (Match match in matches) {
            // Print the code point, since the character itself is invisible.
            Console.WriteLine("Result: U+" + ((int)match.Value[0]).ToString("X4"));
        }
    }
}

Full Working Demo at IDEOne.com

The output from the above code:

Results: 1
Result: U+0004

Alternatives

  • \p{C} : Matches every invisible "other" character: control, format, surrogate, private-use and unassigned. \P{C} matches only the remaining, visible characters.
  • \p{Cc} : Matches control characters only. \P{Cc} matches only non-control characters.
  • [\p{Cc}\p{Cn}] : Matches control or unassigned characters.
  • [\p{Cc}\p{Cn}\p{Cs}] : Matches control, unassigned or surrogate characters.
  • [\p{Cc}\p{Cn}\p{Cs}\p{Cf}] : Matches control, unassigned, surrogate or formatting characters. The left-to-right mark (U+200E) is a formatting character.

Source and Explanation

Take a look at the available Unicode Character Properties that can be used within a regex. You should be able to use these regexes in Microsoft .NET, JavaScript, Python, Java, PHP, Ruby, Perl, Golang, and even Adobe. Knowing Unicode character classes is very transferable knowledge, so I recommend using them!

I experienced an error with the AWS S3 SDK: "Target resource path[name -3.30.2022 -15.27.00.pdf] has bidirectional characters, which are not supported by System.Uri and thus cannot be handled by the .NET SDK".

The filename in my case contained the Unicode character 'LEFT-TO-RIGHT MARK' (U+200E) between the dots. These characters were not visible in HTML or in Notepad++. When the text was pasted into the Visual Studio 2019 editor, the Unicode characters were visible and I was able to solve the issue.

U+200E LEFT-TO-RIGHT MARK

The problem was solved by removing all control and other non-printable characters from the filename using the following line.

var input = Regex.Replace(s, @"\p{C}+", string.Empty);

Credit Source: https://stackoverflow.com/a/40568888/1165173
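
For the newsletter HTML in the original question, \p{C} on its own may be too aggressive: it also removes line breaks and tabs, and it removes the surrogate halves that make up emoji. A possible variant, given here as a sketch rather than a tested production filter, uses .NET's character class subtraction to exempt tab, CR and LF and leaves surrogates alone. The names NewsletterCleaner and Clean are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class NewsletterCleaner
{
    // Control, format, private-use and unassigned characters, minus tab, CR and LF.
    // Surrogates (Cs) are deliberately not listed, so emoji and other
    // supplementary characters survive.
    private static readonly Regex Hidden =
        new Regex(@"[\p{Cc}\p{Cf}\p{Co}\p{Cn}-[\t\r\n]]", RegexOptions.Compiled);

    public static string Clean(string html)
    {
        return Hidden.Replace(html, string.Empty);
    }
}
```

Which characters count as unassigned depends on the Unicode data shipped with the runtime, so drop \p{Cn} from the class if very recent characters go missing.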

string output = new string(input.Where(c => !char.IsControl(c)).ToArray());

This will surely solve the problem. I had a non-printable substitute character (ASCII 26) in a string which was causing my app to break, and this line of code removed it.

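Statement: the technical posts on this site follow the CC BY-SA 4.0 license. If you reproduce them, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.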
