
Detect special symbols in C#

I'm working on a C# project in which some data contains characters that are not recognised by the encoding. They are displayed like this:

"Some text � with special � symbols in it".

I have no control over the encoding process, and the data comes from files of various origins and formats. I want to be able to flag data that contains such characters as erroneous or incomplete. Right now I am able to detect them this way:

if(myString.Contains("�"))
{
   //Do stuff
}

While it does work, it doesn't feel quite right to use the weird symbol directly in the Contains function. Isn't there a cleaner way to do this?

EDIT:

After checking back with the team responsible for reading the files, this is how they do it:

var sr = new StreamReader(filePath, true);
var content = sr.ReadToEnd();

Passing true as the second parameter of StreamReader is supposed to detect the encoding from the file's BOM and use it to read the content. It doesn't always work though, as some files don't carry that information, which is why their data is read incorrectly.

We've run some tests, and using StreamReader(filePath, Encoding.Default) instead appears to work for most if not all of the files we had issues with. As expected, files that were working before no longer work, because they do not use the default encoding.

So the best solution for us would be to do the following: read the file while trying to detect its encoding, then, if that wasn't successful, read it again with the default encoding.

The problem remains the same though: after trying to detect the file's encoding, how do we check whether the data has been read incorrectly?

The � character is not a special symbol. It's the Unicode Replacement Character (U+FFFD). It means that the code tried to decode non-Unicode text using the wrong encoding. Any bytes that had no valid match in that encoding were replaced with �.
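A minimal sketch of a cleaner check, assuming the goal is only to flag strings that already contain U+FFFD. The class and method names (ReplacementCharCheck, ContainsReplacementChar) are hypothetical. The only real change from the Contains call in the question is that the character is written as the escape '\uFFFD', so the source file does not depend on how the symbol itself is saved or displayed.

```csharp
using System;

public static class ReplacementCharCheck
{
    // U+FFFD, the Unicode Replacement Character, written as an escape
    // so the source file itself contains only ASCII.
    public const char ReplacementChar = '\uFFFD';

    // Returns true if the decoded text contains at least one U+FFFD.
    public static bool ContainsReplacementChar(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        return text.IndexOf(ReplacementChar) >= 0;
    }
}
```

This only detects the damage after the fact, and a file that legitimately contains U+FFFD would be flagged as well.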

The solution is to read the file using the correct encoding. The default encoding used by the File methods and StreamReader is UTF-8. You can pass a different encoding using the appropriate constructor, e.g. StreamReader(String, Encoding, Boolean). To use the system locale's codepage, you need to use Encoding.Default (this holds on .NET Framework; on .NET Core and later, Encoding.Default always returns UTF-8):

var sr = new StreamReader(filePath, Encoding.Default);

You can use the StreamReader(String, Encoding, Boolean) constructor to autodetect Unicode encodings from the BOM and fall back to a different encoding.

Assuming the files are either some type of Unicode or match your system locale, you can use:

var sr = new StreamReader(filePath, Encoding.Default, true);

StreamReader's source shows that the DetectEncoding method checks the first bytes of a file to determine the encoding. If one is found, it is used instead of the supplied encoding. The operation doesn't cause extra IO, because the method checks the class's internal buffer.
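The constructor call above can be wrapped so that the caller also learns which encoding was actually used. This is only a sketch: the names BomAwareReader and ReadWithFallback are hypothetical, and the fallback encoding is passed in by the caller rather than hard-coded to Encoding.Default. It relies on StreamReader.CurrentEncoding, which reflects the detected encoding only after the first read.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomAwareReader
{
    // Reads a whole file. If the file starts with a Unicode BOM, StreamReader
    // switches to that encoding; otherwise the supplied fallback is used.
    public static string ReadWithFallback(string filePath, Encoding fallback, out Encoding used)
    {
        using (var sr = new StreamReader(filePath, fallback, true))
        {
            string content = sr.ReadToEnd();
            // CurrentEncoding is only meaningful after data has been read.
            used = sr.CurrentEncoding;
            return content;
        }
    }
}
```

A call such as ReadWithFallback(filePath, Encoding.Default, out var used) would behave like the one-liner above, while letting you log whether a BOM overrode the fallback.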

EDIT

I just realized you can't actually load the raw file into a .NET string and still have full information about the original file.

The project here uses the MLang API, which does a better job because it does not load the file into a .NET string before guessing. There is also a related SO question.
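One way to work on the raw bytes before committing to a string is sketched below. It is not the MLang approach mentioned above, but a simpler technique swapped in for illustration: decode with a strict UTF-8 decoder that throws on invalid bytes, and fall back to a caller-supplied encoding when that fails. The names StrictDecoder, Decode and ReadFile are hypothetical, and the sketch assumes every file is either UTF-8 (with or without a BOM) or in the fallback encoding. UTF-16 files are not handled.

```csharp
using System;
using System.IO;
using System.Text;

public static class StrictDecoder
{
    // UTF-8 decoder that throws DecoderFallbackException on invalid bytes
    // instead of silently substituting U+FFFD.
    private static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Tries strict UTF-8 first; uses the fallback encoding if the bytes
    // are not valid UTF-8.
    public static string Decode(byte[] bytes, Encoding fallback)
    {
        try
        {
            string text = StrictUtf8.GetString(bytes);
            // GetString keeps a leading BOM as U+FEFF, so drop it here.
            if (text.Length > 0 && text[0] == '\uFEFF')
            {
                text = text.Substring(1);
            }
            return text;
        }
        catch (DecoderFallbackException)
        {
            return fallback.GetString(bytes);
        }
    }

    // Convenience wrapper that reads the raw bytes from disk first.
    public static string ReadFile(string filePath, Encoding fallback)
    {
        return Decode(File.ReadAllBytes(filePath), fallback);
    }
}
```

Because the failure is signalled by an exception rather than by U+FFFD in the result, there is no need to inspect the decoded string afterwards. The caveat is that a short non-UTF-8 file can occasionally happen to be valid UTF-8, so this remains a heuristic.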
