Differentiating between two text files having the same content but in different formats, using a C# program

I have two text files. They both contain the same information but come in two different formats.

Format 1 has line breaks and looks well formatted. Format 2 "appears" to be continuous, but in reality it also has line breaks; they are just represented in a very weird way.

https://www.dropbox.com/sh/ljlqen94a5cwza2/AAAOcuYU_EDnSLiNPRP_CDbga?dl=0

Please refer to the attachments (LineBreak.dat and NoLineBreak.dat). In the latter file there are line breaks, but they are not visible - it looks like some kind of transformation on the data has changed their representation. If you move through the file with the right arrow key, counting from the first position (i=0), then at i=19 you will find that the cursor gets stuck for one press - you have to press twice to move to the next position. This happens at many places in the document - I figured these are the places where there were line breaks that have now been corrupted.

In my business case, the latter type of file is to be regarded as invalid. So I need to write a C# program that detects whether a file is in Format 1 or Format 2, and I need help with this.

I tried to see if the encoding of the two files is different by reading the BOM, but it's the same for both files. I got the following BOM readings: [0]: 57 [1]: 57 [2]: 48 [3]: 54

I am using the following program to detect the encoding:

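// Requires: using System.IO; using System.Text;
public static void GetEncoding(string pFilePath, out Encoding pFileEncoding)
{
    // Read the first four bytes, where a BOM would be
    var bom = new byte[4];
    using (var file = new FileStream(pFilePath, FileMode.Open, FileAccess.Read))
    {
        file.Read(bom, 0, 4);
    }

    // Analyze the BOM. The checks are chained with else-if so the first match wins;
    // as separate ifs, the final assignment always overwrote the result with ASCII.
    if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) pFileEncoding = Encoding.UTF7;
    else if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) pFileEncoding = Encoding.UTF8;
    else if (bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0) pFileEncoding = Encoding.UTF32; // UTF-32LE, must be tested before UTF-16LE
    else if (bom[0] == 0xff && bom[1] == 0xfe) pFileEncoding = Encoding.Unicode; // UTF-16LE
    else if (bom[0] == 0xfe && bom[1] == 0xff) pFileEncoding = Encoding.BigEndianUnicode; // UTF-16BE
    else if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) pFileEncoding = new UTF32Encoding(true, true); // UTF-32BE
    else pFileEncoding = Encoding.ASCII; // no BOM found; or Encoding.Default
}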

The two files have different styles of line break. You can use a string replace on one of the files to make them identical. See https://superuser.com/questions/545461/replace-carriage-return-and-line-feed-in-notepad for a way to do it manually, but you can also do it in your C# code: just replace \n with \r\n.

If you want to be sure it works everywhere, you can replace both \n and \r\n with Environment.NewLine.
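One caveat with a plain replace of \n by \r\n is that a file which already contains \r\n ends up with \r\r\n. A minimal sketch of a normalization that avoids this is shown here; the class and method names (NewlineNormalizer, Normalize) are made up for illustration and are not from the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NewlineNormalizer
{
    // Matches a Windows pair first, then a lone carriage return or a lone linefeed,
    // so an existing "\r\n" is treated as one line break rather than two.
    private static readonly Regex AnyLineBreak = new Regex(@"\r\n|\r|\n");

    // Rewrites every line break in 'text' to 'newline'.
    // When 'newline' is null, Environment.NewLine is used.
    public static string Normalize(string text, string newline = null)
    {
        string target = newline ?? Environment.NewLine;
        return AnyLineBreak.Replace(text, _ => target);
    }
}
```

Calling NewlineNormalizer.Normalize(text) on the contents of both files should leave them with the same line breaks, after which they can be compared directly.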

Hope it helps :)

The Format2 file isn't corrupt; it just has Unix-style line breaks (a lone linefeed, \n) at the end of each line. The other file has Windows-style line breaks (a carriage return followed by a linefeed, \r\n).

You can easily fix the Format2 files by checking for the existence of \r and, if none exists in the file, doing a string.Replace("\n", "\r\n") across the whole file.
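That check-then-replace step could look roughly like the sketch below. The names (UnixLineBreakFixer, ToWindowsIfNoCarriageReturn, FixFile) are hypothetical, and the file-level wrapper assumes the data is plain ASCII or UTF-8 text, since File.WriteAllText writes UTF-8 without a BOM by default.

```csharp
using System.IO;

public static class UnixLineBreakFixer
{
    // Returns the text unchanged if it already contains a carriage return;
    // otherwise converts every "\n" to "\r\n".
    public static string ToWindowsIfNoCarriageReturn(string text)
    {
        if (text.IndexOf('\r') >= 0) return text;
        return text.Replace("\n", "\r\n");
    }

    // Hypothetical wrapper: rewrites the file in place only when something changed.
    // Assumption: the file is ASCII/UTF-8 text that fits in memory.
    public static void FixFile(string path)
    {
        string original = File.ReadAllText(path);
        string converted = ToWindowsIfNoCarriageReturn(original);
        if (converted != original)
        {
            File.WriteAllText(path, converted);
        }
    }
}
```

Note that this leaves a file with mixed line breaks untouched, because a single \r anywhere is enough to skip the replace.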

If you open your text file in a "potent" text editor like Notepad++, you can see every single byte in the file, even "whitespace" bytes that "normal" text editors do not display.

In your case you'll find that the line breaks are "Linefeed" characters ('\n', Dec 10, Hex 0x0A). This is the usual way to represent a "New Line" on Unix systems.

If you want to flag such files as "invalid", just search for Carriage Return ('\r', Dec 13, Hex 0x0D) and "Linefeed" characters: a linefeed that is not preceded by a carriage return marks a Unix-style line break.

In Windows text files you'll find 0x0D/0x0A pairs.

In Unix files, 0x0A only.

In classic (pre-OS X) Apple files, 0x0D only.

(All this has nothing to do with encodings.)
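To turn the byte patterns above into the check the question asks for, one option is to scan the raw bytes and count each kind of line break. The following is only a sketch under stated assumptions: the enum and class names are invented here, and the byte-level scan assumes a single-byte or UTF-8 encoding (in UTF-16 or UTF-32 the bytes 0x0D and 0x0A are padded with zero bytes and would need decoding first).

```csharp
using System.IO;

public enum LineEndingStyle { None, Windows, Unix, ClassicMac, Mixed }

public static class LineEndingDetector
{
    // Classifies the line breaks found in a buffer of ASCII/UTF-8 bytes.
    public static LineEndingStyle Detect(byte[] data)
    {
        int crlf = 0, lf = 0, cr = 0;
        for (int i = 0; i < data.Length; i++)
        {
            if (data[i] == 0x0D)
            {
                // A carriage return followed by a linefeed is one Windows line break.
                if (i + 1 < data.Length && data[i + 1] == 0x0A) { crlf++; i++; }
                else cr++;
            }
            else if (data[i] == 0x0A)
            {
                lf++; // linefeed with no preceding carriage return
            }
        }

        int kinds = (crlf > 0 ? 1 : 0) + (lf > 0 ? 1 : 0) + (cr > 0 ? 1 : 0);
        if (kinds == 0) return LineEndingStyle.None;
        if (kinds > 1) return LineEndingStyle.Mixed;
        if (crlf > 0) return LineEndingStyle.Windows;
        return lf > 0 ? LineEndingStyle.Unix : LineEndingStyle.ClassicMac;
    }

    // Convenience wrapper; reads the whole file, so it suits files that fit in memory.
    public static LineEndingStyle DetectFile(string path)
    {
        return Detect(File.ReadAllBytes(path));
    }
}
```

With this, a file could be treated as valid (Format 1) only when LineEndingDetector.DetectFile(path) returns LineEndingStyle.Windows; whether Mixed or None should also be rejected is a business decision the original post does not settle.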
