How to display line numbers when comparing files with the Linux "comm" tool

I would like to diff two very large (multi-GB) files using Linux command-line tools and see the line numbers of the differences. The order of the data matters.

I am running on a Linux machine, and the standard diff tool fails with a "memory exhausted" error. The -H option had no effect.

In my application, I only need to stream the diff results. That is, I just want to look at the first few differences; I don't need to inspect the entire file. If there are differences, a quick glance will tell me what is wrong.

comm seems well suited to this, but it does not display the line numbers of the differences.

In general, only a few hundred lines of my multi-GB files differ; the rest of the file is the same.

Is there a way to get comm to print the line number? Or a way to make diff run without loading the entire file into memory (for example by cutting the input files into 1k blocks, without actually creating a million 1k files in my filesystem and cluttering everything up)?

I won't use comm, but since you said WHAT you need in addition to HOW you thought you should do it, I'll focus on the "WHAT you need" instead:

An interesting way would be to use paste and awk: paste can show 2 files "side by side" using a separator. If you use \n as the separator, it displays line 1 of each file, followed by line 2 of each, and so on.

So the script could simply be (once you know that both files have the same number of lines):

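 paste -d '\n' /tmp/file1 /tmp/file2 | awk '
        # odd input lines come from file1: remember the line
        NR%2  { linefirstfile = $0 }
        # even input lines come from file2: compare with the remembered line
      !(NR%2) { if ($0 != linefirstfile)
                      { print "line", NR/2, ": "; print linefirstfile; print $0 } }'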

(Interestingly, this solution can easily be extended to diff N files in a single read, whatever the sizes of the N files are. Just add a check that all of them have the same number of lines before the comparison step; otherwise paste keeps going until the longest file ends and substitutes empty lines for the files that ran out.)
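One possible way to sketch that N-file extension is shown here. The function name `diff_n` and the output format are hypothetical, and the sketch assumes every file has the same number of lines; it treats the first file as the reference and reports each line number where any other file disagrees with it.

```shell
#!/bin/sh
# Hypothetical helper: report the line numbers at which N files are not
# all identical. Assumes all files have the same number of lines.
diff_n() {
    # paste reuses the single '\n' delimiter between every pair of files,
    # so the output is line 1 of each file, then line 2 of each, and so on.
    paste -d '\n' "$@" | awk -v n="$#" '
        { idx = (NR - 1) % n }                 # position inside the current group
        idx == 0             { ref = $0; differ = 0 }   # first file is the reference
        idx > 0 && $0 != ref { differ = 1 }
        idx == n - 1 && differ { print "line", NR / n, "differs" }'
}
```

It would be called as `diff_n /tmp/file1 /tmp/file2 /tmp/file3`.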

Here is a (short) example to show how it works:

$ cat > /tmp/file1
A
C %FORGOT% fmsdflmdflskdf dfldksdlfkdlfkdlkf
E

$ cat > /tmp/file2
A
C sdflmsdflmsdfsklmdfksdmfksd fmsdflmdflskdf dfldksdlfkdlfkdlkf
E

$ paste -d '\n' /tmp/file1 /tmp/file2
A
A
C %FORGOT% fmsdflmdflskdf dfldksdlfkdlfkdlkf
C sdflmsdflmsdfsklmdfksdmfksd fmsdflmdflskdf dfldksdlfkdlfkdlkf
E
E

$ paste -d '\n' /tmp/file1 /tmp/file2 | awk '
     NR%2  { linefirstfile=$0 ; }
   !(NR%2) { if ( $0 != linefirstfile ) 
               { print "line",NR/2,": "; print linefirstfile ; print $0 ; } }'
line 2 :
C %FORGOT% fmsdflmdflskdf dfldksdlfkdlfkdlkf
C sdflmsdflmsdfsklmdfksdmfksd fmsdflmdflskdf dfldksdlfkdlfkdlkf

If the files might not have the same number of lines, you can first add a check of the line counts, comparing $(wc -l < /tmp/file1) and $(wc -l < /tmp/file2) (reading from standard input makes wc print only the number, without the file name), and only run the paste ... | awk pipeline if they match, so that every pair of lines really holds one line from each file. (Of course, in that case there will be one extra, fast, full read of each file.)
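The line-count guard could be sketched as follows. The function name `diff_lines` and its return code are hypothetical; the sketch assumes both arguments are regular files on a Linux system, where `wc -l < file` prints a bare number.

```shell
#!/bin/sh
# Hypothetical helper: compare two files line by line, but only after
# checking that they have the same number of lines.
diff_lines() {
    n1=$(wc -l < "$1") || return 2
    n2=$(wc -l < "$2") || return 2
    if [ "$n1" -ne "$n2" ]; then
        echo "line counts differ: $1 has $n1, $2 has $n2" >&2
        return 2
    fi
    paste -d '\n' "$1" "$2" | awk '
        NR%2    { first = $0 }                 # line from the first file
        !(NR%2) { if ($0 != first)             # matching line from the second file
                      { print "line", NR/2, ": "; print first; print $0 } }'
}
```

For the example files above, `diff_lines /tmp/file1 /tmp/file2` would run the same comparison as before, while files of different lengths produce only the message on standard error.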

You can easily adjust it to display exactly what you need. And you could quit after the Nth difference, either automatically with a counter in the awk script, or by pressing CTRL-C once you have seen enough.
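A minimal sketch of the counter variant, assuming both files have the same number of lines. The function name `first_diffs`, the awk variable names and the default limit of 10 are hypothetical choices for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print at most $3 differences (default 10), then stop.
first_diffs() {
    paste -d '\n' "$1" "$2" | awk -v max="${3:-10}" '
        NR%2 { first = $0; next }              # line from the first file
        $0 != first {                          # matching line from the second file
            print "line", NR/2, ": "; print first; print $0
            if (++shown >= max) exit           # stop reading after max differences
        }'
}
```

Because awk exits as soon as the limit is reached, the rest of the multi-GB input is never read: paste is stopped the next time it tries to write to the closed pipe.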

Which versions of diff have you tried? GNU diff has a "--speed-large-files" option, which may help.
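A small sketch of how that option could be tried, assuming GNU diff is installed. The wrapper name `quick_diff` and the default of 40 output lines are hypothetical. The option changes the heuristic diff uses and is not guaranteed to avoid the memory error, since diff may still need to read both files before printing anything.

```shell
#!/bin/sh
# Hypothetical wrapper: run GNU diff with --speed-large-files and keep only
# the first $3 lines of output (default 40).
quick_diff() {
    diff --speed-large-files "$1" "$2" | head -n "${3:-40}"
}
```

The normal diff output format already includes line numbers (for example `2c2`), so no extra processing is needed if this works for your files.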

The comm tool assumes the lines are sorted.


 