Comparing two files in the Linux terminal

Question

There are two files called "a.txt" and "b.txt"; both contain a list of words. Now I want to check which words are extra in "a.txt" and are not in "b.txt".

I need an efficient algorithm, as I need to compare two dictionaries.

Answer 1

If you have vim installed, try this:

vimdiff file1 file2

or

vim -d file1 file2

You will find it fantastic.

Answer 2

Sort them and use comm:

comm -23 <(sort a.txt) <(sort b.txt)

comm compares (sorted) input files and by default outputs three columns: lines that are unique to a, lines that are unique to b, and lines that are present in both. By specifying -1, -2 and/or -3 you can suppress the corresponding output. Therefore comm -23 a b lists only the entries that are unique to a. I use the <(...) syntax to sort the files on the fly; if they are already sorted, you don't need this.
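
To make the column behaviour concrete, here is a small sketch using two throwaway word lists (the file names and words are made up for illustration). It sorts into temporary files rather than using <(...), so it should also work in shells that lack process substitution.

```shell
# Work in a scratch directory so nothing real is overwritten.
tmp=$(mktemp -d)
printf '%s\n' pear apple fig > "$tmp/a.txt"
printf '%s\n' fig banana apple > "$tmp/b.txt"

# comm needs sorted input; LC_ALL=C keeps sort and comm in agreement on ordering.
LC_ALL=C sort "$tmp/a.txt" > "$tmp/a.sorted"
LC_ALL=C sort "$tmp/b.txt" > "$tmp/b.sorted"

# -2 hides lines only in b, -3 hides lines in both: what is left is unique to a.
only_in_a=$(LC_ALL=C comm -23 "$tmp/a.sorted" "$tmp/b.sorted")
# -1 and -3 together leave the lines unique to b.
only_in_b=$(LC_ALL=C comm -13 "$tmp/a.sorted" "$tmp/b.sorted")
# -1 and -2 together leave the lines common to both files.
in_both=$(LC_ALL=C comm -12 "$tmp/a.sorted" "$tmp/b.sorted")

echo "only in a: $only_in_a"
echo "only in b: $only_in_b"
echo "in both:" $in_both

rm -rf "$tmp"
```

Suppressing two of the three columns is usually the most convenient form, because the remaining column is printed without any leading tabs.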

Answer 3

If you prefer the diff output style from git diff, you can use it with the --no-index flag to compare files not in a git repository:

git diff --no-index a.txt b.txt

Using a couple of files with around 200k file name strings in each, I benchmarked (with the built-in time command) this approach vs some of the other answers here:

git diff --no-index a.txt b.txt
# ~1.2s

comm -23 <(sort a.txt) <(sort b.txt)
# ~0.2s

diff a.txt b.txt
# ~2.6s

sdiff a.txt b.txt
# ~2.7s

vimdiff a.txt b.txt
# ~3.2s

comm seems to be the fastest by far, while git diff --no-index appears to be the fastest approach for diff-style output.

Update 2018-03-25: You can actually omit the --no-index flag unless you are inside a git repository and want to compare untracked files within that repository. From the man pages:

This form is to compare the given two paths on the filesystem. You can omit the --no-index option when running the command in a working tree controlled by Git and at least one of the paths points outside the working tree, or when running the command outside a working tree controlled by Git.

Answer 4

Try sdiff (man sdiff):

sdiff -s file1 file2

Answer 5

You can use the diff tool in Linux to compare two files. You can use the --changed-group-format and --unchanged-group-format options to filter the required data.

The following three values can be used to select the relevant group for each option:

'%<' get lines from FILE1
'%>' get lines from FILE2
'' (empty string) for removing lines from both files.

E.g.: diff --changed-group-format="%<" --unchanged-group-format="" file1.txt file2.txt

[root@vmoracle11 tmp]# cat file1.txt 
test one
test two
test three
test four
test eight
[root@vmoracle11 tmp]# cat file2.txt 
test one
test three
test nine
[root@vmoracle11 tmp]# diff --changed-group-format='%<' --unchanged-group-format='' file1.txt file2.txt 
test two
test four
test eight
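
The same options can be pointed at the other file. The following is a sketch, assuming GNU diff (the group-format options are GNU extensions) and reusing the sample data above, recreated in a scratch directory so the snippet is self-contained.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'test one' 'test two' 'test three' 'test four' 'test eight' > "$tmp/file1.txt"
printf '%s\n' 'test one' 'test three' 'test nine' > "$tmp/file2.txt"

# '%>' prints the FILE2 side of every changed group; unchanged groups print nothing.
# diff exits with status 1 when the files differ, so "|| true" keeps the
# snippet usable under "set -e".
only_in_file2=$(diff --changed-group-format='%>' --unchanged-group-format='' \
    "$tmp/file1.txt" "$tmp/file2.txt" || true)

echo "$only_in_file2"
rm -rf "$tmp"
```

Note that diff compares the files line by line in order, so on unsorted word lists this reports positional differences rather than a true set difference.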

Answer 6

You can also use colordiff, which displays the output of diff with colors.

About vimdiff: it allows you to compare files via SSH, for example:

vimdiff /var/log/secure scp://192.168.1.25/var/log/secure

Extracted from: http://www.sysadmit.com/2016/05/linux-diferencias-entre-dos-archivos.html

Answer 7

Also, do not forget about mcdiff, the internal diff viewer of GNU Midnight Commander.

For example:

mcdiff file1 file2

Enjoy!

Answer 8

Use comm -13 (requires sorted files):

$ cat file1
one
two
three

$ cat file2
one
two
three
four

$ comm -13 <(sort file1) <(sort file2)
four

Answer 9

Here is my solution for this:

# Create the working directories where the later commands expect them
mkdir -p ~/temp ~/results
cp /usr/share/dict/american-english ~/temp/american-english-dictionary
cp /usr/share/dict/british-english ~/temp/british-english-dictionary
# Count the words in each dictionary
wc -l < ~/temp/american-english-dictionary > ~/results/count-american-english-dictionary
wc -l < ~/temp/british-english-dictionary > ~/results/count-british-english-dictionary
# Words present in both dictionaries
grep -Fxf ~/temp/american-english-dictionary ~/temp/british-english-dictionary > ~/results/common-english
# Words unique to each dictionary
grep -Fxvf ~/results/common-english ~/temp/american-english-dictionary > ~/results/unique-american-english
grep -Fxvf ~/results/common-english ~/temp/british-english-dictionary > ~/results/unique-british-english
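
For the original question (words in a.txt that are not in b.txt), the intermediate files above are not strictly needed; a single grep with the same flags may be enough. A minimal sketch with made-up sample data:

```shell
tmp=$(mktemp -d)
printf '%s\n' one two three four > "$tmp/a.txt"
printf '%s\n' three one > "$tmp/b.txt"

# -F: treat patterns as fixed strings, -x: match whole lines only,
# -f b.txt: read the patterns from b.txt, -v: print the lines that do NOT match.
extra_in_a=$(grep -Fxvf "$tmp/b.txt" "$tmp/a.txt")

echo "$extra_in_a"
rm -rf "$tmp"
```

This needs no sorting and keeps the order (and any duplicates) of a.txt.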

Answer 10

You can also use:

sdiff file1 file2

This displays the differences side by side within your terminal!

Answer 11

diff a.txt b.txt | grep '^<'

You can then pipe it through cut to get clean output:

diff a.txt b.txt | grep '^<' | cut -c 3-

Answer 12

Using awk for it. Test files:

$ cat a.txt
one
two
three
four
four
$ cat b.txt
three
two
one

The awk:

$ awk '
NR==FNR {                    # process b.txt  or the first file
    seen[$0]                 # hash words to hash seen
    next                     # next word in b.txt
}                            # process a.txt  or all files after the first
!($0 in seen)' b.txt a.txt   # if word is not hashed to seen, output it

Duplicates are output:

four
four

To avoid duplicates, add each newly met word in a.txt to the seen hash:

$ awk '
NR==FNR {
    seen[$0]
    next
}
!($0 in seen) {              # if word is not hashed to seen
    seen[$0]                 # hash unseen a.txt words to seen to avoid duplicates 
    print                    # and output it
}' b.txt a.txt

Output:

four

If the word lists are comma-separated, like:

$ cat a.txt
four,four,three,three,two,one
five,six
$ cat b.txt
one,two,three

you have to do a couple of extra laps (for loops):

awk -F, '                    # comma-separated input
NR==FNR {
    for(i=1;i<=NF;i++)       # loop all comma-separated fields
        seen[$i]
    next
}
{
    for(i=1;i<=NF;i++)
        if(!($i in seen)) {
             seen[$i]        # this time we buffer output (below):
             buffer=buffer (buffer==""?"":",") $i
        }
    if(buffer!="") {         # output unempty buffers after each record in a.txt
        print buffer
        buffer=""
    }
}' b.txt a.txt

Output this time:

four
five,six

12 answers

Answer 1: 372, 2014-02-13 09:10:26
Answer 2: 78, 2013-01-24 11:56:23
Answer 3: 36, 2017-10-15 14:16:47
Answer 4: 34, 2014-12-27 12:22:17
Answer 5: 31, 2013-01-24 11:57:16
Answer 6: 9, 2016-05-16 08:18:07
Answer 7: 6, 2018-06-06 12:34:15
Answer 8: 4, 2013-01-24 11:58:05
Answer 9: 1, accepted, 2013-01-24 13:28:24
Answer 10: 0, 2021-02-11 18:08:14
Answer 11: 0, 2021-12-10 00:04:17
Answer 12: -1, 2019-10-03 08:04:10
