Finding similar text files

Question

Does anyone have a particularly elegant command-line (Linux, OS X) way to identify "textually similar" files in a given directory?

By "textually similar", I mean that the files differ in at most N lines.

Answer 1

Here's one rough approach that uses a unified diff and wc to count the differing lines. Grep filters out the hunk headers and the file header lines that diff adds:

diff -U 0  file1 file2  | grep -v ^@ | grep -v ^--- | grep -v ^+++ | wc -l
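To apply this count to a whole directory, one option is to wrap the pipeline in a function and compare every pair of files. This is only a minimal sketch, not part of the original answer: the function names count_diff_lines and similar_pairs and the threshold argument are hypothetical, chosen for illustration.

```shell
#!/bin/bash

# Count the added and removed lines between two files
# (the same diff | grep | wc pipeline as above).
count_diff_lines() {
    diff -U 0 "$1" "$2" | grep -v -e '^@' -e '^---' -e '^+++' | wc -l
}

# Print "fileA fileB count" for every pair of regular files in
# directory $1 whose count is at most $2.
similar_pairs() {
    local dir="$1" max="$2" files=() f i j n
    for f in "$dir"/*; do
        [ -f "$f" ] && files+=("$f")
    done
    for ((i = 0; i < ${#files[@]}; i++)); do
        for ((j = i + 1; j < ${#files[@]}; j++)); do
            # $(( )) strips the padding some wc implementations add.
            n=$(( $(count_diff_lines "${files[i]}" "${files[j]}") ))
            if [ "$n" -le "$max" ]; then
                printf '%s %s %d\n' "${files[i]}" "${files[j]}" "$n"
            fi
        done
    done
}
```

Calling similar_pairs . 4 would list the pairs in the current directory with a count of four or less. Note that with this counting a modified line contributes two (one removal plus one addition), and that the number of comparisons grows quadratically with the number of files.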

Answer 2

Using awk

diff file1 file2 |awk '!/^<|^>|^-/{a=$0;lt[a]=0;gt[a]=0;next}    # hunk label (a line not starting with <, > or -): remember it and reset its lt and gt counters
     /^</{lt[a]++}                                               # a "<" line: count it in lt for the current hunk
     /^>/{gt[a]++}                                               # a ">" line: count it in gt for the current hunk
END{for (i in lt)
       sum+=lt[i]>gt[i]?lt[i]:gt[i]                              # per hunk, take the larger of the "<" and ">" counts and add it to sum
       printf "Files have differs in %d lines\n",sum             # report the total
       if (sum<3) {print "So files are similar" }
       else{print "So files are not similar"}
    }'

You can set the threshold yourself. In my command, "if (sum<3)" means that files differing in fewer than three lines are considered similar, and anything more is considered not similar.

Test result:

$ cat file1
a
b
a
d
b
c
c

$ cat file2
a
b
d
b
d
c
d
f

$ diff file1 file2
3d2
< a
5a5
> d
7,8c7,8
< c
<
---
> d
> f

$ diff file1 file2 |awk '!/^<|^>|^-/{a=$0;lt[a]=0;gt[a]=0;next}/^</{lt[a]++}/^>/{gt[a]++}END{for (i in lt) sum+=lt[i]>gt[i]?lt[i]:gt[i];printf "Files have differs in %d lines\n",sum;if (sum<3) {print "So files are similar" }else{print "So files are not similar"}}'

Files have differs in 4 lines
So files are not similar
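If the threshold needs to change from call to call, it can be passed to awk with -v instead of being hard-coded. The following is a minimal sketch under that assumption; the function name diff_is_similar and the variable max are hypothetical and do not come from the answer.

```shell
#!/bin/bash

# Print the number of differing lines between $1 and $2 and exit 0 if
# that number is at most $3, 1 otherwise. Per hunk, the larger of the
# "<" and ">" line counts is used, as in the awk command above.
diff_is_similar() {
    diff "$1" "$2" | awk -v max="$3" '
        !/^[<>-]/ { hunk = $0; lt[hunk] = 0; gt[hunk] = 0; next }  # hunk label such as 3d2
        /^</      { lt[hunk]++ }
        /^>/      { gt[hunk]++ }
        END {
            for (h in lt) sum += (lt[h] > gt[h] ? lt[h] : gt[h])
            print sum + 0
            exit (sum + 0 > max + 0)
        }'
}
```

Because the result is also reported through the exit status, the function can be used directly in a condition, for example: if diff_is_similar file1 file2 2 >/dev/null; then echo similar; fi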

Answer 3

Maybe PMD is what you are looking for: https://pmd.github.io

It is maintained, and it is simple to use.

You may want its duplicated-code detection: https://pmd.github.io/pmd-5.5.5/usage/cpd-usage.html (Your question doesn't make clear whether you are targeting code or plain text, but I don't see why it shouldn't work in both cases.)

Answer 4

Using Terraform means having a lot of files that were copied from other files with only a few changes made. It's really frustrating to figure out where a file was copied from when you want to see what's special about it. I made a tool I call similarities.sh to help me see how similar a file is to each file in a group of others.

#!/bin/bash

fileA="$1"
shift
for fileB in "$@"; do
    # Run diff once, then count the "<" and ">" lines from the captured output.
    diffOut=$(diff "$fileA" "$fileB")
    diffA=$(grep -c '^< ' <<< "$diffOut")   # lines only in fileA
    diffB=$(grep -c '^> ' <<< "$diffOut")   # lines only in fileB
    printf "The files %s and %s have %s:%s diffs out of %s:%s lines.\n" \
        "$fileA" "$fileB" "$diffA" "$diffB" \
        "$(wc -l < "$fileA")" "$(wc -l < "$fileB")"
done | column -t

Here it is in action:

$ similarities.sh terraform.tfvars ../*/terraform.tfvars
The  files  terraform.tfvars  and  ../api_proxy/terraform.tfvars                   have  3:3   diffs  out  of  51:51  lines.
The  files  terraform.tfvars  and  ../cf-ip-location-lookup/terraform.tfvars       have  4:12  diffs  out  of  51:59  lines.
The  files  terraform.tfvars  and  ../cf-region-cookie-setter/terraform.tfvars     have  4:8   diffs  out  of  51:55  lines.
The  files  terraform.tfvars  and  ../cf-switch-region-origin/terraform.tfvars     have  4:10  diffs  out  of  51:57  lines.
The  files  terraform.tfvars  and  ../reformat_devops_alerts/terraform.tfvars      have  0:0   diffs  out  of  51:51  lines.
The  files  terraform.tfvars  and  ../restart_location/terraform.tfvars            have  17:3  diffs  out  of  51:37  lines.
The  files  terraform.tfvars  and  ../warehouse-availability-etl/terraform.tfvars  have  3:3   diffs  out  of  51:51  lines.
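When the goal is to find the file something was most likely copied from, it can help to sort the candidates by how much they differ. Here is a minimal sketch of that idea; the function name rank_by_similarity is hypothetical, and adding the "<" and ">" counts into one total is an assumption made for illustration rather than part of similarities.sh.

```shell
#!/bin/bash

# Print "<total differing lines> <file>" for each candidate file,
# closest match first. $1 is the reference file; the rest are candidates.
rank_by_similarity() {
    local fileA="$1" fileB total
    shift
    for fileB in "$@"; do
        # Count the "<" and ">" lines of a normal-format diff together.
        total=$(diff "$fileA" "$fileB" | grep -c '^[<>] ')
        printf '%d %s\n' "$total" "$fileB"
    done | sort -n
}
```

For example, rank_by_similarity terraform.tfvars ../*/terraform.tfvars | head -n 3 would show the three closest candidates.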

4 answers

Answer 1
1 2013-12-26 16:55:33

Answer 2
1 accepted 2013-12-27 01:11:57

Answer 3
0 2017-03-28 13:19:15

Answer 4
0 2019-02-08 07:46:22