[英]Find similar text files
Does anyone have a particularly elegant command line (linux, OS X) way to identify "textually similar" files in a given directory? 有没有人有一个特别优雅的命令行(linux,OS X)方式来识别给定目录中的“文本类似”文件?
By "textually similar", I mean that the files should only differ in N number of lines. 通过“文本相似”,我的意思是文件应该只有N行数不同。
Here's one rough approach using unified diff
and wc
to count the different lines. 这是使用统一
diff
和wc
计算不同线条的一种粗略方法。 Grep
is used to filter out the diff context: Grep
用于过滤掉diff上下文:
diff -U 0 file1 file2 | grep -v ^@ | grep -v ^--- | grep -v ^+++ | wc -l
Using awk 使用awk
diff file1 file2 |awk '!/^<|^>|^-/{a=$0;lt[a]=0;gt[a]=0;next} # Use label (not start from <,>,---) and set the array lt and gt
/</{lt[a]++} # if has differ "<", sum it into array lt
/>/{gt[a]++} # if has differ ">", sum it into array gt
END{for (i in lt)
sum+=lt[i]>gt[i]?lt[i]:gt[i] # compare "<" or ">" lines, take the max and add in variable sum
printf "Files have differs in %d lines\n",sum # Do the print job.
if (sum<3) {print "So files are similar" }
else{print "So files are not similar"}
}'
You can define the number by yourself, for example, in my command if there are differs in two lines "if (sum<3)", I will think these files are not similar. 您可以自己定义数字,例如,在我的命令中如果两行“if(sum <3)”不同,我会认为这些文件不相似。
Test result. 测试结果。
$ cat file1
a
b
a
d
b
c
c
$ cat file2
a
b
d
b
d
c
d
f
$ diff file1 file2
3d2
< a
5a5
> d
7,8c7,8
< c
<
---
> d
> f
$ diff file1 file2 |awk '!/^<|^>|^-/{a=$0;lt[a]=0;gt[a]=0;next}/</{lt[a]++}/>/{gt[a]++}END{for (i in lt) sum+=lt[i]>gt[i]?lt[i]:gt[i];printf "Files have differs in %d lines\n",sum;if (sum<3) {print "So files are similar" }else{print "So files are not similar"}}'
Files have differs in 4 lines
So files are not similar
Maybe PMD is what your are looking for: https://pmd.github.io 也许PMD正是您所寻找的: https : //pmd.github.io
It's maintained, and the usage is simple. 它维护,使用简单。
You may want the duplicated code detection: https://pmd.github.io/pmd-5.5.5/usage/cpd-usage.html (It's not clear in your question if you target code or simple plain text, but I don't see why it shouldn't work in both case). 您可能需要重复的代码检测: https : //pmd.github.io/pmd-5.5.5/usage/cpd-usage.html (在您的问题中,如果您定位代码或简单的纯文本,我不清楚,但我不知道不知道为什么它不应该在这两种情况下都有效。
Using Terraform means having a lot of files that are copied from other files and only a few changes made. 使用Terraform意味着拥有许多从其他文件复制的文件,只进行了一些更改。 It's really frustrating to figure out where a file was copied from when you want to see what's special about it.
当您想要查看文件的特殊之处时,弄清楚文件的复制位置真的很令人沮丧。 I made a tool I call
similarities.sh
to help me identify how similar a file is to each file in a group of others. 我创建了一个名为
similarities.sh
的工具来帮助我识别文件与其他文件中每个文件的相似程度。
#!/bin/bash
fileA="$1"
shift
for fileB in "$@"; do
(
# diff once grep twice with the help of tee and stderr
diff $fileA $fileB | \
tee >(grep -cE '^< ' >&2) | \
grep -cE '^> ' >&2
# recapture stderr
) 2>&1 | (
read -d '' diffA diffB;
printf "The files %s and %s have %s:%s diffs out of %s:%s lines.\n" \
$fileA $fileB $diffA $diffB $(wc -l < $fileA) $(wc -l < $fileB)
)
done | column -t
Here it is in action: 这是在行动:
$ similarities.sh terraform.tfvars ../*/terraform.tfvars
The files terraform.tfvars and ../api_proxy/terraform.tfvars have 3:3 diffs out of 51:51 lines.
The files terraform.tfvars and ../cf-ip-location-lookup/terraform.tfvars have 4:12 diffs out of 51:59 lines.
The files terraform.tfvars and ../cf-region-cookie-setter/terraform.tfvars have 4:8 diffs out of 51:55 lines.
The files terraform.tfvars and ../cf-switch-region-origin/terraform.tfvars have 4:10 diffs out of 51:57 lines.
The files terraform.tfvars and ../reformat_devops_alerts/terraform.tfvars have 0:0 diffs out of 51:51 lines.
The files terraform.tfvars and ../restart_location/terraform.tfvars have 17:3 diffs out of 51:37 lines.
The files terraform.tfvars and ../warehouse-availability-etl/terraform.tfvars have 3:3 diffs out of 51:51 lines.
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.