Compare `n` plaintext files and print number of unique lines per file

I have n plaintext files, each containing lines of text.
Some lines are duplicated between some of the files.
Is there a method in bash where I can compare the files and print out how many unique lines each file has when compared to the other files?

Example:

# file1
1
2
3
10

# file2
2
10
50
3

# file3
100
2
1
40
6

I'm basically looking for a solution that would say something similar to:
$filename:$unique_lines

One solution using grep, sort, tr and uniq (for n > 1):

$ grep ^ file[123] | tr : ' ' | sort -k2 | uniq -f 1 -u
file3 100
file3 40
file2 50
file3 6
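
To reduce that to the $filename:$unique_lines format asked for, one option (not part of the original answer) is to tally the first field of that output with awk; note that a file with no unique lines, such as file1 here, simply won't appear:

$ grep ^ file[123] | tr : ' ' | sort -k2 | uniq -f 1 -u | awk '{c[$1]++} END {for (f in c) print f ":" c[f]}'
file2:1
file3:3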

Another using GNU awk:

$ awk '{
    a[$0]++
    f[FILENAME][FNR]=$0
}
END {
    for(i in f)
        for(j in f[i])
            if(a[f[i][j]]==1)
                print i,f[i][j]
}' file[123]
file2 50
file3 100
file3 40
file3 6
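
If you want the counts directly, the same idea can be adapted to print $filename:$unique_lines (a sketch based on the script above; it still needs GNU awk, and the output order is not guaranteed):

$ awk '{
    a[$0]++
    f[FILENAME][FNR]=$0
}
END {
    for(i in f) {
        n=0
        for(j in f[i])
            if(a[f[i][j]]==1)
                n++
        print i ":" n
    }
}' file[123]
file1:0
file2:1
file3:3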

For any two files, say file1 and file2, you can output the unique lines in file1 (i.e., lines in file1 that do not appear in file2), as follows:

> fgrep -vx -f file2 file1
1

Other examples using your file1, file2, and file3:

> fgrep -vx -f file3 file1  # Show lines in file1 that do not appear in file3
3
10

> fgrep -vx -f file2 file3  # Show lines in file3 that do not appear in file2
100
1
40
6

Note that on most, if not all, systems, fgrep is really just a synonym for grep -F, where the -F tells grep to compare fixed strings instead of trying to match a regular expression. So if you don't have fgrep for some reason, you should be able to use grep -Fvx instead of fgrep -vx.
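
For example, the first command above could equivalently be written as:

> grep -Fvx -f file2 file1
1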

With multiple files to compare against, it gets trickier, but for any given file, you can keep a running list of unique lines in a temporary file, and then whittle it down by comparing the temp file to each other file one at a time:

# Show all lines in file3 that do not exist in file1 or file2
fgrep -vx -f file1 file3 > file3_unique
fgrep -vx -f file2 file3_unique
100
40
6

Since all you want is a count of the unique lines, you can just pipe that last command to wc -l:

> fgrep -vx -f file2 file3_unique | wc -l
3

If you do this with more than 3 files, you will find that you need to use an extra temp file. Let's suppose you had a file4 :

> cat file4
1
3
40
6

That means you would need a third fgrep command to finish whittling down the list of unique lines. If you simply redirect the output back to the same file, you'll run into a problem:

# Show all lines in file3 that do not exist in file1, file2, or file4
> fgrep -vx -f file1 file3         > file3_unique
> fgrep -vx -f file2 file3_unique  > file3_unique
grep: input file 'file3_unique' is also the output 

In other words, you can't redirect the results back to the same file that grep is reading from. So you need to output to a separate temp file each time, and then rename it afterwards:

# Show all lines in file3 that do not exist in file1, file2, or file4
> fgrep -vx -f file1 file3         > temp
> mv temp file3_unique
> fgrep -vx -f file2 file3_unique  > temp
> mv temp file3_unique
> fgrep -vx -f file4 file3_unique
100

Note that I left off the | wc -l on the last line just to show that it works as expected.
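
With the count, that last step would simply be:

> fgrep -vx -f file4 file3_unique | wc -l
1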

Of course, if your number of files is arbitrary, you'll want to do the comparisons in a loop:

files=( file* )
for ((i=0; i<${#files[@]}; ++i)); do
  cp -f "${files[i]}" unique
  for ((j=0; j<${#files[@]}; ++j)); do
     if (( j != i )); then
       fgrep -vx -f "${files[j]}" unique > temp
       mv temp unique
     fi
  done
  echo "${files[i]}:$(wc -l <unique)"
  rm unique
done

This would produce the output:

file1:0
file2:1
file3:1
file4:0

If temp and unique are existing files or directories, you might want to consider using mktemp instead. For example:

unique=$(mktemp)
temp=$(mktemp)

fgrep -vx -f file2 file3 > "$temp"
mv "$temp" "$unique"

This way, the actual files will be something like /tmp/tmp.rFItj3sHVQ, etc., and you won't accidentally overwrite anything named temp or unique in the directory where you run this code.
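
Putting the two ideas together, the arbitrary-file loop above might look like this with mktemp and a cleanup trap (just a sketch; unique and temp now hold mktemp-generated paths instead of fixed filenames):

files=( file* )
unique=$(mktemp)
temp=$(mktemp)
trap 'rm -f "$unique" "$temp"' EXIT   # remove the temporary files when the shell exits

for ((i=0; i<${#files[@]}; ++i)); do
  cp -f "${files[i]}" "$unique"
  for ((j=0; j<${#files[@]}; ++j)); do
    if (( j != i )); then
      fgrep -vx -f "${files[j]}" "$unique" > "$temp"
      mv "$temp" "$unique"
    fi
  done
  echo "${files[i]}:$(wc -l <"$unique")"
done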

Update: Just for kicks, I decided to shrink this down a bit. For one thing, I'm not overly fond of the nested loop, or the temporary files. Here's a version that gets rid of both. This improvement is based on the observation that whittling down, say, file1 by comparing against file2, file3, and file4 in succession is the same thing as doing a single comparison between file1 and the concatenation of file2 + file3 + file4. The trick then is just figuring out how to concatenate every other file without looping. But it turns out you can actually do that fairly easily in bash with array splicing. For example:

files=( file1 file2 file3 file4 )

# Concatenate all files *except* ${files[2]}, i.e., file3
> cat "${files[@]:0:2}" "${files[@]:3}"
1
2
3
10
2
10
50
3
1
3
40
6

Combining this with the previous solution, we can replace the inner loop, and the temp files, with a single line:

files=(file1 file2 file3 file4)
for ((i=0; i<${#files[@]}; ++i)); do
  echo "${files[i]}:$(fgrep -vxc -f <(cat "${files[@]:0:i}" "${files[@]:i+1}") <(sort -u "${files[i]}"))"
done
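
Here sort -u removes any duplicate lines within the file being tested, and the -c flag makes fgrep print just the count of non-matching lines rather than the lines themselves; run on the four example files, this should print the same counts as the loop above (file1:0, file2:1, file3:1, file4:0).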
