
File comparison with multiple columns

I am doing a directory cleanup to check for files that are not being used in our testing environment. I have a list of all the file names, sorted alphabetically, in a text file, and another file I want to compare against.

Here is how the first file is set up:

test1.pl
test2.pl
test3.pl

It is a simple text file, one script name per line, listing all the scripts in the directory that I want to clean up based on the other file below.

The file I want to compare against is a tab-delimited file which lists the scripts that each server runs as tests, so there are obviously many duplicates. I want to strip out the testing script names from this file, write them out to another file, and use sort and uniq so that I can diff that file with the one above to see which testing scripts are not being used.

The file is set up as such:

server: : test1.pl test2.pl test3.pl test4.sh test5.sh

There are some lines with fewer entries and some with more. My first impulse was to write a Perl script to split each line and push the values into a list if they are not already there, but that seems wholly inefficient. I am not too experienced with awk, but I figure there is more than one way to do it. Any other ideas to compare these files?

This uses awk to rearrange the file names from the second file into one per line, then diffs the output against the first file:

diff file1 <(awk '{ for (i=3; i<=NF; i++) print $i }' file2 | sort -u)
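A runnable sketch of that one-liner on sample data matching the question (the file names are hypothetical, and the process substitution requires bash):

```shell
# Recreate the question's two files (hypothetical sample data).
printf 'test1.pl\ntest2.pl\ntest3.pl\n' > file1
printf 'server: : test1.pl test2.pl test3.pl test4.sh test5.sh\n' > file2

# Fields 3..NF are the script names; print one per line, dedupe, then diff.
# diff exits non-zero when the files differ, hence the || true.
diff file1 <(awk '{ for (i=3; i<=NF; i++) print $i }' file2 | sort -u) || true
```

Here diff reports test4.sh and test5.sh as present only in the server file.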

A Perl solution that makes a %needed hash of the files being used by the servers and then checks against the file containing all the file names.

#!/usr/bin/perl
use strict;
use warnings;
use Inline::Files;

my %needed;
while (<SERVTEST>) {
    chomp;
    my (undef, @files) = split /\t/;
    @needed{ @files } = (1) x @files;
}

while (<TESTFILES>) {
    chomp;
    if (not $needed{$_}) {
        print "Not needed: $_\n";   
    }
}

__TESTFILES__
test1.pl
test2.pl
test3.pl
test4.pl
test5.pl
__SERVTEST__
server1::   test1.pl    test3.pl
server2::   test2.pl    test3.pl
__END__
*** prints

C:\Old_Data\perlp>perl t7.pl
Not needed: test4.pl
Not needed: test5.pl

Quick and dirty script to do the job. If it works for you, use open to read the files with proper error checking.

use strict;
use warnings;

my @server_lines = `cat server_file`;
chomp(@server_lines);
my @test_file_lines = `cat test_file_lines`;
chomp(@test_file_lines);

foreach my $server_line (@server_lines) {
    $server_line =~ s!server: : !!;
    my @files_to_check = split /\s+/, $server_line;
    foreach my $file_to_check (@files_to_check) {
        # \Q...\E quotes regex metacharacters in the file name
        my @found = grep { /\Q$file_to_check\E/ } @test_file_lines;
        if (!@found) {
            print "$file_to_check is not found in the test file list\n";
        }
    }
}

If I understand your need correctly, you have a file with a list of tests (testfiles.txt):

test1.pl
test2.pl 
test3.pl
test4.pl
test5.pl

And a file with a list of servers, with the files they each test (serverlist.txt):

server1:        :       test1.pl        test3.pl
server2:        :       test2.pl        test3.pl

(Where I have assumed all of the whitespace consists of tabs.)

If you convert the second file into a list of tested files, you can then compare it to your original file using diff.

cut -d: -f3 serverlist.txt | sed -e 's/^\t//g' | tr '\t' '\n' | sort -u > tested_files.txt

The cut removes the server name and ':', the sed removes the leading tab left behind, tr then converts the remaining tabs into newlines, and then we do a unique sort to sort and remove duplicates. This is output to tested_files.txt.

Then all you do is diff testfiles.txt tested_files.txt.
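A runnable sketch of the whole pipeline on the (hypothetical) sample files above, written with literal tab characters; note that the `\t` in sed is a GNU sed extension, so on BSD sed you may need a literal tab instead:

```shell
# Build serverlist.txt and testfiles.txt with real tabs between fields.
printf 'server1:\t:\ttest1.pl\ttest3.pl\n'  > serverlist.txt
printf 'server2:\t:\ttest2.pl\ttest3.pl\n' >> serverlist.txt
printf 'test1.pl\ntest2.pl\ntest3.pl\ntest4.pl\ntest5.pl\n' > testfiles.txt

# Field 3 (colon-delimited) holds the tab-separated test names.
cut -d: -f3 serverlist.txt | sed -e 's/^\t//g' | tr '\t' '\n' | sort -u > tested_files.txt

# diff exits non-zero on differences; here it flags test4.pl and test5.pl.
diff testfiles.txt tested_files.txt || true
```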

It's hard to tell since you didn't post the expected output, but is this what you're looking for?

$ cat file1
test1.pl
test2.pl
test3.pl
$
$ cat file2
server: : test1.pl test2.pl test3.pl test4.sh test5.sh
$
$ gawk -v RS='[[:space:]]+' 'NR==FNR{f[$0]++;next} FNR>2 && !f[$0]' file1 file2
test4.sh
test5.sh
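For comparison, the same check can be sketched with POSIX tools only, swapping gawk for comm (comm -13 prints the lines unique to its second, sorted input; file names as in the transcript above):

```shell
printf 'test1.pl\ntest2.pl\ntest3.pl\n' > file1
printf 'server: : test1.pl test2.pl test3.pl test4.sh test5.sh\n' > file2

# Split file2 into one token per line, drop the "server:" and ":" tokens,
# dedupe, and keep only the names missing from file1 (already sorted).
tr ' ' '\n' < file2 | tail -n +3 | sort -u | comm -13 file1 -
```

This prints test4.sh and test5.sh, matching the gawk output.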
