
awk to compare two files and print formatted output

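I want to compare two files based on the first field ($1) of each file.

Matching lines (key present in both Aug.csv and Sep.csv) should be printed from both files, with "Matched" as the remark in the last field.

For non-matching lines from Aug.csv (present in Aug.csv but not in Sep.csv), print a placeholder such as "NOT" once for each field (NF) of Sep.csv, i.e. five times, "NOT,NOT,NOT,NOT,NOT", and put "Not in Sep.csv" (or the FILENAME) in the last field as the remark.

For non-matching lines from Sep.csv (present in Sep.csv but not in Aug.csv), print a placeholder such as "NOT" once for each field (NF) of Aug.csv, i.e. four times, "NOT,NOT,NOT,NOT", and put "Not in Aug.csv" (or the FILENAME) in the last field as the remark.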

Aug.csv

Name,Age,Place,Des
aaa,40,xxx,Aug
aaa,20,yyy,Aug
ccc,35,xxx,Aug

Sep.csv

Name,Age,Place,Edu,Des
aaa,50,zzz,eee,Sep
bbb,30,xxx,yyy,Sep
aaa,60,yyy,fff,Sep
bbb,50,yyy,fff,Sep

Expected Output.csv

Name,Age,Place,Des,Name,Age,Place,Edu,Des,Remarks
aaa,40,xxx,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,40,xxx,Aug,aaa,60,yyy,fff,Sep,Matched
aaa,20,yyy,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,20,yyy,Aug,aaa,60,yyy,fff,Sep,Matched
NOT,NOT,NOT,NOT,bbb,30,xxx,yyy,Sep,Not in Aug.csv
NOT,NOT,NOT,NOT,bbb,50,yyy,fff,Sep,Not in Aug.csv
ccc,35,xxx,Aug,NOT,NOT,NOT,NOT,NOT,Not in Sep.csv

I tried the two commands below to get the desired output, without success.

First command:

 awk -v first="NOT,NOT,NOT,NOT"  -v second="NOT,NOT,NOT,NOT,NOT" -F"," 'NR==FNR{a[$1]=$0;next}{if (a[$1])print a[$1],$0,"Matched";else print first, $0,"Not in Aug.csv";}' OFS="," Aug.csv Sep.csv >Output.csv

Second command:

awk -v first="NOT,NOT,NOT,NOT"  -v second="NOT,NOT,NOT,NOT,NOT" -F"," 'NR==FNR{a[$1]=$0;next} !($1 in a) {print $0,second,"Not in Sep.csv";}' OFS="," Sep.csv Aug.csv  >>Output.csv  

The commands above produce the following Output.csv:

Name,Age,Place,Des,Name,Age,Place,Edu,Des,Matched
aaa,20,yyy,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,20,yyy,Aug,aaa,60,yyy,fff,Sep,Matched
NOT,NOT,NOT,NOT,bbb,30,xxx,yyy,Sep,Not in Aug.csv
NOT,NOT,NOT,NOT,bbb,50,yyy,fff,Sep,Not in Aug.csv
ccc,35,xxx,Aug,NOT,NOT,NOT,NOT,NOT,Not in Sep.csv

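The following two matched lines (for Aug.csv) from the expected output are missing here. Please advise how to handle this... it seems that duplicate entries are being ignored.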

aaa,40,xxx,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,40,xxx,Aug,aaa,60,yyy,fff,Sep,Matched

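I would also like to know how to build the variables first and second (i.e. awk -v first="NOT,NOT,NOT,NOT" -v second="NOT,NOT,NOT,NOT,NOT") dynamically from the number of fields/headers in Aug.csv and Sep.csv. The original files contain more fields, and the count varies each time (10 fields, 15 fields, and so on), so I do not want to type "NOT" ten times by hand. Alternatively, is there any way to REPEAT the "FS" based on the number of fields in the original file, so that my output is formatted as below?

Expected Output.csv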

Name,Age,Place,Des,Name,Age,Place,Edu,Des,Remarks
aaa,40,xxx,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,40,xxx,Aug,aaa,60,yyy,fff,Sep,Matched
aaa,20,yyy,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,20,yyy,Aug,aaa,60,yyy,fff,Sep,Matched
,,,,bbb,30,xxx,yyy,Sep,Not in Aug.csv
,,,,bbb,50,yyy,fff,Sep,Not in Aug.csv
ccc,35,xxx,Aug,,,,,,Not in Sep.csv

Please advise; looking forward to your suggestions...

A GNU awk solution (somewhat involved):

compare.awk script:

# Return "NOT" repeated n times, joined by FS (r is a local variable).
function prNot(n,    r) {
    r = "NOT"
    while (--n > 0) r = r FS "NOT"
    return r
}
BEGIN{ FS=OFS="," }
NR==FNR{ 
    if (NR==1) { 
        sep_nf=NF; sep_fn=FILENAME; h=$0 
    } else { 
        sep[$1][++c]=$2; 
        for(i=3;i<=NF;i++){ sep[$1][c]=sep[$1][c] FS $i } 
    }
    next 
}
FNR==1{ 
    aug_nf=NF; aug_fn=FILENAME; print $0,h,"Remarks"; next 
}
$1 in sep{ matched[$1]; for(i in sep[$1]) print $0,$1,sep[$1][i],"Matched" }
!($1 in sep){ print $0,prNot(sep_nf),"Not in "sep_fn }
END{ 
    for(i in sep) 
        if (!(i in matched)) { 
            for(j in sep[i]) print prNot(aug_nf),i,sep[i][j],"Not in "aug_fn 
        }  
}

Usage:

awk -f compare.awk Sep.csv Aug.csv

Output:

Name,Age,Place,Des,Name,Age,Place,Edu,Des,Remarks
aaa,40,xxx,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,40,xxx,Aug,aaa,60,yyy,fff,Sep,Matched
aaa,20,yyy,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,20,yyy,Aug,aaa,60,yyy,fff,Sep,Matched
ccc,35,xxx,Aug,NOT,NOT,NOT,NOT,NOT,Not in Sep.csv
NOT,NOT,NOT,NOT,bbb,30,xxx,yyy,Sep,Not in Aug.csv
NOT,NOT,NOT,NOT,bbb,50,yyy,fff,Sep,Not in Aug.csv
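
The question also asks for a variant where the placeholders are empty fields (",,,,") rather than "NOT". A minimal sketch of a generic repeat helper, assuming portable awk; the names placeholders and rep are hypothetical and not part of the script above. It reads a CSV header on stdin and prints the given word once per column:

```shell
# placeholders WORD: read a CSV header on stdin and print WORD once per
# column, joined by commas. WORD may be empty to get bare separators.
placeholders() {
    awk -F, -v word="$1" '
    # rep(k, s, sep): s repeated k times, joined by sep (r and i are locals)
    function rep(k, s, sep,    r, i) {
        for (i = 1; i <= k; i++) r = (i > 1 ? r sep : "") s
        return r
    }
    NR == 1 { print rep(NF, word, FS); exit }'
}

printf 'Name,Age,Place,Edu,Des\n' | placeholders NOT   # NOT,NOT,NOT,NOT,NOT
printf 'Name,Age,Place,Des\n'     | placeholders ""    # ,,,
```

Four empty fields need only three separators, which is why the second call prints three commas; the separators on either side of the block come from OFS when the line is printed.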

Using GNU awk for true multi-dimensional arrays:

$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 {
    for (i=1; i<=NF; i++) {
        nots[ARGIND] = (i>1 ? nots[ARGIND] OFS : "") "NOT"
    }
}
NR==FNR {
    file1[$1][++cnt[$1]] = $0
    next
}
{
    file2[$1]
    if ($1 in file1) {
        for (num in file1[$1]) {
            print file1[$1][num], $0, (FNR>1 ? "Matched" : "Remarks")
        }
    }
    else {
        print nots[1], $0, "Not in " ARGV[1]
    }
}
END {
    for (name in file1) {
        if ( !(name in file2) ) {
            for (num in file1[name]) {
                print file1[name][num], nots[2], "Not in " ARGV[2]
            }
        }
    }
}

$ awk -f tst.awk Aug.csv Sep.csv
Name,Age,Place,Des,Name,Age,Place,Edu,Des,Remarks
aaa,40,xxx,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,20,yyy,Aug,aaa,50,zzz,eee,Sep,Matched
NOT,NOT,NOT,NOT,bbb,30,xxx,yyy,Sep,Not in Aug.csv
aaa,40,xxx,Aug,aaa,60,yyy,fff,Sep,Matched
aaa,20,yyy,Aug,aaa,60,yyy,fff,Sep,Matched
NOT,NOT,NOT,NOT,bbb,50,yyy,fff,Sep,Not in Aug.csv
ccc,35,xxx,Aug,NOT,NOT,NOT,NOT,NOT,Not in Sep.csv

If the output order matters, there are various ways to handle that...
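
One possible way to get a fixed order, as a sketch rather than the answerer's method: buffer both files in arrays indexed by row number and print everything from the END block. This uses only portable awk features (no arrays of arrays, no ARGIND). The function name ordered_join and the array names are hypothetical, and the nested loop is quadratic, so it assumes files small enough to hold in memory.

```shell
# ordered_join FILE1 FILE2: matched pairs first (FILE1 order, then FILE2
# order), then rows found only in FILE2, then rows found only in FILE1.
ordered_join() {
    awk -F, -v OFS=, '
    # rep(k, s): s repeated k times, joined by OFS (r and i are locals)
    function rep(k, s,    r, i) {
        for (i = 1; i <= k; i++) r = (i > 1 ? r OFS : "") s
        return r
    }
    # Header line of each file: remember header, field count and file name.
    FNR == 1 { f++; hdr[f] = $0; nf[f] = NF; fname[f] = FILENAME; next }
    # Data line: store the row and its key under a running row number.
    { cnt[f]++; row[f, cnt[f]] = $0; key[f, cnt[f]] = $1; seen[f, $1] }
    END {
        print hdr[1], hdr[2], "Remarks"
        for (i = 1; i <= cnt[1]; i++)
            for (j = 1; j <= cnt[2]; j++)
                if ((key[1, i] "") == (key[2, j] ""))   # compare as strings
                    print row[1, i], row[2, j], "Matched"
        for (j = 1; j <= cnt[2]; j++)
            if (!((1, key[2, j]) in seen))
                print rep(nf[1], "NOT"), row[2, j], "Not in " fname[1]
        for (i = 1; i <= cnt[1]; i++)
            if (!((2, key[1, i]) in seen))
                print row[1, i], rep(nf[2], "NOT"), "Not in " fname[2]
    }' "$1" "$2"
}
```

Called as ordered_join Aug.csv Sep.csv on the sample files, this should print the lines in the same order as the expected output in the question: the four matched lines, then the two bbb lines, then the ccc line.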

