awk + bash: combining arbitrary number of files

I have a script that takes a number of data files with identical layout but different data and combines a specified data column into a new file, like this:

gawk '{
        names[$1]= 1;
        data[$1,ARGIND]= $2
} END {
        for (i in names) print i"\t"data[i,1]"\t"data[i,2]"\t"data[i,3]
}' $1 $2 $3 > combined_data.txt

... where the row IDs can be found in the first column, and the interesting data in the second column.
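For illustration, if the wrapper holding that snippet were called combine.sh (a made-up name), it would be invoked with the data files as positional arguments:

./combine.sh file.1 file.2 file.3    # writes combined_data.txt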

This works nicely, but not for an arbitrary number of files. While I could simply add $4 $5 ... $n to the last line up to whatever maximum number of files I think I need, and add a matching "\t"data[i,4]"\t"data[i,5] ... "\t"data[i,n] to the line above it (which does seem to work even when fewer than n files are given; awk seems to disregard the indices beyond the number of input files in those cases), this seems like an "ugly" solution. Is there a way to make this script (or something that gives the same result) take an arbitrary number of input files?

Or, even better, can you somehow incorporate a find in there that searches through subfolders and finds files matching some criterion?

Here is some sample data:

file.1:

A      554
B       13
C      634
D       84
E        9

file.2:

C      TRUE
E      TRUE
F      FALSE

expected output:

A      554
B       13
C      634       TRUE
D       84
E        9       TRUE
F                FALSE

This may be what you're looking for (it uses GNU awk for ARGIND just like your original script):

$ cat tst.awk
BEGIN { OFS="\t" }
!seen[$1]++ { keys[++numKeys]=$1 }
{ vals[$1,ARGIND]=$2 }
END {
    for (rowNr=1; rowNr<=numKeys; rowNr++) {
        key = keys[rowNr]
        printf "%s%s", key, OFS
        for (colNr=1; colNr<=ARGIND; colNr++) {
            printf "%s%s", vals[key,colNr], (colNr<ARGIND?OFS:ORS)
        }
    }
}

$ awk -f tst.awk file1 file2
A       554
B       13
C       634     TRUE
D       84
E       9       TRUE
F               FALSE

If you don't care about the order the rows are output in, then all you need is:

BEGIN { OFS="\t" }
{ vals[$1,ARGIND]=$2; keys[$1] }
END {
    for (key in keys) {
        printf "%s%s", key, OFS
        for (colNr=1; colNr<=ARGIND; colNr++) {
            printf "%s%s", vals[key,colNr], (colNr<ARGIND?OFS:ORS)
        }
    }
}
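A hedged usage sketch: if that variant is saved as, say, tst2.awk (a made-up name), it is invoked the same way as before; note that for (key in keys) visits the keys in an unspecified order, so the rows may come out shuffled:

$ awk -f tst2.awk file1 file2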

You can access an arbitrary number of files via redirected getline on the ARGV list, bypassing awk's default file processing (via BEGIN and exit):

awk 'BEGIN {
  for (i=1; i<ARGC; ++i) {              # ARGV[1]..ARGV[ARGC-1] hold the file names
    while ((getline < ARGV[i]) > 0) {   # test > 0 so an error does not loop forever
      ...
      }
    close(ARGV[i])
    }
  <END-type code>
  exit}' $(find -type f ...)
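For reference, a minimal sketch of that skeleton with the placeholders filled in so it reproduces the combine logic from the question; the find criterion (-name 'file.*') is only an assumed example of "matching some criterion", not part of the original answer:

awk 'BEGIN {
  OFS = "\t"
  nfiles = ARGC - 1                      # ARGV[1]..ARGV[ARGC-1] are the input files
  for (i = 1; i < ARGC; i++) {
    while ((getline < ARGV[i]) > 0) {    # read each file explicitly
      names[$1]                          # remember the row ID
      data[$1, i] = $2                   # remember this file's value for that ID
    }
    close(ARGV[i])
  }
  for (key in names) {                   # output order is unspecified here
    printf "%s", key
    for (i = 1; i <= nfiles; i++)
      printf "%s%s", OFS, data[key, i]
    print ""
  }
  exit
}' $(find . -type f -name 'file.*')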

Supposing this naming scheme for the input files: 1 2 ...

   gawk '{ 
        names[$1]=$1
        data[$1,ARGIND]=$2
      } 
      END {
        for (i in names) {
           printf("%s\t",i)
           for (x=1;x<=ARGIND;x++) {
             printf("%s\t", data[i,x])
             }
           print ""
           }
       }' [0-9]* > combined_data.txt

Results:

A   554 
B   13  
C   634 TRUE
D   84  
E   9   TRUE
F       FALSE

Another solution uses join, bash, awk and tr, and works if file1, file2, file3, etc. are sorted on the join field. (A pre-sorting sketch follows the two scripts below.)

multijoin.sh

#!/bin/bash
function __t { 
  join -a1 -a2 -o '1.1 2.1 1.2 2.2' - "$1" | 
  awk -vFS='[ ]' '{print ($1!=""?$1:$2),$3"_"$4;}'; 
}
CMD="cat '$1'"
for i in `seq 2 $#`; do
  CMD="$CMD | __t '${@:$i:1}'";
done
eval "$CMD | tr '_' '\t' | tr ' ' '\t'";

or, a recursive version:

#!/bin/bash
function __t { 
  join -a1 -a2 -o '1.1 2.1 1.2 2.2' - "$1" | 
  awk -vFS='[ ]' '{print ($1!=""?$1:$2),$3"_"$4;}'; 
}
function __r { 
  if [[ "$#" -gt 1 ]]; then
    __t "$1" | __r "${@:2}"; 
  else
    __t "$1"; 
  fi
}
__r "${@:2}" < "$1" | tr '_' '\t' | tr ' ' '\t'

NOTE: the data cannot contain the character _, as it is used internally as a placeholder.

you get:

./multijoin file1 file2
A   554
B   13
C   634 TRUE
D   84
E   9   TRUE
F       FALSE

for example, if file3 contains:

A    111
D    222
E    333
./multijoin file1 file2 file3

you get:

A   554       111
B   13      
C   634 TRUE    
D   84        222
E   9   TRUE  333
F       FALSE
