awk + bash: combining arbitrary number of files
I have a script that takes a number of data files with identical layout but different data and combines a specified data column into a new file, like this:
gawk '{
names[$1]= 1;
data[$1,ARGIND]= $2
} END {
for (i in names) print i"\t"data[i,1]"\t"data[i,2]"\t"data[i,3]
}' $1 $2 $3 > combined_data.txt
... where the row IDs can be found in the first column, and the interesting data in the second column.
This works nicely, but not for an arbitrary number of files. I could simply add $4 $5 ... $n in the last line, up to whatever maximum number of files I think I need, and add a matching number of "\t"data[i,4]"\t"data[i,5] ... "\t"data[i,n] in the line above that. This does seem to work even when there are fewer than n files (awk seems to disregard that n is larger than the number of input files in those cases), but it seems like an "ugly" solution. Is there a way to make this script (or something that gives the same result) take an arbitrary number of input files?
Or, even better, can you somehow incorporate a find in there that searches through subfolders and finds files matching some criterion?
Here is some sample data:
file.1:
A 554
B 13
C 634
D 84
E 9
file.2:
C TRUE
E TRUE
F FALSE
expected output:
A 554
B 13
C 634 TRUE
D 84
E 9 TRUE
F FALSE
This may be what you're looking for (uses GNU awk for ARGIND just like your original script):
$ cat tst.awk
BEGIN { OFS="\t" }
!seen[$1]++ { keys[++numKeys]=$1 }
{ vals[$1,ARGIND]=$2 }
END {
for (rowNr=1; rowNr<=numKeys; rowNr++) {
key = keys[rowNr]
printf "%s%s", key, OFS
for (colNr=1; colNr<=ARGIND; colNr++) {
printf "%s%s", vals[key,colNr], (colNr<ARGIND?OFS:ORS)
}
}
}
$ awk -f tst.awk file1 file2
A 554
B 13
C 634 TRUE
D 84
E 9 TRUE
F FALSE
If you don't care about the order the rows are output in then all you need is:
BEGIN { OFS="\t" }
{ vals[$1,ARGIND]=$2; keys[$1] }
END {
for (key in keys) {
printf "%s%s", key, OFS
for (colNr=1; colNr<=ARGIND; colNr++) {
printf "%s%s", vals[key,colNr], (colNr<ARGIND?OFS:ORS)
}
}
}
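Both scripts above depend on ARGIND, which only GNU awk provides. As a rough sketch of the same idea for other awks, the file index can be counted by hand each time FNR resets to 1. The helper name combine_cols and the temp-file demo are made up for illustration, and unlike ARGIND this counter skips empty input files, so an empty file would not get its own column.

```shell
#!/bin/sh
# Sketch: the ordered column-combining script without the gawk-only ARGIND.
# combine_cols is a hypothetical helper name, not part of the answer above.
combine_cols() {
    awk '
    BEGIN { OFS = "\t" }
    FNR == 1 { fileNr++ }                    # a new input file starts (empty files are not counted)
    !seen[$1]++ { keys[++numKeys] = $1 }     # remember row IDs in first-seen order
    { vals[$1, fileNr] = $2 }
    END {
        for (rowNr = 1; rowNr <= numKeys; rowNr++) {
            key = keys[rowNr]
            printf "%s", key
            for (colNr = 1; colNr <= fileNr; colNr++)
                printf "%s%s", OFS, vals[key, colNr]
            printf "%s", ORS
        }
    }' "$@"
}

# Demo with the sample data from the question, written to a temp dir.
tmp=$(mktemp -d)
printf 'A 554\nB 13\nC 634\nD 84\nE 9\n' > "$tmp/file1"
printf 'C TRUE\nE TRUE\nF FALSE\n'        > "$tmp/file2"
out=$(combine_cols "$tmp/file1" "$tmp/file2")
printf '%s\n' "$out"
rm -rf "$tmp"
```

Every row gets one tab-separated field per input file, so a key missing from a file shows up as an empty field rather than shifting later columns to the left.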
You can access an arbitrary number of files via redirected getline on the ARGV list (bypassing awk's default file processing via BEGIN and exit):
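awk 'BEGIN {
    for (i = 1; i < ARGC; ++i) {              # file names are ARGV[1] .. ARGV[ARGC-1]
        while ((getline < ARGV[i]) > 0) {     # > 0 stops at EOF (0) and on read errors (-1)
            ...
        }
        close(ARGV[i])
    }
    <END-type code>
    exit}' $(find . -type f ...)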
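The getline loop above leaves its body elided. The following is one way it could be filled in for the two-column data from the question, offered as a sketch rather than as the answer's intended code. The directory layout and the '*.dat' pattern are invented for the demo. It also swaps $(find ...) for find -print0 | sort -z | xargs -0, which are common GNU and BSD extensions rather than POSIX; they keep file names with spaces intact and make the column order predictable.

```shell
#!/bin/sh
# Sketch: BEGIN + getline over ARGV, fed by find. Paths and pattern are made up.
tmp=$(mktemp -d)
mkdir -p "$tmp/run a" "$tmp/run b"            # spaces in names on purpose
printf 'A 554\nB 13\nC 634\n' > "$tmp/run a/x.dat"
printf 'C TRUE\nF FALSE\n'    > "$tmp/run b/y.dat"
printf 'ignored 1\n'          > "$tmp/run b/notes.txt"

out=$(find "$tmp" -type f -name '*.dat' -print0 | sort -z | xargs -0 awk '
BEGIN {
    OFS = "\t"
    for (i = 1; i < ARGC; i++) {
        while ((getline line < ARGV[i]) > 0) {        # 1 = got a line, 0 = EOF, -1 = error
            if (split(line, f, " ") < 2) continue     # skip blank or one-field lines
            if (!(f[1] in seen)) { seen[f[1]] = 1; keys[++numKeys] = f[1] }
            vals[f[1], i] = f[2]                      # i doubles as the column number
        }
        close(ARGV[i])
    }
    for (r = 1; r <= numKeys; r++) {
        row = keys[r]
        for (c = 1; c < ARGC; c++) row = row OFS vals[keys[r], c]
        print row
    }
    exit
}')
printf '%s\n' "$out"
rm -rf "$tmp"
```

Because the loop index i is already the position of the file on the command line, this variant needs no ARGIND. With a very large number of files xargs may split the list across several awk invocations, which would produce several partial tables instead of one.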
Supposing this naming schema for the input files: 1, 2, ...
gawk '{
names[$1]=$1
data[$1,ARGIND]=$2
}
END {
for (i in names) {
printf("%s\t",i)
for (x=1;x<=ARGIND;x++) {
printf("%s\t", data[i,x])
}
print ""
}
}' [0-9]* > combined_data.txt
Results:
A 554
B 13
C 634 TRUE
D 84
E 9 TRUE
F FALSE
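One caveat with the [0-9]* glob: the shell expands it in lexical order, so once there are ten or more files, file 10 lands before file 2 and the columns follow that order. A small sketch of the difference, assuming file names that are plain numbers as in the naming schema above (the temp directory and file contents are made up):

```shell
#!/bin/sh
# Sketch: lexical glob order versus numeric order for files named 1, 2, 10.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
for n in 1 2 10; do printf 'A %s\n' "$n" > "$n"; done

glob_order=$(echo [0-9]*)                     # order the shell hands to awk

# Unquoted on purpose: the names are plain digits, so word splitting is safe here.
numeric_order=$(awk 'FNR == 1 { print FILENAME }' \
    $(printf '%s\n' [0-9]* | sort -n) | paste -sd ' ' -)

echo "glob:    $glob_order"
echo "numeric: $numeric_order"
cd - >/dev/null || exit 1
rm -rf "$tmp"
```

Passing $(printf '%s\n' [0-9]* | sort -n) in place of the bare glob keeps the output columns in numeric file order.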
Another solution using join, bash, awk and tr, if file1, file2, file3, etc. are sorted:
multijoin.sh
#!/bin/bash
function __t {
join -a1 -a2 -o '1.1 2.1 1.2 2.2' - "$1" |
awk -vFS='[ ]' '{print ($1!=""?$1:$2),$3"_"$4;}';
}
CMD="cat '$1'"
for i in `seq 2 $#`; do
CMD="$CMD | __t '${@:$i:1}'";
done
eval "$CMD | tr '_' '\t' | tr ' ' '\t'";
or, recursive version
#!/bin/bash
function __t {
join -a1 -a2 -o '1.1 2.1 1.2 2.2' - "$1" |
awk -vFS='[ ]' '{print ($1!=""?$1:$2),$3"_"$4;}';
}
function __r {
if [[ "$#" -gt 1 ]]; then
__t "$1" | __r "${@:2}";
else
__t "$1";
fi
}
__r "${@:2}" < "$1" | tr '_' '\t' | tr ' ' '\t'
NOTE: the data cannot contain the character _, because the script uses it as a temporary separator between value columns.
With the sample files,
./multijoin.sh file1 file2
gives
A 554
B 13
C 634 TRUE
D 84
E 9 TRUE
F FALSE
For example, if file3 contains
A 111
D 222
E 333
then
./multijoin.sh file1 file2 file3
gives
A 554 111
B 13
C 634 TRUE
D 84 222
E 9 TRUE 333
F FALSE