Process multiple files using awk

I've got to process lots of txt files (16 million rows each) using awk. For example, I've got to read ten files:

File #1:

en sample_1 200
en.n sample_2 10
en sample_3 10

File #2:

en sample_1 10
en sample_3 67

File #3:

en sample_1 1
en.n sample_2 10
en sample_4 20

...

I would like to have an output like this:

source title f1 f2 f3 sum(f1,f2,f3)

en sample_1 200 10 1 211
en.n sample_2 10 0 10 20
en sample_3 10 67 0 77
en sample_4 0 0 20 20 

Here is my first version of the code:

#! /bin/bash
clear
#var declaration
BASEPATH=<path_to_file>
YEAR="2014"
RES_FOLDER="processed"
FINAL_RES="2014_06_01"
#results folder creation
mkdir $RES_FOLDER
#processing
awk 'NF>0{a[$1" "$2]=a[$1" "$2]" "$3}END{for(i in a){print i a[i]}}' $BASEPATH/$YEAR/* > $RES_FOLDER/$FINAL_RES

And here is my output:

en sample_1 200 10 1
en.n sample_2 10 10
en sample_3 10 67
en sample_4 20

I'm a little bit confused about how to put a zero in the columns where no occurrence is found, and how to get the sum of all the values. I know I have to use something like this:

{tot[$1" "$2]+=$3} END{for (key in tot) print key, tot[key]}

Hope someone will help. Thank you.

******** EDITED ********

I'm trying to achieve my result in a different way. I created a bash script like the one below: it produces a sorted file with all of my keys. It's very large, about 62 million records, so I slice it into pieces and pass each piece to my awk script.

BASH:

#! /bin/bash
clear
FILENAME=<result>
BASEPATH=<base_path>
mkdir processed/slice
# collect every source+title key from the dataset
cat $BASEPATH/dataset/* | cut -d' ' -f1,2 > $BASEPATH/processed/aggr
# sort and deduplicate the key list (keep every distinct source+title pair)
sort -u $BASEPATH/processed/aggr > $BASEPATH/processed/sorted
# slice the key list into chunks of 1,000,000 lines
split -d -l 1000000 processed/sorted processed/slice/slice-
echo $(date "+START PROCESSING DATE: %d/%m/%y - TIME: %H:%M:%S")
# run algorithm.awk once per slice against the whole dataset
for filename in processed/slice/*; do
  awk -v filename="$filename" -f algorithm.awk dataset/* >> processed/$FILENAME
done
echo $(date "+END PROCESSING DATE: %d/%m/%y - TIME: %H:%M:%S")
rm $BASEPATH/processed/aggr
rm $BASEPATH/processed/sorted
rm -rf $BASEPATH/processed/slice

AWK:

BEGIN{
# preload every key from the current slice and preallocate a "0" for each of the 10 dataset files
while(getline < filename){
 key=$1" "$2;
 sources[key];
 for(i=1;i<11;i++){
   keys[key"-"i] = "0";
 }
}
close(filename);
}
{
# advance the file index on the first line of each dataset file
if(FNR==1){
 ARGIND++;
}
# store the value of this key for the current file
key=$1" "$2;
keys[key"-"ARGIND] = $3
}END{
# one row per key: the 10 per-file values followed by their sum
for (s in sources) {
 sum = 0
 printf "%s", s
 for (j=1;j<11;j++) {
   printf "%s%s", OFS, keys[s"-"j]
   sum += keys[s"-"j]
 }
print " "sum
}
}

With awk I preallocate my final array, and while reading the dataset/* folder I populate its content. I've figured out that my bottleneck comes from iterating over the dataset folder as awk input (10 files with 16,000,000 lines each). Everything works on a small set of data, but with the real data, RAM (30GB) gets congested. Does anyone have any suggestions or advice? Thank you.
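One way to keep the memory bounded could be to let an external sort do the grouping on disk and then stream the sorted records, holding only one key in memory at a time. A minimal sketch, assuming the ten input files sit in dataset/ and that GNU awk (for ARGIND) and GNU sort are available:

awk 'NF>0{ print $1, $2, ARGIND, $3 }' dataset/* |
sort -k1,1 -k2,2 |
awk -v nfiles=10 '
    # print the finished row for the key collected so far, then reset
    function flush(   i, sum, row) {
        if (key == "") return
        sum = 0; row = key
        for (i = 1; i <= nfiles; i++) { row = row OFS (val[i] + 0); sum += val[i] }
        print row, sum
        delete val
    }
    { k = $1 OFS $2 }
    k != key { flush(); key = k }   # sorted input: a new key means the old one is complete
    { val[$3] = $4 }                # remember the value from file number $3 for this key
    END { flush() }
'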

$ cat tst.awk
{
    key = $1" "$2
    keys[key]
    val[key,ARGIND] = $3
}
END {
    for (key in keys) {
        sum = 0
        printf "%s", key
        for (fileNr=1;fileNr<=ARGIND;fileNr++) {
            printf "%s%s", OFS, val[key,fileNr]+0
            sum += val[key,fileNr]
        }
        print OFS sum
    }
}

$ awk -f tst.awk file1 file2 file3
en sample_4 0 0 20 20
en.n sample_2 10 0 10 20
en sample_1 200 10 1 211
en sample_3 10 67 0 77

The above uses GNU awk for ARGIND; with other awks, just add the line FNR==1{ARGIND++} at the start. Pipe the output to sort if necessary, for example as in the sketch below.
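A portable variant might look like this (a sketch: the same script with the extra FNR==1 line, saved as a hypothetical tst_portable.awk, and the output piped through sort on the title field):

$ cat tst_portable.awk
FNR==1 { ARGIND++ }    # emulate gawk's ARGIND: count input files as they are opened
{
    key = $1" "$2
    keys[key]
    val[key,ARGIND] = $3
}
END {
    for (key in keys) {
        sum = 0
        printf "%s", key
        for (fileNr=1;fileNr<=ARGIND;fileNr++) {
            printf "%s%s", OFS, val[key,fileNr]+0
            sum += val[key,fileNr]
        }
        print OFS sum
    }
}

$ awk -f tst_portable.awk file1 file2 file3 | sort -k2
en sample_1 200 10 1 211
en.n sample_2 10 0 10 20
en sample_3 10 67 0 77
en sample_4 0 0 20 20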

awk -vn="<source> <title>" 'function w(m,p){while(split(a[m],t)!=b+2)sub(p," 0&",a[m])}FNR<2{f=FILENAME;o=o?o" <"f">":"<"f">";q=q?q","f:f;++b}{a[$1" "$2]=a[$1" "$2]?a[$1" "$2]" "$NF:$0;w($1" "$2," [^ ]*$");c[$1" "$2]+=$NF}END{print n,o,"sum<("q")>";for(i in a){w(i,"$");print a[i],c[i]|"sort -k2"}}' *
<source> <title> <f1> <f2> <f3> sum<(f1,f2,f3)>
en sample_1 200 10 1 211
en.n sample_2 10 0 10 20
en sample_3 10 67 0 77
en sample_4 0 0 20 20
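For readability, here is the same one-liner laid out across several lines, with comments on what each part does (the logic is unchanged):

awk -v n="<source> <title>" '
# pad row a[m]: insert " 0" at pattern p until it has b+2 fields
# (source + title + one value per file seen so far)
function w(m,p){ while (split(a[m],t) != b+2) sub(p," 0&",a[m]) }
FNR<2{                                  # first line of each input file
    f = FILENAME
    o = o ? o" <"f">" : "<"f">"         # header part: "<file1> <file2> ..."
    q = q ? q","f : f                   # header part: "file1,file2,..."
    ++b                                 # file counter
}
{
    a[$1" "$2] = a[$1" "$2] ? a[$1" "$2]" "$NF : $0   # append the value from this file to the row
    w($1" "$2," [^ ]*$")                # zero-fill the files where this key was absent
    c[$1" "$2] += $NF                   # running sum per key
}
END{
    print n, o, "sum<("q")>"            # header line
    for (i in a) {
        w(i,"$")                        # zero-fill trailing files
        print a[i], c[i] | "sort -k2"   # data rows, sorted by title
    }
}' *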

Since your files are quite large, you might want to use join -- it might be faster and/or use less memory. However, it requires the files to be sorted and to have a single join field.

join -a1 -a2 -e0 -o0,1.2,2.2     <(sed $'s/ /\034/' file1 | sort) \
                                 <(sed $'s/ /\034/' file2 | sort) | 
join -a1 -a2 -e0 -o0,1.2,1.3,2.2 - \
                                 <(sed $'s/ /\034/' file3 | sort) | 
awk '{sub(/\034/," "); print $0, $3+$4+$5}' 

Explanation provided upon request.
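Since the real data set has ten files, the chain grows by one join per file; for instance, with a hypothetical fourth file file4 it would become:

join -a1 -a2 -e0 -o0,1.2,2.2         <(sed $'s/ /\034/' file1 | sort) \
                                     <(sed $'s/ /\034/' file2 | sort) |
join -a1 -a2 -e0 -o0,1.2,1.3,2.2     - \
                                     <(sed $'s/ /\034/' file3 | sort) |
join -a1 -a2 -e0 -o0,1.2,1.3,1.4,2.2 - \
                                     <(sed $'s/ /\034/' file4 | sort) |
awk '{sub(/\034/," "); print $0, $3+$4+$5+$6}'

Each additional file adds one more join stage, one more field in its -o list, and one more term in the final awk sum.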
