How to merge multiple .csv files using the 1st column of one of them as an index (shell scripting)
How can I merge multiple .csv files using the 1st column of one of them as an index (preferably with shell scripting, e.g. awk)?
I have 88 .csv files that look like this. The input files have names like ZBND19X.csv:
==> ZBND19X.csv <==
Gene,ZBND19X(26027342 pairs)
ENSTGUG00000013338 (GAPDH),984.31862471
ENSTGUG00000000915,862.597795025
ENSTGUG00000006651 (ARPP19),845.045872644
ENSTGUG00000005054 (CAMKV),823.404021741
ENSTGUG00000005949 (FTH1),585.628487964
and ZBND39X.csv:
==> ZBND39X.csv <==
Gene,ZBND39X(26558640 pairs)
ENSTGUG00000013338 (GAPDH),971.678203888
ENSTGUG00000005054 (CAMKV),687.81249397
ENSTGUG00000006651 (ARPP19),634.296191033
ENSTGUG00000002582 (ITM2A),613.756010638
ENSTGUG00000000915,588.002298061
The output file should be named RPKM_all.csv:
Gene,ZBND19X(26027342 pairs),ZBND39X(26558640 pairs)
ENSTGUG00000013338 (GAPDH),984.31862471,971.678203888
ENSTGUG00000000915,862.597795025,588.002298061
ENSTGUG00000006651 (ARPP19),845.045872644,634.296191033
ENSTGUG00000005054 (CAMKV),823.404021741,687.81249397
ENSTGUG00000005949 (FTH1),585.628487964,0
ENSTGUG00000002582 (ITM2A),0,613.756010638
A 0 should be added when no corresponding value is found.
join can only work on two files at a time; here comes awk to the rescue!
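$ awk -F, 'NR==1  {h=$1; sep=FS}                    # very first line: seed the header with the key column name
           FNR==1 {c++; h=h sep $2; sep=FS; next}   # header of each file: count the file, append its column name
                  {ks[$1]; a[$1,c]=$2}              # data line: record the key and its value for file c
           END    {print h;
                   for(k in ks)
                     {printf "%s", k;
                      for(i=1;i<=c;i++) printf "%s", FS a[k,i]+0;   # a missing a[k,i] is empty, +0 turns it into 0
                      print ""}}' files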
Disclaimer: this works only if the data fits in memory. The row order will also be lost, but if it matters there are ways to handle it.
Explanation: conceptually, we are creating a table (aka 2D array, matrix) and filling in the entries. The rows are indexed by key and the columns by file number. Since an awk array hashes its keys, the header is handled separately so that it stays in place. a[k,i]+0 converts missing elements to 0.
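One way to handle the row order (a sketch under assumptions, not part of the answer above) is to remember each key the first time it is seen and print the rows in that order. The function name merge_by_first_column and the arrays seen and order are invented for illustration. This variant also prints the stored field text instead of a[k,i]+0, because a number that is concatenated into a string goes through awk's CONVFMT (default %.6g), which can round values such as 984.31862471.

```shell
# Merge CSV files on column 1, keeping first-seen row order.
# merge_by_first_column is a hypothetical helper name.
merge_by_first_column() {
  awk -F, '
    FNR == 1 {                  # header line of each input file
      c++                       # c = index of the current file (output column)
      if (c == 1) h = $1        # take the key column name from the first file
      h = h FS $2               # append the value column name of this file
      next
    }
    !($1 in seen) {             # first time this key appears in any file
      seen[$1] = 1
      order[++n] = $1           # remember insertion order
    }
    { a[$1, c] = $2 }           # cell for (key, file number)
    END {
      print h
      for (r = 1; r <= n; r++) {
        k = order[r]
        line = k
        for (i = 1; i <= c; i++)   # stored text is printed as is, a missing cell becomes 0
          line = line FS (((k, i) in a) ? a[k, i] : 0)
        print line
      }
    }
  ' "$@"
}
```

It would be called as merge_by_first_column ZBND*.csv > RPKM_all.csv. Like the original, it keeps everything in memory.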
The simple answer is join.
You can use the join command to match on the first column (the default) as long as the files are sorted.
Don't forget to sort your files.
Did I mention you need to sort your files ;)? It's an easy mistake to make (I've made that mistake plenty, hence the emphasis).
sort ZBND19X.csv > ZBND19X.csv.sorted
sort ZBND39X.csv > ZBND39X.csv.sorted
join -t, ZBND19X.csv.sorted ZBND39X.csv.sorted > RPKM_all.csv
Here are the contents of RPKM_all.csv after running the above:
ENSTGUG00000000915,862.597795025,588.002298061
ENSTGUG00000005054 (CAMKV),823.404021741,687.81249397
ENSTGUG00000006651 (ARPP19),845.045872644,634.296191033
ENSTGUG00000013338 (GAPDH),984.31862471,971.678203888
Gene,ZBND19X(26027342 pairs),ZBND39X(26558640 pairs)
We can also look for rows that don't match, like this:
$ join -v1 -t, ZBND19X.csv.sorted ZBND39X.csv.sorted | awk -F, -v OFS=, '{print $1,$2,0}'
ENSTGUG00000005949 (FTH1),585.628487964,0
$ join -v2 -t, ZBND19X.csv.sorted ZBND39X.csv.sorted | awk -F, -v OFS=, '{print $1,0,$2}'
ENSTGUG00000002582 (ITM2A),0,613.756010638
Now you can combine the whole thing:
sort ZBND19X.csv > ZBND19X.csv.sorted
sort ZBND39X.csv > ZBND39X.csv.sorted
join -t, ZBND19X.csv.sorted ZBND39X.csv.sorted > RPKM_all.csv
join -v1 -t, ZBND19X.csv.sorted ZBND39X.csv.sorted | awk -F, -v OFS=, '{print $1,$2,0}' >> RPKM_all.csv
join -v2 -t, ZBND19X.csv.sorted ZBND39X.csv.sorted | awk -F, -v OFS=, '{print $1,0,$2}' >> RPKM_all.csv
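The three join calls can also be folded into a single full outer join per file pair, which makes it practical to chain over all 88 files. The following is a sketch only: outer_join_all and its temporary files are invented names, it assumes every file has exactly two comma-separated columns and a one-line header, and it relies on the POSIX join options -a, -e and -o. The header is set aside before sorting so it ends up on top, and LC_ALL=C makes sort and join agree on the ordering.

```shell
# Full outer join of many two-column CSV files on column 1, padding gaps with 0.
# outer_join_all is a hypothetical helper name; it runs in a subshell.
outer_join_all() (
  export LC_ALL=C                          # sort and join must use the same collation
  acc=$(mktemp) cur=$(mktemp) next=$(mktemp)
  trap 'rm -f "$acc" "$cur" "$next"' EXIT

  header=$(head -n 1 "$1")                 # keep the header out of the sorted data
  tail -n +2 "$1" | sort -t, -k1,1 > "$acc"
  fmt=0,1.2                                # output list: key, then field 2 of the accumulator
  n=2                                      # number of fields currently in the accumulator
  shift

  for f in "$@"; do
    header="$header,$(head -n 1 "$f" | cut -d, -f2)"
    tail -n +2 "$f" | sort -t, -k1,1 > "$cur"
    # -a1 -a2 keeps unpaired lines from both sides, -e 0 fills their missing fields
    join -t, -a1 -a2 -e 0 -o "$fmt,2.2" "$acc" "$cur" > "$next"
    mv "$next" "$acc"
    n=$((n + 1))
    fmt="$fmt,1.$n"                        # the accumulator gained one column
  done

  printf '%s\n' "$header"
  cat "$acc"
)
```

It would be called as outer_join_all ZBND*.csv > RPKM_all.csv. The rows come out sorted by key rather than in the original order.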
About the awk code (awk -F, 'FNR==1 {c++; h=h sep $2; sep=FS; next} ...): can anyone explain it in more detail? The code doesn't print the header correctly for me: all the headers jump to different rows, and the first header is missing too.
P21 P22 P24 P24
AamoA_EU022762 1 1 0 0
AamoA_EU099963 0 1 0 0
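As a partial answer to the comment, here is a sketch that isolates the quoted rule so its effect can be seen on its own; header_only is an invented name. The rule fires on the first line of every file, counts the file, and appends only field 2 of that line to h, so taken by itself it never picks up the name of the first column. Because of -F, it also expects comma-separated input: on space- or tab-separated lines field 2 is empty, and the header comes out as bare commas. Whether that is what happened with the data in the comment is an assumption.

```shell
# header_only is a hypothetical helper that runs just the quoted header rule.
header_only() {
  awk -F, '
    FNR == 1 {          # FNR restarts at 1 for every input file
      c++               # count the files seen so far (one output column each)
      h = h sep $2      # append field 2 of this header line to the header string
      sep = FS          # empty before the first file, "," for all later ones
      next              # header lines are never stored as data
    }
    END { print h }     # show only the header that was built
  ' "$@"
}
```

Running header_only ZBND19X.csv ZBND39X.csv on the files from the question would show the two sample column names without the leading Gene.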