How to merge multiple .csv files using the 1st column of one of them as an index (shell scripting)

Question

How can I merge multiple .csv files using the 1st column of one of them as an index (preferably with shell scripting / awk)?

I have 88 .csv files that look like this. Input file names are of the form ZBND19X.csv:

==> ZBND19X.csv <==
Gene,ZBND19X(26027342 pairs)
ENSTGUG00000013338 (GAPDH),984.31862471
ENSTGUG00000000915,862.597795025
ENSTGUG00000006651 (ARPP19),845.045872644
ENSTGUG00000005054 (CAMKV),823.404021741
ENSTGUG00000005949 (FTH1),585.628487964

and ZBND39X.csv:

==> ZBND39X.csv <==
Gene,ZBND39X(26558640 pairs)
ENSTGUG00000013338 (GAPDH),971.678203888
ENSTGUG00000005054 (CAMKV),687.81249397
ENSTGUG00000006651 (ARPP19),634.296191033
ENSTGUG00000002582 (ITM2A),613.756010638
ENSTGUG00000000915,588.002298061

The output file should be named RPKM_all.csv:

Gene,ZBND19X(26027342 pairs),ZBND39X(26558640 pairs)
ENSTGUG00000013338 (GAPDH),984.31862471,971.678203888
ENSTGUG00000000915,862.597795025,588.002298061
ENSTGUG00000006651 (ARPP19),845.045872644,634.296191033
ENSTGUG00000005054 (CAMKV),823.404021741,687.81249397
ENSTGUG00000005949 (FTH1),585.628487964,0
ENSTGUG00000002582 (ITM2A),0,613.756010638

A 0 should be added wherever no corresponding value is found.

Answer 1

join can only work on two files at a time.

awk to the rescue!

$ awk -F, 'FNR==1 {c++; if (c==1) h=$1; h=h FS $2; next}
                  {ks[$1]; a[$1,c]=$2}
           END    {print h;
                   for(k in ks)
                     {printf "%s", k;
                      for(i=1;i<=c;i++) printf "%s", FS (((k,i) in a) ? a[k,i] : 0);
                      print ""}}' files

Disclaimer: this only works if the data fits in memory. The row order is also lost, because for (k in ks) visits the keys in an unspecified order; if the order matters, there are ways to handle it.

Explanation: conceptually, this builds a table (a 2D array, or matrix) and fills in its entries. The rows are indexed by key and the columns by file number. Because awk arrays are hashed and do not keep insertion order, the header is accumulated separately in h so that it stays on the first line; the name of the key column ($1) is taken once, from the first file. The test ((k,i) in a) ? a[k,i] : 0 prints 0 for missing entries. The shorter a[k,i]+0 also yields 0 for them, but it turns every value into a number, and awk formats non-integer numbers with CONVFMT (%.6g by default) when concatenating, so 984.31862471 would be printed as 984.319.
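If the row order matters, one way to keep it is to record each key the first time it is seen and replay that list in the END block. The sketch below is a variant of the awk approach in this answer, not the answerer's own code; the sample files A.csv and B.csv and the function name merge_csv are made up for illustration. It assumes every file has exactly two comma-separated columns and a header line, and it tests for missing cells with `in` so that stored values are printed exactly as they were read.

```shell
#!/bin/sh
# Sketch: merge CSV files on column 1, keeping first-seen key order.
# The sample files and the merge_csv name are illustrative assumptions.
set -eu
dir=$(mktemp -d)
cd "$dir"

cat > A.csv <<'EOF'
Gene,A(10 pairs)
g1 (X),1.5
g2,2.25
g3,3
EOF

cat > B.csv <<'EOF'
Gene,B(20 pairs)
g1 (X),9.125
g4,4
g2,8
EOF

merge_csv() {
  awk -F, '
    FNR == 1 {                      # header line of each input file
      c++                           # one more value column
      if (c == 1) h = $1            # take the key column name once
      h = h FS $2                   # append this file column name
      next
    }
    !($1 in seen) {                 # first time this key appears
      seen[$1] = 1
      order[++n] = $1               # remember arrival order
    }
    { a[$1, c] = $2 }               # cell for (key, file number)
    END {
      print h
      for (j = 1; j <= n; j++) {
        k = order[j]
        line = k
        for (i = 1; i <= c; i++)    # 0 for cells that were never filled
          line = line FS (((k, i) in a) ? a[k, i] : 0)
        print line
      }
    }
  ' "$@"
}

merge_csv A.csv B.csv > merged.csv
cat merged.csv
```

With these two sample files, keys come out in the order g1 (X), g2, g3, g4: g3 gets a 0 in the B column and g4 gets a 0 in the A column. For the real data the call would be something like merge_csv ZBND*.csv > RPKM_all.csv, with the same in-memory caveat as above.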

Answer 2

The simple answer is 'join'.

You can use the join command to match on the first column (the default) as long as the files are sorted.

Don't forget to sort your files.

Did I mention you need to sort your files ;)? It's an easy mistake to make (I've made it plenty of times; hence the emphasis).

sort  ZBND19X.csv >  ZBND19X.csv.sorted
sort ZBND39X.csv  > ZBND39X.csv.sorted
join -t, ZBND19X.csv.sorted ZBND39X.csv.sorted > RPKM_all.csv

Here are the contents of RPKM_all.csv after running the above:

ENSTGUG00000000915,862.597795025,588.002298061
ENSTGUG00000005054 (CAMKV),823.404021741,687.81249397
ENSTGUG00000006651 (ARPP19),845.045872644,634.296191033
ENSTGUG00000013338 (GAPDH),984.31862471,971.678203888
Gene,ZBND19X(26027342 pairs),ZBND39X(26558640 pairs)

We can also look for rows that don't match, like this:

$ join -v1 -t, ZBND19X.csv.sorted ZBND39X.csv.sorted | awk -F, -v OFS=, '{print $1,$2,0}'
ENSTGUG00000005949 (FTH1),585.628487964,0

$ join -v2 -t, ZBND19X.csv.sorted ZBND39X.csv.sorted | awk -F, -v OFS=, '{print $1,0,$2}'
ENSTGUG00000002582 (ITM2A),0,613.756010638

Now you can combine the whole thing:

sort  ZBND19X.csv >  ZBND19X.csv.sorted
sort ZBND39X.csv  > ZBND39X.csv.sorted
join -t, ZBND19X.csv.sorted ZBND39X.csv.sorted > RPKM_all.csv
join -v1 -t, ZBND19X.csv.sorted ZBND39X.csv.sorted | awk -F, -v OFS=, '{print $1,$2,0}' >> RPKM_all.csv
join -v2 -t, ZBND19X.csv.sorted ZBND39X.csv.sorted | awk -F, -v OFS=, '{print $1,0,$2}'  >> RPKM_all.csv
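The three join calls above handle two files; for many files the same idea can be folded into a loop that joins the running result with the next file. The sketch below is one way to do that and rests on several assumptions: GNU coreutils join (the -o auto format is a GNU extension, and -a1 -a2 -e 0 then fills unmatched cells with 0), exactly two columns per file, and made-up sample files A.csv, B.csv and C.csv. It sorts on the key field only (sort -t, -k1,1) under LC_ALL=C, so that sort and join agree on the order even when one key is a prefix of another, and it keeps the header lines out of the sort.

```shell
#!/bin/sh
# Sketch: full outer join of several two-column CSV files, two at a time.
# Assumes GNU join (-o auto); sample files are illustrative only.
set -eu
export LC_ALL=C                     # same collation for sort and join
dir=$(mktemp -d)
cd "$dir"

printf '%s\n' 'Gene,A' 'g2,2.25' 'g1 (X),1.5' 'g3,3' > A.csv
printf '%s\n' 'Gene,B' 'g4,4' 'g1 (X),9.125' > B.csv
printf '%s\n' 'Gene,C' 'g3,7' 'g5,5' > C.csv

header=Gene
first=1
for f in A.csv B.csv C.csv; do
  # collect the column name from the header line of each file
  header="$header,$(head -n 1 "$f" | cut -d, -f2)"
  # sort only the data rows, on the key field
  tail -n +2 "$f" | sort -t, -k1,1 > body.cur
  if [ "$first" -eq 1 ]; then
    mv body.cur body.acc            # first file seeds the result
    first=0
  else
    # keep unpaired rows from both sides, fill gaps with 0
    join -t, -a1 -a2 -e 0 -o auto body.acc body.cur > body.new
    mv body.new body.acc
  fi
done

{ printf '%s\n' "$header"; cat body.acc; } > RPKM_all.csv
cat RPKM_all.csv
```

Unlike the awk approach, this does not hold the whole table in memory, but the rows come out sorted by key rather than in their original order.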

Answer 3

About the awk code (awk -F, 'FNR==1 {c++; h=h sep $2; sep=FS; next} ...): can anyone explain it in more detail? The code doesn't print the header correctly; the headers jump to different rows, and the first header is missing too.

P21 P22 P24 P24

AamoA_EU022762 1 1 0 0
AamoA_EU099963 0 1 0 0

Answer 1
1 accepted 2017-06-08 20:38:06

Answer 2
0 2017-06-08 19:37:06

Answer 3
0 2022-01-08 00:03:08