
How to merge multiple .csv files using the 1st column of one of them as an index (shell scripting)

How can I merge multiple .csv files using the 1st column of one of them as an index (preferably with shell scripting, e.g. awk)?

I have 88 .csv files that look like this. Input file name ZBND19X.csv:

==> ZBND19X.csv <==
Gene,ZBND19X(26027342 pairs)
ENSTGUG00000013338 (GAPDH),984.31862471
ENSTGUG00000000915,862.597795025
ENSTGUG00000006651 (ARPP19),845.045872644
ENSTGUG00000005054 (CAMKV),823.404021741
ENSTGUG00000005949 (FTH1),585.628487964

and ZBND39X.csv:

==> ZBND39X.csv <==
Gene,ZBND39X(26558640 pairs)
ENSTGUG00000013338 (GAPDH),971.678203888
ENSTGUG00000005054 (CAMKV),687.81249397
ENSTGUG00000006651 (ARPP19),634.296191033
ENSTGUG00000002582 (ITM2A),613.756010638
ENSTGUG00000000915,588.002298061

Output file name RPKM_all.csv:

Gene,ZBND19X(26027342 pairs),ZBND39X(26558640 pairs)
ENSTGUG00000013338 (GAPDH),984.31862471,971.678203888
ENSTGUG00000000915,862.597795025,588.002298061
ENSTGUG00000006651 (ARPP19),845.045872644,634.296191033
ENSTGUG00000005054 (CAMKV),823.404021741,687.81249397
ENSTGUG00000005949 (FTH1),585.628487964,0
ENSTGUG00000002582 (ITM2A),0,613.756010638

A 0 should be added when no corresponding value is found.

join can only work on two files at a time; here awk comes to the rescue!

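$ awk -F, '
    # First line of each file: count the file and append its column header.
    FNR==1 {if (!c++) h=$1; h=h FS $2; next}
    # Data lines: remember the key and store the value under (key, file number).
           {ks[$1]; a[$1,c]=$2}
    END    {print h
            for (k in ks) {
              printf "%s", k
              # Print stored values verbatim, or 0 where a file has no entry.
              for (i=1; i<=c; i++) printf "%s%s", FS, ((k,i) in a ? a[k,i] : 0)
              print ""
            }}' files

Disclaimer: this works only if the data fits in memory. The row order will also be lost, but if it is important there are ways to handle it.

Explanation: conceptually, this creates a table (a 2D array, or matrix) and fills in the entries. The rows are indexed by key and the columns by file number. Since awk arrays hash their keys, iteration order is not defined, so the header is handled separately to keep it in place; the name of the key column is taken from the first file. The test (k,i) in a prints 0 for missing elements. A shorter a[k,i]+0 would also yield 0 for missing elements, but it converts each value to a number, which awk formats with CONVFMT (%.6g by default) when concatenating, so 984.31862471 would come out as 984.319.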
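
If the row order matters, one way to keep it is to record each key the first time it is seen and replay that list in END. This is a minimal sketch, assuming the same two-column layout as the sample files; the names merge_ordered, order, and n are introduced here for illustration only:

```shell
# merge_ordered FILE...: merge two-column CSVs on column 1,
# keeping keys in the order in which they are first seen.
merge_ordered() {
  awk -F, '
    FNR==1      { if (!c++) h=$1; h=h FS $2; next }  # header line of each file
    !($1 in ks) { ks[$1]; order[++n]=$1 }            # remember first-seen key order
                { a[$1,c]=$2 }                       # value for (key, file number)
    END {
      print h
      for (j=1; j<=n; j++) {
        k=order[j]
        printf "%s", k
        # stored values are printed verbatim; 0 where a file has no entry
        for (i=1; i<=c; i++) printf "%s%s", FS, ((k,i) in a ? a[k,i] : 0)
        print ""
      }
    }' "$@"
}
```

It could be called as merge_ordered ZBND*.csv > RPKM_all.csv. The memory caveat still applies, since every value is held in the array until END.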

The simple answer is 'join'.

You can use the join command to match on the first column (the default) as long as the files are sorted.

Don't forget to sort your files.

Did I mention you need to sort your files ;)? It's an easy mistake to make (I've made that mistake plenty, hence the emphasis).

# sort on the join field (column 1) so the order matches what join expects
sort -t, -k1,1 ZBND19X.csv > ZBND19X.csv.sorted
sort -t, -k1,1 ZBND39X.csv > ZBND39X.csv.sorted
join -t, ZBND19X.csv.sorted ZBND39X.csv.sorted > RPKM_all.csv

Here are the contents of RPKM_all.csv after running the commands above:

ENSTGUG00000000915,862.597795025,588.002298061
ENSTGUG00000005054 (CAMKV),823.404021741,687.81249397
ENSTGUG00000006651 (ARPP19),845.045872644,634.296191033
ENSTGUG00000013338 (GAPDH),984.31862471,971.678203888
Gene,ZBND19X(26027342 pairs),ZBND39X(26558640 pairs)

We can also look for rows that don't match, like this:

$ join -v1 -t, ZBND19X.csv.sorted ZBND39X.csv.sorted | awk -F, -v OFS=, '{print $1,$2,0}'
ENSTGUG00000005949 (FTH1),585.628487964,0

$ join -v2 -t, ZBND19X.csv.sorted ZBND39X.csv.sorted | awk -F, -v OFS=, '{print $1,0,$2}'
ENSTGUG00000002582 (ITM2A),0,613.756010638

Now you can combine the whole thing:

sort -t, -k1,1 ZBND19X.csv > ZBND19X.csv.sorted
sort -t, -k1,1 ZBND39X.csv > ZBND39X.csv.sorted
join -t, ZBND19X.csv.sorted ZBND39X.csv.sorted > RPKM_all.csv
join -v1 -t, ZBND19X.csv.sorted ZBND39X.csv.sorted | awk -F, -v OFS=, '{print $1,$2,0}' >> RPKM_all.csv
join -v2 -t, ZBND19X.csv.sorted ZBND39X.csv.sorted | awk -F, -v OFS=, '{print $1,0,$2}'  >> RPKM_all.csv
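
The three join calls can likely be collapsed into one on systems with GNU join, which accepts -a1 -a2 (keep unpaired lines from both files), -e (filler for missing fields), and -o auto (take the field count from each file's first line). This is a sketch under that assumption; outer_join is a name introduced here, and -o auto is a GNU extension that other join implementations may not provide:

```shell
# outer_join A B: full outer join of two comma-separated files on column 1,
# writing 0 where one file has no row for a key.
# Assumes GNU join (for -o auto). Runs in a subshell so temp names stay local.
outer_join() (
  a=$(mktemp) b=$(mktemp)
  # sort and join must agree on collation, so pin the locale for all three
  LC_ALL=C sort -t, -k1,1 "$1" > "$a"
  LC_ALL=C sort -t, -k1,1 "$2" > "$b"
  LC_ALL=C join -t, -a1 -a2 -e 0 -o auto "$a" "$b"
  rm -f "$a" "$b"
)
```

For example, outer_join ZBND19X.csv ZBND39X.csv > RPKM_all.csv. As in the commands above, the header line is sorted along with the data rows, so it does not necessarily end up first.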

Regarding the awk code (awk -F, 'FNR==1 {c++; h=h sep $2; sep=FS; next} ...): can anyone explain it in more detail? The code doesn't print the header correctly: the headers jump to different rows, and the first header is missing too.

P21 P22 P24 P24
AamoA_EU022762 1 1 0 0
AamoA_EU099963 0 1 0 0
