
How to extract specific columns from different files and output in one file?

I have 12 files in a directory; each file has 4 columns. The first column is a gene name and the remaining 3 are count columns. I want to extract columns 1 and 4 from each of the 12 files and paste them into one output file. Since the first column is identical in every file, the output should contain it only once, followed by the 4th column of each file. I do not want to use R here; I am a big fan of awk. So I was trying something like the below, but it did not work.

My input files look like this.

Input file 1:

ZYG11B  8267    16.5021 2743.51
ZYG11A  4396    0.28755 25.4208
ZXDA    5329    2.08348 223.281
ZWINT   1976    41.7037 1523.34
ZSCAN5B 1751    0.0375582   1.32254
ZSCAN30 4471    4.71253 407.923
ZSCAN23 3286    0.347228    22.9457
ZSCAN20 4343    3.89701 340.361
ZSCAN2  3872    3.13983 159.604
ZSCAN16-AS1 2311    1.1994  50.9903

Input file 2:

ZYG11B  8267    18.2739 2994.35
ZYG11A  4396    0.227859    19.854
ZXDA    5329    2.44019 257.746
ZWINT   1976    8.80185 312.072
ZSCAN5B 1751    0   0
ZSCAN30 4471    9.13324 768.278
ZSCAN23 3286    1.03543 67.4392
ZSCAN20 4343    3.70209 318.683
ZSCAN2  3872    5.46773 307.038
ZSCAN16-AS1 2311    3.18739 133.556

Input file 3:

ZYG11B  8267    20.7202 3593.85
ZYG11A  4396    0.323899    29.8735
ZXDA    5329    1.26338 141.254
ZWINT   1976    56.6215 2156.05
ZSCAN5B 1751    0.0364084   1.33754
ZSCAN30 4471    6.61786 596.161
ZSCAN23 3286    0.79125 54.5507
ZSCAN20 4343    3.9199  357.177
ZSCAN2  3872    5.89459 267.58
ZSCAN16-AS1 2311    2.43055 107.803

Desired output from the above:

ZYG11B  2743.51 2994.35 3593.85
ZYG11A  25.4208 19.854  29.8735
ZXDA    223.281 257.746 141.254
ZWINT   1523.34 312.072 2156.05
ZSCAN5B 1.32254 0   1.33754
ZSCAN30 407.923 768.278 596.161
ZSCAN23 22.9457 67.4392 54.5507
ZSCAN20 340.361 318.683 357.177
ZSCAN2  159.604 307.038 267.58
ZSCAN16-AS1 50.9903 133.556 107.803

As you can see above, the output takes the first column and the 4th column from each file. Since the first column of each file is the same, I just want to keep it once; the rest of the output is the 4th column of each file. I have only shown 3 files here, but it should work for all the files in the directory at once, since they all follow the same naming convention: file1_quant.genes.sf, file2_quant.genes.sf, file3_quant.genes.sf.

Every file has the same first column but different counts in the remaining columns. My idea is to create one output file containing the 1st column and the 4th column from all the files.

awk '{print $1,$2,$4}' *_quant.genes.sf > genes.estreads

Any heads up?

If I understand you correctly, what you're looking for is one line per key, collated from multiple files.

The tool you need for this job is an associative array. I think awk can do it, but I'm not 100% sure. I'd probably tackle it in perl though:

#!/usr/bin/perl
use strict;
use warnings;

# an associative array, or hash as perl calls it
my %data;

#iterate the input files (sort might be irrelevant here) 
foreach my $file ( sort glob("*_quant.genes.sf") ) {
    #open the file for reading. 
    open( my $input, '<', $file ) or die $!;
    #iterate line by line. 
    while (<$input>) {
        #extract the data - splitting on any whitespace. 
        my ( $key, @values ) = split; 
        # add 'column 4' to the hash (of arrays)
        push( @{$data{$key}}, $values[2] );  
    }
    close($input);
}

#start output 
open( my $output, '>', 'genes.estreads' ) or die $!;
#sort, because hashes are explicitly unordered. 
foreach my $key ( sort keys %data ) { 
    # print the key and all the elements collected.
    print {$output} join( "\t", $key, @{ $data{$key} } ), "\n";
}
close($output);

With data as specified above, this produces:

ZSCAN16-AS1 50.9903 133.556 107.803
ZSCAN2  159.604 307.038 267.58
ZSCAN20 340.361 318.683 357.177
ZSCAN23 22.9457 67.4392 54.5507
ZSCAN30 407.923 768.278 596.161
ZSCAN5B 1.32254 0   1.33754
ZWINT   1523.34 312.072 2156.05
ZXDA    223.281 257.746 141.254
ZYG11A  25.4208 19.854  29.8735
ZYG11B  2743.51 2994.35 3593.85

The following is how you do it in awk:

awk 'BEGIN{FS = " "};{print $1, $4}' *|awk 'BEGIN{FS = " "};{temp = x[$1];x[$1] = temp  " "  $2;};END {for(xx in x) print xx,x[xx]}'

As cryptic as it looks, I am just using associative arrays.


Here is the solution broken down:

  1. Just print the key and the value, one per line.

    print $1, $2

  2. Store the data in an associative array, and keep updating it:

    temp = x[$1]; x[$1] = temp " " $2

  3. Display it:

    for(xx in x) print xx,x[xx]

Sample run:

 [cloudera@quickstart test]$ cat f1
 A k1
 B k2
 [cloudera@quickstart test]$ cat f2
 A k3
 B k4
 C k1
 [cloudera@quickstart test]$ awk 'BEGIN{FS = " "};{print $1, $2}' *|awk 'BEGIN{FS = " "};{temp = x[$1];x[$1] = temp " " $2;};END {for(xx in x) print xx,x[xx]}'
 A  k1 k3
 B  k2 k4
 C  k1

As a side note, the approach should be reminiscent of the Map Reduce paradigm.

awk '{E[$1]=E[$1] "\t" $4}END{for(K in E)print K E[K]}' *_quant.genes.sf > genes.estreads

The order is the order of appearance while reading the files (so it is generally based on the first file read).
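Since `for (K in E)` gives no guaranteed key order in awk, the result can simply be piped through `sort` when a reproducible gene order is wanted. A minimal self-contained sketch (the scratch directory and sample values below are illustrative only):

```shell
# Create two tiny sample files in a scratch directory (made-up data).
tmp=$(mktemp -d)
printf 'ZYG11B 8267 16.5 2743.51\nZXDA 5329 2.08 223.281\n' > "$tmp/file1_quant.genes.sf"
printf 'ZYG11B 8267 18.2 2994.35\nZXDA 5329 2.44 257.746\n' > "$tmp/file2_quant.genes.sf"

# Same associative-array one-liner, piped through sort for a deterministic order.
result=$(awk '{E[$1]=E[$1] "\t" $4} END{for(K in E) print K E[K]}' "$tmp"/*_quant.genes.sf | sort)
printf '%s\n' "$result"
rm -r "$tmp"
```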

If the first column is the same in all the files, you can use paste:

paste <(tabify f1 | cut -f1,4) \
      <(tabify f2 | cut -f4)   \
      <(tabify f3 | cut -f4)

Where tabify changes consecutive spaces to tabs:

sed 's/ \+/\t/g' "$@"

and f1, f2, f3 are the input files' names.
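To avoid writing one process substitution per file, the same paste idea can be driven by a loop over the whole directory. A sketch, assuming the files match `*_quant.genes.sf` and are whitespace-separated (the scratch directory and sample data below are made up for illustration):

```shell
# Illustrative sample files in a scratch directory.
tmp=$(mktemp -d)
printf 'ZYG11B 8267 16.5 2743.51\nZXDA 5329 2.08 223.281\n' > "$tmp/file1_quant.genes.sf"
printf 'ZYG11B 8267 18.2 2994.35\nZXDA 5329 2.44 257.746\n' > "$tmp/file2_quant.genes.sf"

merged="$tmp/merged"
first=1
for f in "$tmp"/*_quant.genes.sf; do
    if [ "$first" -eq 1 ]; then
        # Keep the gene-name column only once, taken from the first file.
        awk -v OFS='\t' '{print $1, $4}' "$f" > "$merged"
        first=0
    else
        # Append this file's 4th column; paste reads it from stdin via '-'.
        awk '{print $4}' "$f" | paste "$merged" - > "$merged.new"
        mv "$merged.new" "$merged"
    fi
done
cat "$merged"
```

This keeps the gene order of the first file, unlike the hash-based approaches, and works for any number of files.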

Here is another way to do it in Perl:

 perl -lane '$data{$F[0]} .= " $F[3]"; END { print "$_ $data{$_}" for keys %data }' input_file_1 input_file_2 input_file_3

Here's another way of doing it with awk, and it supports using multiple files.

awk 'FNR==1{f++}{a[f,FNR]=$1}{b[f,FNR]=$4}END{for(x=1;x<=FNR;x++){printf("%s ",a[1,x]);for(y=1;y<ARGC;y++)printf("%s ",b[y,x]);print ""}}' input1.txt input2.txt input3.txt

That line of code gives the following output:

ZYG11B  2743.51 2994.35 3593.85  
ZYG11A  25.4208 19.854 29.8735  
ZXDA  223.281 257.746 141.254  
ZWINT  1523.34 312.072 2156.05  
ZSCAN5B  1.32254 0 1.33754  
ZSCAN30  407.923 768.278 596.161  
ZSCAN23  22.9457 67.4392 54.5507  
ZSCAN20  340.361 318.683 357.177  
ZSCAN2  159.604 307.038 267.58  
ZSCAN16-AS1  50.9903 133.556 107.803 
