Extract specific column from files of a directory into a new file

I have a collection of 11064 files, all with the same file extension, ReadsPerGene.out.tab. They are in one directory. Every file has 556 lines and 4 columns.

Filenames look like this:

The files look like this:
for SRR123.ReadsPerGene.out.tab       for SRR789.ReadsPerGene.out.tab
A    45   67   78                       A    89O  90   34
B    17   40   23                       B    129  96   45
C    27   50   19                       C     60  56   91
...  ...  ...  ...                     ...   ...  ...  ...                                           

First, I want to check whether the first column is identical across all the files.

If so, I want to create an output.txt file with 665 lines and 11065 columns. The 1st column is the first column of every file (because they are the same). Columns 2 through 11065 of output.txt are the 2nd column of each input file, and I want to add the corresponding filename as the first line of each column.

The output.txt looks like this:

      SRR123                SRR789              SRR456        ...
A        45                 89O                66            ...
B        17                 129                480           ...
C        27                  60                78            ...
...      ...               ...               ...             ...

The following are my attempts.
1. get all filenames

cd ~
cd ${filepath}
# loop over the glob directly instead of parsing ls output
for file in *.ReadsPerGene.out.tab
do
   echo "$file" >> ~/filename.txt
done

2. get the first column of every file into one file

cd ~
touch $OUT
for file in $(cat filename.txt)
do
   filePATH=${filepath}$file
   # first column of this file, with the filename inserted as line 1
   cut -f 1 "$filePATH" | sed "1i $file" >$OUT.tmp1
   paste $OUT $OUT.tmp1 >$OUT.tmp
   rm $OUT.tmp1
   mv $OUT.tmp $OUT
done

3. check whether the first column is identical to the other columns in result2.txt
I have no idea how to do this yet.
4. create an output.txt

cd ~
touch $OUT
for file in $(cat filename.txt)
do
   filePATH=${filepath}$file
   # second column of this file, with the filename inserted as line 1
   cut -f 2 "$filePATH" | sed "1i $file" >$OUT.tmp1
   paste $OUT $OUT.tmp1 >$OUT.tmp
   rm $OUT.tmp1
   mv $OUT.tmp $OUT
done

# prepend the shared first column taken from result2.txt
cut -f 1 result2.txt >$OUT.tmp2
paste $OUT.tmp2 $OUT >$OUT.tmp3
rm $OUT.tmp2
mv $OUT.tmp3 $OUT

What should I do to improve my script? It is really slow to run on Linux. Or should I write a Python script to handle it? I have never learned Python or Perl, and I know only a little about Linux.

I'm sorry that my English is poor and that I cannot reply in time. Anyway, thanks for all your answers!

Here is one in awk. The filenames to process are listed in a file named files (because there are so many of them):

$ cat files

Run the awk program in the directory that contains the data files. It uses split to take the first .-separated part of each filename as the column header, so a leading path would make the header names rather long:

$ awk -v OFS="\t" '{
    files[NR]=$0                                # hash filenames from file files
}
END {
    for(i=1;i<=NR;i++) {                        # loop files
        nr=0                                    # reset record counter for each file
        split(files[i],t,".")                   # first .-separated part names the column
        h[nr]=h[nr] OFS t[1]                    # build header
        while((getline < files[i])>0) {         # using getline to read data records
            nr++                                # d[++nr] order not same in all awks
            d[nr]=d[nr] OFS $2                  # append data fields to previous
            if(i==1) {                          # get headers from first file
                h[nr]=$1
                refnr=nr
            } else if($1!=h[nr]) {              # check that they stay the same
                print "Nonmatching field name"
                exit                            # or exit without output
            }
        }
        if(nr!=refnr) {                         # also record count must be the same
            print "Nonmatching record count"
            exit
        }
        close(files[i])                         # avoid running out of open files
    }
    for(i=0;i<=refnr;i++)                       # output part
        print h[i] d[i]
}' files


        SRR123  SRR789
A       45      89O
B       17      129
C       27      60
...     ...     ...

d[++nr] order not same in all awks: Apparently some awks expect d[++nr]=d[nr] OFS $2 and others d[nr]=d[++nr] OFS $2, so a separate nr++ works for both.

Update:

If the data files are in a different directory and the filenames listed in files don't include paths, replace

while((getline < files[i])>0) {

with

file="home/shared/maize/bam_rsem/" files[i]
while((getline < file)>0) {

If the loop also calls close(), pass it the same file variable.
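
For comparison, the same substitution can be sketched in Python: read the bare filenames from the list file and prefix each one with the data directory. This is only an illustrative sketch; the function name resolve_paths and its parameters are hypothetical and not part of the awk answer.

```python
import os


def resolve_paths(list_file, base_dir):
    """Return base_dir joined with every non-empty line of list_file."""
    paths = []
    with open(list_file) as handle:
        for line in handle:
            name = line.strip()
            if name:  # skip blank lines in the list
                paths.append(os.path.join(base_dir, name))
    return paths
```

Calling resolve_paths("files", "home/shared/maize/bam_rsem/") would give the full path for every entry, in the order listed.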





Try this and let me know if it worked in the comments section of this answer.

import pandas as pd
import glob

# adjust the pattern if your files are named differently
files = sorted(glob.glob("*ReadsPerGene.out.tab", recursive=False))

drop_files = dict()   # first column of files that differ from the reference
keep_files = dict()   # second column of files that match the reference
ref_file_name = files[0]
df_ref_file = pd.read_csv(ref_file_name, sep='\t', header=None)
for i, filename in enumerate(files):
    df_file = pd.read_csv(filename, sep='\t', header=None)
    # header=None gives integer column labels, so use 0 and 1, not '0' and '1';
    # Series.equals returns a single bool, unlike != which compares element-wise
    if df_ref_file[0].equals(df_file[0]):
        keep_files.update({filename: df_file[1].tolist()})
    else:
        drop_files.update({filename: df_file[0].tolist()})

df = pd.DataFrame(keep_files, index=df_ref_file[0])
df.index.names = ['ID']
# check the shape of the dataframe
print(df.shape)
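
If pandas is not available, roughly the same check-and-merge can be sketched with only the standard library. The function name merge_second_columns and its parameters are hypothetical. The sketch assumes tab-separated files that fit in memory together, takes the first file as the reference for the first column, and reads each input file exactly once.

```python
import csv
import os


def merge_second_columns(paths, out_path):
    """Write a tab-separated table: row names plus column 2 of every file.

    Raises ValueError if a file's first column differs from the first file's.
    """
    row_names = None
    header = [""]          # empty cell above the row-name column
    columns = []
    for path in paths:
        with open(path, newline="") as handle:
            rows = [row for row in csv.reader(handle, delimiter="\t") if row]
        names = [row[0] for row in rows]
        if row_names is None:
            row_names = names              # the first file is the reference
        elif names != row_names:
            raise ValueError("first column differs in " + path)
        # column header is the part of the file name before the first dot
        header.append(os.path.basename(path).split(".")[0])
        columns.append([row[1] for row in rows])
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        for i, name in enumerate(row_names or []):
            writer.writerow([name] + [col[i] for col in columns])
```

Passing it the sorted list of input files and "output.txt" would write the merged table in one pass, which avoids rewriting a growing output file once per input.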


