Extract a specific column from the files in a directory into a new file

Question

I have collected 11064 files, all with the same file extension, ReadsPerGene.out.tab. They are all in one directory. Every file has 556 lines and 4 columns.

Filenames look like this:
SRR123.ReadsPerGene.out.tab
SRR456.ReadsPerGene.out.tab
SRR555.ReadsPerGene.out.tab
DRR789.ReadsPerGene.out.tab
...

The files look like this:
for SRR123.ReadsPerGene.out.tab       for SRR789.ReadsPerGene.out.tab
A    45   67   78                       A    89O  90   34
B    17   40   23                       B    129  96   45
C    27   50   19                       C     60  56   91
...  ...  ...  ...                     ...   ...  ...  ...

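First, I want to check whether the first column is identical in all files.

If it is, I want to create an output.txt file with 665 lines and 11065 columns. The first column of output.txt is the first column of the input files (since they are all the same). Columns 2 through 11065 of output.txt are the second column of each input file, and I want the corresponding filename added as the first line of each of those columns.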

The output.txt looks like this:

      SRR123                SRR789              SRR456        ...
A        45                 89O                66            ...
B        17                 129                480           ...
C        27                  60                78            ...
...      ...               ...               ...             ...

Here is my attempt.

1. Get all the filenames

#!/bin/bash
cd ~
filepath=/home/shared/maize/bam_rsem
cd ${filepath}
for file in $(ls *.ReadsPerGene.out.tab)
do
   echo $file >> ~/filename.txt
done

2. Collect the first column of every file into one file

#!/bin/bash
cd ~
OUT=result2.txt
touch $OUT
filepath=/home/shared/maize/bam_rsem/
for file in $(cat filename.txt)
do
   filePATH=`echo ${filepath}$file`
   cut -f 1 $filePATH | sed 1i\ ${file} >$OUT.tmp1
   paste $OUT $OUT.tmp1 >$OUT.tmp
   rm $OUT.tmp1
   mv $OUT.tmp $OUT
done

3. Check whether the first column in result2.txt is the same as all the other columns

I don't know how to do this yet.

4. Create output.txt

#!/bin/bash
cd ~
OUT=result2.txt
touch $OUT
filepath=/home/shared/maize/bam_rsem/
for file in $(cat filename.txt)
do
   filePATH=`echo ${filepath}$file`
   cut -f 1 $filePATH | sed 1i\ ${file} >$OUT.tmp1
   paste $OUT $OUT.tmp1 >$OUT.tmp
   rm $OUT.tmp1
   mv $OUT.tmp $OUT
done

cut -f 1 result2.txt >$OUT.tmp2
paste $OUT.tmp2 $OUT >$OUT.tmp3
rm $OUT.tmp2
mv $OUT.tmp3 $OUT

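What should I do about my script? Running it on Linux is really slow. Or should I write a Python script for this? I have never learned Python or Perl, though; I only know a little Linux.

Sorry for my poor English and for not replying promptly. Thanks for all your answers anyway!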

Answer 1

Here is one in awk. The names of the files to process are listed in a file called files (because there are so many of them):

$ cat files
SRR123.ReadsPerGene.out.tab
SRR789.ReadsPerGene.out.tab

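The awk program is meant to be run in the directory that contains the data files (the first .-separated part of each filename is split off and used as the header, so a leading path would make the header names very verbose):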

$ awk '
BEGIN{OFS="\t"}
{
    files[NR]=$0                                # hash filenames from file files
}
END{
    for(i=1;i<=NR;i++) {                        # loop files
        nr=0
        split(files[i],t,".")
        h[nr]=h[nr] OFS t[1]                    # build header
        while((getline < files[i])>0) {         # using getline to read data records
            nr++                                # d[++nr] order not same in all awks
            d[nr]=d[nr] OFS $2                  # append data fields to previous
            if(i==1) {                          # get headers from first file
                h[(refnr=nr)]=$1
            } else if($1!=h[nr]) {              # check that they stay the same
                print "Nonmatching field name"
                exit                            # or exit without output
            }
        }
        if(nr!=refnr) {                         # also record count must be the same
            print "Nonmatching record count"
            exit
        }
        close(files[i])
    }
    for(i=0;i<=refnr;i++)                       # output part
        print h[i] d[i]
}' files

Output:

        SRR123  SRR789
A       45      89O
B       17      129
C       27      60
...     ...     ...

About the comment "d[++nr] order not same in all awks": apparently some awks want d[++nr]=d[nr] OFS $2 and others d[nr]=d[++nr] OFS $2, so a separate nr++ works for both.
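Since the question also asks about Python, the same logic can be sketched with the standard library only. This is a rough equivalent under stated assumptions, not a translation tested on the real data: the function name merge_second_columns is made up for illustration, the files are assumed to be tab-separated, and the two error messages are borrowed from the awk program above.

```python
import os
import tempfile


def merge_second_columns(paths):
    """Return the output lines: one header line, then one line per record."""
    header = [""]   # empty cell above the first (name) column
    names = []      # first-column values, taken from the first file
    rows = []       # rows[k] collects column 2 of record k from every file
    for i, path in enumerate(paths):
        # Header cell: the part of the base name before the first "."
        header.append(os.path.basename(path).split(".")[0])
        count = 0
        with open(path) as handle:
            for line in handle:
                fields = line.rstrip("\n").split("\t")
                if i == 0:
                    # The first file defines the reference names.
                    names.append(fields[0])
                    rows.append([])
                elif count >= len(names):
                    raise ValueError("Nonmatching record count: " + path)
                elif fields[0] != names[count]:
                    raise ValueError("Nonmatching field name: " + path)
                rows[count].append(fields[1])
                count += 1
        if count != len(names):
            raise ValueError("Nonmatching record count: " + path)
    lines = ["\t".join(header)]
    for name, values in zip(names, rows):
        lines.append("\t".join([name] + values))
    return lines


if __name__ == "__main__":
    # Tiny demo on two throwaway files shaped like the question's sample.
    with tempfile.TemporaryDirectory() as tmp:
        first = os.path.join(tmp, "SRR123.ReadsPerGene.out.tab")
        second = os.path.join(tmp, "SRR789.ReadsPerGene.out.tab")
        with open(first, "w") as out:
            out.write("A\t45\t67\t78\nB\t17\t40\t23\n")
        with open(second, "w") as out:
            out.write("A\t89\t90\t34\nB\t129\t96\t45\n")
        print("\n".join(merge_second_columns([first, second])))
```

Like the awk version, each input file is read exactly once and nothing is rewritten on disk inside the loop. For the real data, the list of paths could be read from the file files and the returned lines written to output.txt.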

Update:

If the files are in a different path and the filenames in the file files do not include that path, replace:

split(files[i],t,".")
...
while((getline < files[i])>0) {

with

file="/home/shared/maize/bam_rsem/" files[i]
split(file,t,".")
...
while((getline < file)>0) {

and

close(files[i])

with

close(file)
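As noted earlier, splitting a name that carries a leading path makes the header verbose. One way around that, sketched here in Python rather than awk, is to open the full path but derive the header from the bare filename. The helper name resolve is hypothetical; the directory is the one from the question.

```python
from pathlib import PurePosixPath


def resolve(data_dir, name):
    """Return (path to open, header to print) for one entry of the file list."""
    path = PurePosixPath(data_dir) / name   # directory + bare filename
    header = name.split(".")[0]             # split the filename, not the path
    return str(path), header


path, header = resolve("/home/shared/maize/bam_rsem", "SRR123.ReadsPerGene.out.tab")
print(path)
print(header)
print(path.split(".")[0])   # what splitting the full path would give instead
```

The last line shows why the distinction matters: splitting the full path keeps the whole directory prefix in the header.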

Answer 2

Try this and let me know in the comments on this answer whether it works.

import glob

import pandas as pd

# Pattern and reference file follow the naming scheme given in the question.
files = sorted(glob.glob("*.ReadsPerGene.out.tab"))

drop_files = dict()  # files whose first column differs from the reference
keep_files = dict()  # second column of every file whose first column matches
ref_file_name = 'SRR123.ReadsPerGene.out.tab'
df_ref_file = pd.read_csv(ref_file_name, sep='\t', header=None)
for filename in files:
    df_file = pd.read_csv(filename, sep='\t', header=None)
    # header=None gives integer column labels, so use 0 and 1, not '0' and '1'.
    # Series.equals() returns one bool; `!=` compares element-wise and cannot
    # be used directly in an `if`.
    if not df_ref_file[0].equals(df_file[0]):
        drop_files[filename] = df_file[0].tolist()
    else:
        keep_files[filename] = df_file[1].tolist()

df = pd.DataFrame(keep_files, index=df_ref_file[0])
df.index.names = ['ID']
df.reset_index(inplace=True)
# check the shape of the dataframe
print(df.shape)

Answer 1
2 accepted 2019-11-30 07:32:40

Answer 2
0 2019-11-30 05:00:07
