
Extract specific column from files of a directory into a new file

I have a collection of 11,064 files, all with the same extension, ReadsPerGene.out.tab, in a single directory. Every file has 556 lines and 4 columns.

Filenames look like this:
SRR123.ReadsPerGene.out.tab
SRR456.ReadsPerGene.out.tab
SRR555.ReadsPerGene.out.tab
DRR789.ReadsPerGene.out.tab
...

The files look like this:

SRR123.ReadsPerGene.out.tab           SRR789.ReadsPerGene.out.tab
A    45   67   78                     A    890  90   34
B    17   40   23                     B    129  96   45
C    27   50   19                     C     60  56   91
...  ...  ...  ...                    ...  ...  ...  ...

First, I want to check whether the first column is identical across all the files.

If it is, I want to create an output.txt file with 557 lines (one header line plus the 556 data lines) and 11,065 columns. The first column of output.txt is the shared first column of the files (since they are all the same). Columns 2 through 11,065 are the second columns of the input files, and I want each file's name added as the first line of its column.

The output.txt looks like this:

      SRR123                SRR789              SRR456        ...
A        45                 890                66            ...
B        17                 129                480           ...
C        27                  60                78            ...
...      ...               ...               ...             ...

The following are my attempts so far.

1. Get all the filenames

#!/bin/bash
filepath=/home/shared/maize/bam_rsem
cd "${filepath}"
# use the glob directly instead of parsing ls output
for file in *.ReadsPerGene.out.tab
do
   echo "$file" >> ~/filename.txt
done
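As an aside, the loop can be avoided entirely: a single `printf` with a glob writes one name per line. Below is a minimal, self-contained sketch that uses a temporary directory with two dummy files in place of /home/shared/maize/bam_rsem:

```shell
#!/bin/sh
# Self-contained demo: a throwaway directory with two empty sample files
dir=$(mktemp -d)
touch "$dir/SRR123.ReadsPerGene.out.tab" "$dir/SRR789.ReadsPerGene.out.tab"
cd "$dir"

# One printf call lists every match, one per line, without parsing ls
printf '%s\n' *.ReadsPerGene.out.tab > filename.txt
cat filename.txt
```

With the real data you would `cd /home/shared/maize/bam_rsem` first and redirect to ~/filename.txt as in the script above.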

2. Gather the first column of every file into one file

#!/bin/bash
cd ~
OUT=result2.txt
: > "$OUT"                               # start from an empty file
filepath=/home/shared/maize/bam_rsem/
while read -r file
do
   # first column of the file, with the filename prepended as a header line
   cut -f 1 "${filepath}${file}" | sed "1i ${file}" > "$OUT.tmp1"
   if [ -s "$OUT" ]
   then
      paste "$OUT" "$OUT.tmp1" > "$OUT.tmp"
      mv "$OUT.tmp" "$OUT"
      rm "$OUT.tmp1"
   else
      mv "$OUT.tmp1" "$OUT"              # the first file starts the table
   fi
done < filename.txt

3. Check whether the first column of result2.txt is identical to the other columns

I have no idea how to do this yet.
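One possible approach for step 3 (a sketch, not tested on the real data) is to skip result2.txt and compare the first column of every file against the first file directly with `cmp`. The demo below builds two tiny sample files in a temporary directory in place of /home/shared/maize/bam_rsem:

```shell
#!/bin/sh
# Self-contained demo with two small sample files
dir=$(mktemp -d)
printf 'A\t45\t67\t78\nB\t17\t40\t23\n' > "$dir/SRR123.ReadsPerGene.out.tab"
printf 'A\t890\t90\t34\nB\t129\t96\t45\n' > "$dir/SRR789.ReadsPerGene.out.tab"
cd "$dir"

status=identical
ref=$(ls *.ReadsPerGene.out.tab | head -n 1)   # first file is the reference
cut -f 1 "$ref" > ref_col1.txt
for f in *.ReadsPerGene.out.tab
do
    # cmp -s is silent and only sets the exit status; - reads from stdin
    if ! cut -f 1 "$f" | cmp -s ref_col1.txt -
    then
        echo "first column differs: $f"
        status=different
    fi
done
echo "first columns: $status"
```

On the sample data both first columns are A, B, so this prints "first columns: identical"; any mismatching file would be listed by name.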
4. Create output.txt

#!/bin/bash
cd ~
OUT=output.txt
: > "$OUT"                               # start from an empty file
filepath=/home/shared/maize/bam_rsem/
while read -r file
do
   # second column of each file, with the filename as a header line
   cut -f 2 "${filepath}${file}" | sed "1i ${file}" > "$OUT.tmp1"
   if [ -s "$OUT" ]
   then
      paste "$OUT" "$OUT.tmp1" > "$OUT.tmp"
      mv "$OUT.tmp" "$OUT"
      rm "$OUT.tmp1"
   else
      mv "$OUT.tmp1" "$OUT"              # the first file starts the table
   fi
done < filename.txt

# prepend the shared first column, taken from result2.txt of step 2
cut -f 1 result2.txt > "$OUT.tmp2"
paste "$OUT.tmp2" "$OUT" > "$OUT.tmp3"
rm "$OUT.tmp2"
mv "$OUT.tmp3" "$OUT"

What should I do to improve my script? It is really slow to run on Linux. Or should I write a Python script to handle it? I have never learned Python or Perl, and I only know a little about Linux.
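The loop is slow mainly because every iteration rewrites the entire growing result file with `paste`, so the total work grows quadratically with the number of files. One alternative sketch writes each second column to disk once and pastes everything in a single call (with 11,064 files that final `paste` may hit the shell's argument-length limit, in which case an awk solution is safer). The demo uses two small sample files in a temporary directory in place of the real path:

```shell
#!/bin/sh
# Self-contained demo with two small sample files
dir=$(mktemp -d)
printf 'A\t45\t67\t78\nB\t17\t40\t23\n' > "$dir/SRR123.ReadsPerGene.out.tab"
printf 'A\t890\t90\t34\nB\t129\t96\t45\n' > "$dir/SRR789.ReadsPerGene.out.tab"
cd "$dir"

# One pass per file: its second column, with the sample name as header line
for f in *.ReadsPerGene.out.tab
do
    { echo "${f%%.*}"; cut -f 2 "$f"; } > "$f.col2"
done

# Shared first column from any one file (assuming step 3 confirmed they match)
{ echo "ID"; cut -f 1 SRR123.ReadsPerGene.out.tab; } > ids.col1

# A single paste builds the whole table at once
paste ids.col1 *.col2 > output.txt
cat output.txt
```

Each input file is now read exactly once and output.txt is written exactly once, instead of being rewritten 11,064 times.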

I'm sorry that my English is poor and that I could not reply in time. Anyway, thanks for all your answers!

Here is one in awk. The filenames to process are listed in the file files (because there are so many of them):

$ cat files
SRR123.ReadsPerGene.out.tab
SRR789.ReadsPerGene.out.tab

The awk program is meant to run in the directory that holds the data files (the first .-separated part of each filename is split off for the header; including a leading path would make the header names rather lengthy):

$ awk '
BEGIN{OFS="\t"}
{
    files[NR]=$0                                # hash filenames from file files
}
END{
    for(i=1;i<=NR;i++) {                        # loop files
        nr=0
        split(files[i],t,".")
        h[nr]=h[nr] OFS t[1]                    # build header
        while((getline < files[i])>0) {         # using getline to read data records
            nr++                                # d[++nr] order not same in all awks
            d[nr]=d[nr] OFS $2                  # append data fields to previous
            if(i==1) {                          # get headers from first file
                h[(refnr=nr)]=$1
            } else if($1!=h[nr]) {              # check that they stay the same
                print "Nonmatching field name"
                exit                            # or exit without output
            }
        }
        if(nr!=refnr) {                         # also record count must be the same
            print "Nonmatching record count"
            exit
        }
        close(files[i])
    }
    for(i=0;i<=refnr;i++)                       # output part
        print h[i] d[i]
}' files

Output:

        SRR123  SRR789
A       45      890
B       17      129
C       27      60
...     ...     ...

d[++nr] order not same in all awks: apparently some awks prefer d[++nr]=d[nr] OFS $2 and others d[nr]=d[++nr] OFS $2, so a separate nr++ works for both.

Update:

If the data files are in a different path and the filenames in the file files do not include that path, replace:

split(files[i],t,".")
...
while((getline < files[i])>0) {

with

file="home/shared/maize/bam_rsem/" files[i]
split(file,t,".")
...
while((getline < file)>0) {

AND

close(files[i])

with

close(file)

Try this and let me know in the comments of this answer whether it worked.

import glob

import pandas as pd

files = sorted(glob.glob("*.ReadsPerGene.out.tab", recursive=False))

drop_files = dict()   # files whose first column differs from the reference
keep_files = dict()   # second columns of the files that match
ref_file_name = files[0]  # use the first file as the reference
df_ref_file = pd.read_csv(ref_file_name, sep='\t', header=None)
for filename in files:
    df_file = pd.read_csv(filename, sep='\t', header=None)
    # header=None gives integer column labels, so index with 0 and 1
    if not df_ref_file[0].equals(df_file[0]):
        drop_files.update({filename: df_file[0].tolist()})
    else:
        # first '.'-separated part of the filename becomes the column header
        keep_files.update({filename.split('.')[0]: df_file[1].tolist()})

df = pd.DataFrame(keep_files, index=df_ref_file[0])
df.index.names = ['ID']
df.reset_index(inplace=True)
# check the shape of the dataframe, then write the result
print(df.shape)
df.to_csv('output.txt', sep='\t', index=False)
