Extract specific column from files of a directory into a new file
I have a collection of 11064 files, all with the same file extension, ReadsPerGene.out.tab. They are all in one directory. Every file has 556 lines and 4 columns.
Filenames look like this:
SRR123.ReadsPerGene.out.tab
SRR456.ReadsPerGene.out.tab
SRR555.ReadsPerGene.out.tab
DRR789.ReadsPerGene.out.tab
...
The files look like this (SRR123.ReadsPerGene.out.tab on the left, SRR789.ReadsPerGene.out.tab on the right):
A 45 67 78     A 89O 90 34
B 17 40 23     B 129 96 45
C 27 50 19     C 60 56 91
... ... ... ...    ... ... ... ...
First, I want to check whether the first column is the same across all of the files. If it is, I want to create an output.txt file with 665 lines and 11065 columns. Its 1st column is the first column of every file (since they are all the same), and the 2nd through 11065th columns of output.txt are the 2nd column of each input file. I also want to add the corresponding filename as the first line of every column.
The output.txt looks like this:
SRR123 SRR789 SRR456 ...
A 45 89O 66 ...
B 17 129 480 ...
C 27 60 78 ...
... ... ... ... ...
The following are my attempts so far.

1. Get all filenames
#!/bin/bash
filepath=/home/shared/maize/bam_rsem
cd "${filepath}"
for file in *.ReadsPerGene.out.tab
do
    echo "$file" >> ~/filename.txt
done
2. Collect the first column of every file into one file
#!/bin/bash
cd ~
OUT=result2.txt
: > "$OUT"
filepath=/home/shared/maize/bam_rsem/
for file in $(cat filename.txt)
do
    cut -f 1 "${filepath}${file}" | sed "1i ${file}" > "$OUT.tmp1"
    paste "$OUT" "$OUT.tmp1" > "$OUT.tmp"
    rm "$OUT.tmp1"
    mv "$OUT.tmp" "$OUT"
done
3. Compare whether the first column of result2.txt is identical to the other columns
I have no idea how to do this yet.
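One possible sketch for step 3, not tied to result2.txt: compare each file's first column directly against the first file's with cut and cmp. The two printf lines below create toy stand-ins for real ReadsPerGene.out.tab files so the snippet is self-contained; point it at the real data directory instead.

```shell
# Sketch for step 3 (assumes tab-separated files; the printf lines
# create toy stand-ins for real *.ReadsPerGene.out.tab files).
cd "$(mktemp -d)"                               # demo dir; use your data dir instead
printf 'A\t45\nB\t17\n' > SRR123.ReadsPerGene.out.tab
printf 'A\t89\nB\t12\n' > SRR789.ReadsPerGene.out.tab
set -- *.ReadsPerGene.out.tab                   # all data files
cut -f 1 "$1" > ref_col1                        # first column of the first file
status=identical
for f in "$@"; do
    # cmp -s silently compares this file's first column with the reference
    cut -f 1 "$f" | cmp -s - ref_col1 || { echo "differs: $f"; status=differs; }
done
echo "$status"                                  # prints "identical" here
```

This only rereads each input file once, so it stays fast even for 11064 files.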
4. Create output.txt
#!/bin/bash
cd ~
OUT=output.txt
: > "$OUT"
filepath=/home/shared/maize/bam_rsem/
for file in $(cat filename.txt)
do
    cut -f 2 "${filepath}${file}" | sed "1i ${file}" > "$OUT.tmp1"
    paste "$OUT" "$OUT.tmp1" > "$OUT.tmp"
    rm "$OUT.tmp1"
    mv "$OUT.tmp" "$OUT"
done
cut -f 1 result2.txt > "$OUT.tmp2"
paste "$OUT.tmp2" "$OUT" > "$OUT.tmp3"
rm "$OUT.tmp2"
mv "$OUT.tmp3" "$OUT"
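The paste-inside-a-loop pattern above rereads the growing result file on every iteration, which makes it quadratic in the number of files. A sketch of a single-pass alternative: write each column to its own file once, then run one final paste. The printf lines create toy stand-ins for real data files.

```shell
# Faster sketch: build each column once, then a single paste at the end.
cd "$(mktemp -d)"                               # demo dir; use your data dir instead
printf 'A\t45\nB\t17\n' > SRR123.ReadsPerGene.out.tab
printf 'A\t89\nB\t12\n' > SRR789.ReadsPerGene.out.tab
set -- *.ReadsPerGene.out.tab
{ echo ID; cut -f 1 "$1"; } > c0                # gene IDs, with a header cell
i=0
for f in "$@"; do
    i=$((i+1))
    # header = part of the filename before the first dot, then column 2
    { echo "${f%%.*}"; cut -f 2 "$f"; } > "c$i"
done
paste c0 c1 c2 > output.txt                     # with many files, list c0 c1 ... cN
head -1 output.txt                              # prints "ID<TAB>SRR123<TAB>SRR789"
```

Each input file is read exactly once, and paste merges all columns in one pass. (A plain `paste c*` glob would sort c10 before c2, so for thousands of files the column files need to be listed in numeric order.)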
What should I do to improve my script? It is really slow to execute in Linux. Or should I write a Python script to handle it? I have never learned Python or Perl, and I only know a little about Linux.
I'm sorry that my English is poor and that I cannot reply promptly. Anyway, thanks for all your answers!
One in awk. The filenames to process are in files (due to the large number of them):
$ cat files
SRR123.ReadsPerGene.out.tab
SRR789.ReadsPerGene.out.tab
The awk program is to be run in the directory with the data files (it splits off the first .-separated part of the filename for the header, i.e. a leading path would make the header name pretty lengthy):
$ awk '
BEGIN{OFS="\t"}
{
    files[NR]=$0                           # hash filenames from file "files"
}
END{
    for(i=1;i<=NR;i++) {                   # loop files
        nr=0
        split(files[i],t,".")
        h[nr]=h[nr] OFS t[1]               # build header
        while((getline < files[i])>0) {    # using getline to read data records
            nr++                           # d[++nr] order not same in all awks
            d[nr]=d[nr] OFS $2             # append data fields to previous
            if(i==1) {                     # get headers from first file
                h[(refnr=nr)]=$1
            } else if($1!=h[nr]) {         # check that they stay the same
                print "Nonmatching field name"
                exit                       # or exit without output
            }
        }
        if(nr!=refnr) {                    # also record count must be the same
            print "Nonmatching record count"
            exit
        }
        close(files[i])
    }
    for(i=0;i<=refnr;i++)                  # output part
        print h[i] d[i]
}' files
Output:
SRR123 SRR789
A 45 89O
B 17 129
C 27 60
... ... ...
d[++nr] order not same in all awks: apparently some awks prefer d[++nr]=d[nr] OFS $2 and some d[nr]=d[++nr] OFS $2, so a separate nr++ works for both.
Update:
If the files are in a different path and the filenames in the file files don't have paths included, replace intelligently:
split(files[i],t,".")
...
while((getline < files[i])>0) {
with:
file="/home/shared/maize/bam_rsem/" files[i]
split(file,t,".")
...
while((getline < file)>0) {
AND:
close(files[i])
with:
close(file)
Try this and let me know in the comments section of this answer whether it worked.
import pandas as pd
import glob

files = sorted(glob.glob("*.log.out", recursive=False))
drop_files = dict()
keep_files = dict()
ref_file_name = 'SRR123.log.out'
df_ref_file = pd.read_csv(ref_file_name, sep='\t', header=None)
for i, filename in enumerate(files):
    df_file = pd.read_csv(filename, sep='\t', header=None)
    # with header=None the columns are integers, and Series need .equals()
    if not df_ref_file[0].equals(df_file[0]):
        drop_files.update({filename: df_file[0].tolist()})
    else:
        keep_files.update({filename: df_file[1].tolist()})
df = pd.DataFrame(keep_files, index=df_ref_file[0])
df.index.names = ['ID']
df.reset_index(inplace=True)
# check the shape of the dataframe
df.shape