Concatenating fasta files from different folders

I have a large number of fasta files (these are just text files) in different subfolders. What I need is a way to search through the directories for files that have the same name and concatenate them into a single file named after the input files. I can't do this manually, as I need to do it for 10000+ genes.

So far I have the following Python code, which looks through one of the directories and then uses those file names to search through the other directories. This returns a list that has the full path for each file.

    import os
    from os.path import join, abspath

    path = '/directoryforfilelist/'    # Directory for source list
    listing = os.listdir(path)

    pathlist = []    # Full path of every matching file found
    for x in listing:
        for root, dirs, files in os.walk('/rootdirectorytosearch/'):
            if x in files:
                pathlist.append(abspath(join(root, x)))

Where I am stuck is how to concatenate the files it returns that have the same name. The results from this script look like this:

    /directory1/file1.fasta
    /directory2/file1.fasta
    /directory3/file1.fasta
    /directory1/file2.fasta
    /directory2/file2.fasta
    /directory3/file2.fasta

In this case I would need the end result to be two files, named file1.fasta and file2.fasta, each containing the text from all of the files with that name.

Any leads on where to go from here would be appreciated. Although I did this part in Python, any approach that gets the job done is fine with me. This is being run on a Mac, if that matters.

For each file in your list, open the target file in append mode, read each line of the source file, and write it to the target file.
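The append-mode approach described above could look roughly like the sketch below. The function name `append_by_name` and the `out_dir` parameter are illustrative assumptions, not part of the original post; the sketch also assumes the output folder starts out empty and lies outside the tree being searched, since otherwise existing content would be appended to.

```python
import os

def append_by_name(paths, out_dir):
    """Append each source file to out_dir/<file name>, creating it if needed.

    `paths` is a list of full paths such as the one built in the question;
    `out_dir` is a hypothetical output folder (assumed empty beforehand).
    """
    os.makedirs(out_dir, exist_ok=True)
    for src in paths:
        # The target is named after the source file, without its directory
        target = os.path.join(out_dir, os.path.basename(src))
        # 'a' creates the target on first use and appends on later visits
        with open(src) as fin, open(target, 'a') as fout:
            for line in fin:
                fout.write(line)
```

Because every write is an append, the order of `paths` does not matter for grouping. Note that if a source file does not end with a newline, the first header of the next file will be glued onto its last line.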

This assumes that the target folder is empty to start with and is not inside /rootdirectorytosearch.

Not tested, but here's roughly what I'd do:

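    from itertools import groupby
    import os

    def conc_by_name(names):
        # groupby only merges adjacent items, so sort by file name first
        names = sorted(names, key=os.path.basename)
        for tail, group in groupby(names, key=os.path.basename):
            # One output file per distinct file name, in the current folder
            with open(tail, 'w') as out:
                for name in group:
                    with open(name) as f:
                        out.writelines(f)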
This will create the files (file1.fasta and file2.fasta in your example) in the current folder.
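Since `itertools.groupby` only groups consecutive items, a variant that collects paths in a dictionary keyed by file name works regardless of how the path list is ordered. The following is a minimal sketch under that idea; `concat_grouped` and `out_dir` are hypothetical names introduced here for illustration, and the output is written to a chosen folder rather than the current one.

```python
import os
from collections import defaultdict

def concat_grouped(paths, out_dir):
    """Write one file per distinct file name into out_dir.

    Returns the sorted list of output file names. `out_dir` is a
    hypothetical output folder, assumed to lie outside the searched tree.
    """
    # Map each bare file name to every full path that carries it
    groups = defaultdict(list)
    for p in paths:
        groups[os.path.basename(p)].append(p)

    os.makedirs(out_dir, exist_ok=True)
    for name, sources in groups.items():
        # 'w' truncates, so re-running the function does not duplicate content
        with open(os.path.join(out_dir, name), 'w') as out:
            for src in sources:
                with open(src) as f:
                    out.writelines(f)
    return sorted(groups)
```

Within each output file, the sources appear in the order they occur in `paths`.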
