Concatenating multiple DNA sequence text files in Python or R?

Question

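I would like to know how to concatenate exon/DNA fasta files using Python or R.

Example files:

So far, I really like using the R ape package for its cbind method, simply because of the fill.with.gaps=TRUE argument. I really need gaps to be inserted when a species is missing an exon.

My code: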

ex1 <- read.dna("exon1.txt", format="fasta")
ex2 <- read.dna("exon2.txt", format="fasta")
output <- cbind(ex1, ex2, fill.with.gaps=TRUE)
write.dna(output, "Output.txt", format="fasta")

Example:

exon1.txt

>sp1
AAAA 
>sp2
CCCC

exon2.txt

>sp1
AGG-G
>sp2
CTGAT
>sp3
CTTTT

Output file:

>sp1
AAAAAGG-G
>sp2
CCCCCTGAT
>sp3
----CTTTT

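So far, I have had trouble applying this technique when I have multiple exon files (I am trying to figure out a loop that opens every file in the directory ending in .fa and runs the cbind method on them). Sometimes the exons within a file are not all the same length, so DNAbin stops working.

So far I have: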

file_list <- list.files(pattern=".fa") 

myFunc <- function(x) {
   for (file in file_list) {
     x <- read.dna(file, format="fasta")
     out <- cbind(x, fill.with.gaps=TRUE)
     write.dna(out, "Output.txt", format="fasta")
   }
}

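However, when I run this and check my output text file, many exons are missing. I think this is because not all files have the same exon length... or my script is failing somewhere and I can't figure out where :(

Any ideas? I can also try Python.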

Answer 1

If you prefer a Linux one-liner, you can use

      cat exon1.txt exon2.txt > outfile

If you only want the unique records in outfile, use

      awk '/^>/{f=!d[$1];d[$1]=1}f' outfile > sorted_outfile
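For readers who would rather stay in Python, here is a minimal sketch of what the awk filter above appears to do: it keeps the first record seen for each header (compared on the first whitespace-delimited field, like awk's `$1`) and drops later records with the same header. The function name `unique_fasta_records` and the sample data are made up for illustration.

```python
def unique_fasta_records(lines):
    """Keep only the first FASTA record for each header."""
    seen = set()
    keep = False  # lines before the first header are dropped, as in the awk filter
    kept = []
    for line in lines:
        if line.startswith(">"):
            key = line.split()[0]  # header up to the first whitespace, like awk's $1
            keep = key not in seen
            seen.add(key)
        if keep:
            kept.append(line)
    return kept


# Hypothetical content of a file produced by `cat exon1.txt exon2.txt`
records = [">sp1\n", "AAAA\n", ">sp2\n", "CCCC\n", ">sp1\n", "AGG-G\n"]
print("".join(unique_fasta_records(records)), end="")
```

Note that this filter discards the second sp1 record rather than joining it to the first, so it removes duplicates but does not place exons side by side per species.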

Answer 2

I just came up with this answer in Python 3:

def read_fasta(fasta): #Function that reads the files
  output = {}
  for line in fasta.split("\n"):
    line = line.strip()
    if not line:
      continue
    if line.startswith(">"):
      active_sequence_name = line[1:]
      if active_sequence_name not in output:
        output[active_sequence_name] = []
      continue
    sequence = line
    output[active_sequence_name].append(sequence)
  return output

with open("exon1.txt", 'r') as file: # read exon1.txt
  file1 = read_fasta(file.read())
with open("exon2.txt", 'r') as file: # read exon2.txt
  file2 = read_fasta(file.read())

finaldict = {}  # concatenate the contents of both files
for i in list(file1.keys()) + list(file2.keys()):
  if i not in file1:  # species missing from exon1: fill with gaps
    file1[i] = ["-" * len("".join(file2[i]))]
  if i not in file2:  # species missing from exon2: fill with gaps
    file2[i] = ["-" * len("".join(file1[i]))]
  finaldict[i] = file1[i] + file2[i]

with open("output.txt", 'w') as file:  # write the result to output.txt
  for k, i in finaldict.items():
    file.write(">{}\n{}\n".format(k, "".join(i)))  # one FASTA record per species

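It is hard to comment on and explain it fully, and it may not help you, but it is better than nothing :P

I used Łukasz Rogalski's code from the answer to "Reading a fasta file format into Python dict".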
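The same idea can probably be extended from two files to every .fa file in a directory, which is what the question asks for. The following is only a sketch under some assumptions: each file is treated as one exon alignment, a species missing from a file is padded with gaps as wide as that file's sequences, and a file whose own sequences differ in length is reported with an error instead of being padded. The names `parse_fasta`, `concatenate_alignments` and `concatenate_files` are hypothetical helpers, not part of any library.

```python
import glob


def parse_fasta(text):
    """Return {name: sequence}, joining sequences wrapped over several lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records.setdefault(name, "")
        elif name is not None:
            records[name] += line
    return records


def concatenate_alignments(alignments):
    """Join alignments side by side, padding missing species with '-'."""
    names = []  # species in order of first appearance
    for aln in alignments:
        for name in aln:
            if name not in names:
                names.append(name)
    combined = {name: "" for name in names}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) > 1:
            raise ValueError("sequences within one file differ in length")
        width = lengths.pop() if lengths else 0
        for name in names:
            combined[name] += aln.get(name, "-" * width)
    return combined


def concatenate_files(pattern="*.fa", out_path="Output.txt"):
    """Read every file matching pattern (sorted by name) and write the result."""
    alignments = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as handle:
            alignments.append(parse_fasta(handle.read()))
    combined = concatenate_alignments(alignments)
    with open(out_path, "w") as out:
        for name, seq in combined.items():
            out.write(">{}\n{}\n".format(name, seq))
    return combined


# In-memory check using the two example files from the question
exon1 = parse_fasta(">sp1\nAAAA\n>sp2\nCCCC\n")
exon2 = parse_fasta(">sp1\nAGG-G\n>sp2\nCTGAT\n>sp3\nCTTTT\n")
print(concatenate_alignments([exon1, exon2]))
# {'sp1': 'AAAAAGG-G', 'sp2': 'CCCCCTGAT', 'sp3': '----CTTTT'}
```

Calling `concatenate_files()` in the directory holding the exon files would then write the combined records to Output.txt. Raising an error for a file with unequal sequence lengths is a design choice made here so that the offending file can be found and fixed; padding such files silently would hide the problem the question describes.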

2 answers

Answer 1
1 2017-06-12 11:49:40

Answer 2
0 Accepted 2017-06-09 22:05:43