Select all samples with a specified base at a specific position

Question

I am new to R programming and am trying to accomplish a very specific task.

I have a FASTA file of n sample sequences, which I read in with ape:

library(ape)

matrix <- read.dna(myfasta, format="fasta", as.character=TRUE)

This creates a matrix that looks like this:

|    | V1 | V2 | V3 | V4 |...
|------------------------|
|Seq1|  a |  t |  g |  c |...
|Seq2|  a |  t |  g |  a |...
|Seq3|  a |  t |  c |  c |...
|Seq4|  t |  t |  g |  a |...
|... |

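where Seq(n) is the DNA sequence of each sample and V(n) is the nucleotide position.

How can I select the sequences that carry a given nucleotide (e.g. "a") at a given position (e.g. "V1"), and then return the names of those sequences as a concatenated string?

So for position V1 I want something like "Seq1, Seq2, Seq3", and for position V4, for the same base, I want "Seq2, Seq4".

I have tried which() and filter(matrix, V1 == "a"), but I am struggling.

Thanks in advance!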

Answer 1

The simplest way is to find the rows where V1 == "a" with logical indexing and take their row names:

rownames(example)[example[, "V1"] == "a"] # "No304" "No306"
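The question also asks for the matching names as one joined string. A minimal sketch of that step, assuming a character matrix shaped like the one in the question. The helper name `seqs_with_base` and the toy matrix `m` are made up for illustration and are not part of ape:

```r
# Toy character matrix mirroring the table in the question (illustrative data).
m <- matrix(
  c("a", "t", "g", "c",
    "a", "t", "g", "a",
    "a", "t", "c", "c",
    "t", "t", "g", "a"),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("Seq", 1:4), paste0("V", 1:4))
)

# Hypothetical helper: names of the rows carrying `base` at `position`,
# joined into a single string.
seqs_with_base <- function(mat, position, base, sep = ", ") {
  hits <- mat[, position] == base      # logical vector, one entry per row
  paste(rownames(mat)[hits], collapse = sep)
}

seqs_with_base(m, "V1", "a")  # "Seq1, Seq2, Seq3"
seqs_with_base(m, "V4", "a")  # "Seq2, Seq4"
```

Indexing rownames(mat) with the logical vector still returns a name when only one row matches, whereas subsetting the matrix first would drop a single matching row to a plain vector with no row names.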

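You mentioned filter, which looks like dplyr. Using the tidyverse approach on data where row names matter is a bit awkward, because row names are dropped by default.

If you want to use filter, you must first save the row names as an explicit column: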

library(dplyr)

as.data.frame(example) %>% 
  mutate(sequence = rownames(.), .before = everything()) %>% 
  filter(V1 == "a") %>% 
  select(sequence)

  sequence
1    No304
2    No306
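If the goal is the joined string rather than a data frame, the same pipeline can end in pull() and paste(). This is only a sketch: it swaps in tibble's rownames_to_column() for the mutate() step above, and the toy matrix `m` stands in for the output of read.dna:

```r
library(dplyr)
library(tibble)

# Toy matrix standing in for the output of read.dna (illustrative data).
m <- matrix(
  c("n", "t", "t", "c",
    "a", "t", "t", "c",
    "a", "t", "t", "c"),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("No305", "No304", "No306"), paste0("V", 1:4))
)

joined <- as.data.frame(m) %>%
  rownames_to_column(var = "sequence") %>%  # keep row names as a real column
  filter(V1 == "a") %>%
  pull(sequence) %>%                        # plain character vector of names
  paste(collapse = ", ")

joined  # "No304, No306"
```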

Data (from the ape read.dna documentation)

library(ape)

cat(">No305",
    "NTTCGAAAAACACACCCACTACTAAAANTTATCAGTCACT",
    ">No304",
    "ATTCGAAAAACACACCCACTACTAAAAATTATCAACCACT",
    ">No306",
    "ATTCGAAAAACACACCCACTACTAAAAATTATCAATCACT",
    file = "exdna.fas", sep = "\n")

example <- read.dna("exdna.fas", format = "fasta", as.character = TRUE)
colnames(example) <- paste0("V", 1:ncol(example))

example
      V1  V2  V3  V4 ...
No305 "n" "t" "t" "c"
No304 "a" "t" "t" "c"
No306 "a" "t" "t" "c"
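To answer the same query for every position at once, the comparison can be done on the whole matrix and summarised column by column. A rough sketch, assuming a character matrix with row and column names. The toy matrix `m` mirrors the table in the question, and the names `hits` and `by_position` are illustrative:

```r
# Toy character matrix mirroring the table in the question (illustrative data).
m <- matrix(
  c("a", "t", "g", "c",
    "a", "t", "g", "a",
    "a", "t", "c", "c",
    "t", "t", "g", "a"),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("Seq", 1:4), paste0("V", 1:4))
)

hits <- m == "a"  # logical matrix with the same shape and dimnames as m

# For each column (position), join the names of the rows that matched.
by_position <- apply(hits, 2, function(hit) {
  paste(rownames(m)[hit], collapse = ", ")
})

by_position[["V1"]]  # "Seq1, Seq2, Seq3"
by_position[["V4"]]  # "Seq2, Seq4"
```

Positions where no sequence carries the base come back as an empty string.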

Answer 1 (2, accepted, 2020-12-30 16:35:54)
