简体   繁体   中英

Select all samples with a specified base at a particular position

I'm new to programming in R and trying to do a very specific task.

I have a fasta sequence of n samples, which I read in ape :

library(ape)

matrix <- read.dna(myfasta, format="fasta", as.character=TRUE)

This created a matrix, like so:

|    | V1 | V2 | V3 | V4 |...
|------------------------|
|Seq1|  a |  t |  g |  c |...
|Seq2|  a |  t |  g |  a |...
|Seq3|  a |  t |  c |  c |...
|Seq4|  t |  t |  g |  a |...
|... |

Where Seq(n) is the DNA sequence for each sample, and V(n) denotes nucleotide position.

How can I select the sequences that bear a certain nucleotide (eg "a"), at a certain position (eg "V1"), and then return the sequences as a concatenated string?

So for position V1, I'd want to have something like "Seq1, Seq2, Seq3" and for position V4, for the same base, I'd want to have "Seq2, Seq4"

I've tried which() and filter(matrix, V1 == "a") but I'm struggling.

Thanks in advance!

The simplest way is to select the V1 == 'a' rows with boolean indexing, and then extract rownames :

rownames(example[example[,"V1"] == "a", ]) # "No304" "No306"

You mentioned filter , which looks like dplyr . It's a little more cumbersome to use tidyverse methods to manipulate data for which row names are important, as row names are dropped by default.

If you wanted to use filter , you'd have to first save the row names as an explicit column:

library(dplyr)

as.data.frame(example) %>% 
  mutate(sequence = rownames(.), .before = everything()) %>% 
  filter(V1 == "a") %>% 
  select(sequence)

  sequence
1    No304
2    No306

Data (from ape read.dna docs )

library(ape)

cat(">No305",
    "NTTCGAAAAACACACCCACTACTAAAANTTATCAGTCACT",
    ">No304",
    "ATTCGAAAAACACACCCACTACTAAAAATTATCAACCACT",
    ">No306",
    "ATTCGAAAAACACACCCACTACTAAAAATTATCAATCACT",
    file = "exdna.fas", sep = "\n")

example <- read.dna("exdna.fas", format = "fasta", as.character = TRUE)
colnames(example) <- paste0("V", 1:ncol(example))

example
      V1  V2  V3  V4 ...
No305 "n" "t" "t" "c"
No304 "a" "t" "t" "c"
No306 "a" "t" "t" "c"

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM