Index multiple vectors into table in R

I have three vectors:

position <- c(13, 13, 24, 20, 24, 6, 13)
my_string_allele <- c("T>A", "T>A", "G>C", "C>A", "A>G", "A>G", "G>T")
position_ref <- c("12006", "1108", "13807", "1970", "9030", "2222", "4434")

I want to create a table, sorted from the smallest position, as shown below. For each position, I want to count the occurrences of each my_string_allele value and list the corresponding position_ref values in an adjacent position_ref column. What would be the simplest way to do this?

position  T>A  position_ref  G>C  position_ref  C>A  position_ref  A>G  position_ref  G>T  position_ref
6                                                                  1    2222
13        2    12006, 1108                                                            1    4434
20                                              1    1970
24                           1    13807                            1    9030

Here is an approach that uses spread() to reshape the data into wide format, followed by mutate_all() to count the number of occurrences in each cell.

Data

library(tidyverse)
# Combine the three vectors into one data frame, keeping strings as character
df <- data.frame(position, my_string_allele, position_ref, stringsAsFactors = FALSE)

Code

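df %>% group_by(position, my_string_allele) %>%
  # Collapse all position_ref values of a (position, allele) pair into one string
  mutate(position_ref = paste(position_ref, collapse = ", ")) %>% 
  distinct() %>%
  # One column per allele; the result stays grouped by position
  spread(my_string_allele, position_ref) %>%
  # Count the entries per cell; list(N = ~ ...) replaces funs(), deprecated since dplyr 0.8.0
  mutate_all(list(N = ~ if_else(is.na(.), NA_integer_, lengths(str_split(., ", ")))))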

Output

  position `A>G` `C>A` `G>C` `G>T` `T>A`       `A>G_N` `C>A_N` `G>C_N` `G>T_N` `T>A_N`
     <dbl> <chr> <chr> <chr> <chr> <chr>         <int>   <int>   <int>   <int>   <int>
1        6 2222  NA    NA    NA    NA                1      NA      NA      NA      NA
2       13 NA    NA    NA    4434  12006, 1108      NA      NA      NA       1       2
3       20 NA    1970  NA    NA    NA               NA       1      NA      NA      NA
4       24 9030  NA    13807 NA    NA                1      NA       1      NA      NA

(You can sort the columns by their column names to get the output you show in the question.)
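
For illustration, the column reordering could be sketched as below. This is only a sketch under my own assumptions: it swaps in tidyr::pivot_wider() (with names_glue) for the spread()/mutate_all() steps above, and the names allele_order, col_order, wide, N and ref are made up for this example.

```r
library(dplyr)
library(tidyr)

position <- c(13, 13, 24, 20, 24, 6, 13)
my_string_allele <- c("T>A", "T>A", "G>C", "C>A", "A>G", "A>G", "G>T")
position_ref <- c("12006", "1108", "13807", "1970", "9030", "2222", "4434")
df <- data.frame(position, my_string_allele, position_ref, stringsAsFactors = FALSE)

# Alleles in order of first appearance, matching the header in the question
allele_order <- unique(df$my_string_allele)

wide <- df %>%
  group_by(position, my_string_allele) %>%
  # One row per (position, allele): the count and the collapsed references
  summarise(N = n(),
            ref = paste(position_ref, collapse = ", "),
            .groups = "drop") %>%
  # Two value columns per allele, named e.g. "T>A_N" and "T>A_ref"
  pivot_wider(names_from = my_string_allele,
              values_from = c(N, ref),
              names_glue = "{my_string_allele}_{.value}")

# Interleave the count and reference columns allele by allele
col_order <- as.vector(rbind(paste0(allele_order, "_N"),
                             paste0(allele_order, "_ref")))
wide <- wide[, c("position", col_order)]
```

Empty cells come out as NA rather than blank; if blanks are preferred for display, they can be replaced afterwards.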

Full disclosure: I am adapting part of @DarrenTsai's answer to data.table so that the number of occurrences is included in each cell as well. Using data.table:

library(data.table)
library(magrittr)  # provides %>%, which data.table does not export

df <- data.frame(position, my_string_allele, position_ref, stringsAsFactors = FALSE)

setDT(df)

df[, `:=`(position_ref = paste(.N, paste(position_ref, collapse = ", "))),
    by = c("position", "my_string_allele")] %>% 
  unique(., by = c("position", "my_string_allele", "position_ref")) %>% 
  dcast(position ~ my_string_allele, value.var = "position_ref")

Result:

   position    A>G    C>A     G>C    G>T           T>A
1:        6 1 2222   <NA>    <NA>   <NA>          <NA>
2:       13   <NA>   <NA>    <NA> 1 4434 2 12006, 1108
3:       20   <NA> 1 1970    <NA>   <NA>          <NA>
4:       24 1 9030   <NA> 1 13807   <NA>          <NA>
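
If you would rather avoid the pipe and the by-reference update, the same table can probably be produced with dcast() alone by passing an aggregation function. This is a hedged sketch, not a tested replacement for the code above; dt, count_and_refs and res are names invented for this example.

```r
library(data.table)

position <- c(13, 13, 24, 20, 24, 6, 13)
my_string_allele <- c("T>A", "T>A", "G>C", "C>A", "A>G", "A>G", "G>T")
position_ref <- c("12006", "1108", "13807", "1970", "9030", "2222", "4434")
dt <- data.table(position, my_string_allele, position_ref)

# Hypothetical helper: "<count> <ref1, ref2, ...>" for a non-empty cell.
# dcast() also calls it on a zero-length vector to get the fill value
# for empty cells, so that case returns NA.
count_and_refs <- function(x) {
  if (length(x) == 0L) return(NA_character_)
  paste(length(x), paste(x, collapse = ", "))
}

res <- dcast(dt, position ~ my_string_allele,
             value.var = "position_ref",
             fun.aggregate = count_and_refs)
```

Here dt itself is left unchanged, so it can be reused afterwards.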

With dplyr (largely based on @DarrenTsai's answer, so please upvote his as well):

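library(dplyr)

# Rebuild df first: the data.table `:=` call above modified it by reference
df <- data.frame(position, my_string_allele, position_ref, stringsAsFactors = FALSE)
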
df %>% group_by(position, my_string_allele) %>%
  mutate(position_ref = paste(n(), paste(position_ref, collapse = ", "))) %>%
  distinct() %>%
  tidyr::spread(my_string_allele, position_ref)
