
Separate two letter string in 4th column

I have a data frame, df, containing genome data. The final column holds a two-letter variant.

               id crm     pos allele
160841  rs2237282  11 1273948     AG
160842  rs6417577  11 1276796     AC
165677  rs2151342  11 1199626     GT
165678  rs2749240  11 1258025     AG

I would like to split the final column into two columns of one letter each:

               id crm     pos allele allele2
160841  rs2237282  11 1273948     A       G
160842  rs6417577  11 1276796     A       C
165677  rs2151342  11 1199626     G       T
165678  rs2749240  11 1258025     A       G

I have tried the following, without success, in RStudio 1.1.419 with R 3.4.3, using dplyr and tidyr:

  • separate(df, allele, into=c("allele", "allele2"))
  • separate(df, allele, into=c("allele", "allele2"), sep="")
  • separate(df, allele, into=c("allele", "allele2"), sep="\\c")
  • separate(df, allele, into=c("allele", "allele2"), sep=".")
  • separate(df, allele, into=c("allele", "allele2"), sep=.)
  • separate(df, allele, into=c("allele", "allele2"), sep=\\c)

How do I end up with the desired split?

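Using base R:

# Prototype data frame: names and types of the two captured columns
HERE=data.frame(A1=character(),A2=character())
# Each (.) captures one letter; bind the captures onto the original df
cbind(df,strcapture("(.)(.)",df$allele,HERE))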
              id crm     pos allele A1 A2
160841 rs2237282  11 1273948     AG  A  G
160842 rs6417577  11 1276796     AC  A  C
165677 rs2151342  11 1199626     GT  G  T
165678 rs2749240  11 1258025     AG  A  G
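
For a version that can be pasted and run as-is, here is a minimal sketch. It rebuilds the question's data frame from the values shown above (row names omitted) and drops the original column after the split. The names proto, parts, result, allele1 and allele2 are illustrative choices, not part of the original answer.

```r
# Rebuild the data frame from the question (values copied from the post).
df <- data.frame(
  id     = c("rs2237282", "rs6417577", "rs2151342", "rs2749240"),
  crm    = c(11L, 11L, 11L, 11L),
  pos    = c(1273948L, 1276796L, 1199626L, 1258025L),
  allele = c("AG", "AC", "GT", "AG"),
  stringsAsFactors = FALSE
)

# Prototype: an empty data frame whose column names and types
# describe what each capture group should become.
proto <- data.frame(allele1 = character(), allele2 = character(),
                    stringsAsFactors = FALSE)

# One capture group per letter; the anchors require exactly two characters.
parts <- strcapture("^(.)(.)$", df$allele, proto)

# Keep every column except the original allele, then append the two new ones.
result <- cbind(df[names(df) != "allele"], parts)
print(result)
```

strcapture is part of the utils package and, as far as I know, needs R 3.4.0 or later, which matches the version mentioned in the question.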

In separate, the sep argument can also be numeric, in which case it gives the character position at which to split. So:

separate(df, allele, into = c("allele1", "allele2"), sep = 1)

giving:

              id crm     pos allele1 allele2
160841 rs2237282  11 1273948       A       G
160842 rs6417577  11 1276796       A       C
165677 rs2151342  11 1199626       G       T
165678 rs2749240  11 1258025       A       G
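
To make the positional meaning concrete, here is a rough base R sketch of what splitting after character position 1 amounts to. The helper split_at_position is hypothetical, written only for illustration; it is not how tidyr implements separate.

```r
# Hypothetical helper: split each string after character position `pos`.
split_at_position <- function(x, pos) {
  data.frame(
    left  = substr(x, 1, pos),             # characters 1..pos
    right = substr(x, pos + 1, nchar(x)),  # everything after pos
    stringsAsFactors = FALSE
  )
}

alleles <- c("AG", "AC", "GT", "AG")
pieces <- split_at_position(alleles, 1)
print(pieces)
```

With two-letter inputs and pos = 1, each piece is a single letter, which is why sep = 1 gives the desired result here.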
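
Another option is substr inside mutate:

library(tidyverse)

df %>%
    # Take the second letter first, while allele still holds both letters
    mutate(allele2 = substr(allele, 2, 2)) %>%
    mutate(allele = substr(allele, 1, 1))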

Besides separate, tidyr also provides extract. It splits a column according to the capturing groups given in its regex argument.

library(tidyr)

df %>%
  extract(allele, into = c("allele1", "allele2"), regex = "([ATCG])([ATCG])")
#               id crm     pos allele1 allele2
# 160841 rs2237282  11 1273948       A       G
# 160842 rs6417577  11 1276796       A       C
# 165677 rs2151342  11 1199626       G       T
# 165678 rs2749240  11 1258025       A       G
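
A side effect of the [ATCG] character class is that it also validates the input: a value that is not exactly two nucleotide letters does not match, and to my understanding extract then fills both new columns with NA. The base R sketch below imitates that behaviour; the value "AN" is made up for illustration and does not appear in the question's data.

```r
# "AN" is a made-up invalid value, added only to show the non-matching case.
alleles <- c("AG", "AC", "GT", "AN")

# TRUE only for strings made of exactly two letters from A, T, C, G.
valid <- grepl("^[ATCG][ATCG]$", alleles)

# Split valid strings by position; mark invalid ones as missing.
allele1 <- ifelse(valid, substr(alleles, 1, 1), NA_character_)
allele2 <- ifelse(valid, substr(alleles, 2, 2), NA_character_)

print(data.frame(alleles, allele1, allele2, stringsAsFactors = FALSE))
```

If any other codes can occur in the data, the character class would need widening, or a plain "(.)(.)" pattern could be used instead.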
