Separate two letter string in 4th column

Question

I have a data frame, df, with genome data. The final column holds a two-letter allele string.

               id crm     pos allele
160841  rs2237282  11 1273948     AG
160842  rs6417577  11 1276796     AC
165677  rs2151342  11 1199626     GT
165678  rs2749240  11 1258025     AG
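
For anyone who wants to reproduce this, a minimal sketch that rebuilds the sample as df. It assumes crm and pos are integer columns and that the leading numbers are row names; neither is stated above.

```r
# Sample rows copied from the printout above.
# Assumption: crm and pos are integers, the leading numbers are row names.
df <- data.frame(
  id = c("rs2237282", "rs6417577", "rs2151342", "rs2749240"),
  crm = c(11L, 11L, 11L, 11L),
  pos = c(1273948L, 1276796L, 1199626L, 1258025L),
  allele = c("AG", "AC", "GT", "AG"),
  row.names = c("160841", "160842", "165677", "165678"),
  stringsAsFactors = FALSE
)
print(df)
```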

I would like to split the final column into two columns of one letter each:

               id crm     pos allele allele2
160841  rs2237282  11 1273948     A       G
160842  rs6417577  11 1276796     A       C
165677  rs2151342  11 1199626     G       T
165678  rs2749240  11 1258025     A       G

I have tried the following, without success, in RStudio 1.1.419 with R 3.4.3, using dplyr and tidyr:

separate(df, allele, into=c("allele", "allele2"))
separate(df, allele, into=c("allele", "allele2"), sep="")
separate(df, allele, into=c("allele", "allele2"), sep="\\c")
separate(df, allele, into=c("allele", "allele2"), sep=".")
separate(df, allele, into=c("allele", "allele2"), sep=.)
separate(df, allele, into=c("allele", "allele2"), sep=\\c)

How do I end up with the desired split?

Answer 1

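Using base R:

# Zero-row prototype: its column names and types define the captured output
proto <- data.frame(A1 = character(), A2 = character())
# Each "(.)" captures one letter of allele; bind the two new columns onto df
cbind(df, strcapture("(.)(.)", df$allele, proto))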
              id crm     pos allele A1 A2
160841 rs2237282  11 1273948     AG  A  G
160842 rs6417577  11 1276796     AC  A  C
165677 rs2151342  11 1199626     GT  G  T
165678 rs2749240  11 1258025     AG  A  G

Answer 2

In separate, the sep argument can also be numeric, in which case it gives the character position at which to split:

library(tidyr)

separate(df, allele, into = c("allele1", "allele2"), sep = 1)

giving:

              id crm     pos allele1 allele2
160841 rs2237282  11 1273948       A       G
160842 rs6417577  11 1276796       A       C
165677 rs2151342  11 1199626       G       T
165678 rs2749240  11 1258025       A       G
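
A self-contained sketch of the same call on a two-row stand-in for df (the names demo and out are made up here). It also hints at why sep = "." did not work in the question: a character sep is treated as a regular expression, and "." matches every letter. Base strsplit is used only to illustrate that regex behaviour; the internals of separate may differ.

```r
library(tidyr)

# Hypothetical two-row stand-in for df
demo <- data.frame(id = c("rs2237282", "rs2151342"),
                   allele = c("AG", "GT"),
                   stringsAsFactors = FALSE)

# Numeric sep: split after the first character
out <- separate(demo, allele, into = c("allele1", "allele2"), sep = 1)
print(out)

# Regex ".": both letters are consumed as separators, leaving empty pieces
pieces <- strsplit("AG", ".")[[1]]
print(pieces)
```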

Answer 3

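library(tidyverse)

df %>%
    # Take the second letter first, while allele still holds both letters
    mutate(allele2 = substr(allele, 2, 2)) %>%
    # Then overwrite allele with its first letter
    mutate(allele = substr(allele, 1, 1))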
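
If loading the tidyverse is not wanted, the same substr idea can be sketched in base R with transform. The names demo and out are made up for this example. transform evaluates its arguments against the original columns, so the order of the two assignments should not matter here.

```r
# Hypothetical two-row stand-in for df
demo <- data.frame(id = c("rs2237282", "rs2151342"),
                   allele = c("AG", "GT"),
                   stringsAsFactors = FALSE)

# Both substr() calls see the original two-letter allele column
out <- transform(demo,
                 allele2 = substr(allele, 2, 2),
                 allele  = substr(allele, 1, 1))
print(out)
```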

Answer 4

In addition to separate, extract from the tidyr package is another option. Specify one capturing group per output column in the regex argument.

library(tidyr)

df %>%
  extract(allele, into = c("allele1", "allele2"), regex = "([ATCG])([ATCG])")
#               id crm     pos allele1 allele2
# 160841 rs2237282  11 1273948       A       G
# 160842 rs6417577  11 1276796       A       C
# 165677 rs2151342  11 1199626       G       T
# 165678 rs2749240  11 1258025       A       G
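
One consequence of the restrictive character class is worth a quick sketch: according to the tidyr documentation, rows where the groups do not match come back as NA. The data frame demo and the value "N-" are made up for illustration.

```r
library(tidyr)

# Hypothetical input: the second row is not a pair of A/T/C/G letters
demo <- data.frame(id = c("rs2237282", "rsX"),
                   allele = c("AG", "N-"),
                   stringsAsFactors = FALSE)

out <- extract(demo, allele, into = c("allele1", "allele2"),
               regex = "([ATCG])([ATCG])")
print(out)
```

Using "(.)(.)" instead would accept any two characters, which may or may not be what is wanted for genotype data.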

4 answers

solution1
6 2018-02-05 23:26:09

solution2
5 ACCEPTED 2018-02-05 23:28:05

solution3
1 2018-02-05 23:26:20

solution4
0 2018-02-06 02:21:15
