I have a data frame, df, with genome data. The final column holds a two-letter variant.
id crm pos allele
160841 rs2237282 11 1273948 AG
160842 rs6417577 11 1276796 AC
165677 rs2151342 11 1199626 GT
165678 rs2749240 11 1258025 AG
I would like to split the final column into two columns of one letter each:
id crm pos allele allele2
160841 rs2237282 11 1273948 A G
160842 rs6417577 11 1276796 A C
165677 rs2151342 11 1199626 G T
165678 rs2749240 11 1258025 A G
I have tried, without success, in RStudio 1.1.419, R 3.4.3 using dplyr and tidyr:
How do I end up with the desired split?
Using base R:
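# Prototype data frame: names and types of the two captured columns
proto <- data.frame(A1 = character(), A2 = character())
# Each "(.)" captures one letter; the question's data frame is called df
cbind(df, strcapture("(.)(.)", df$allele, proto))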
id crm pos allele A1 A2
160841 rs2237282 11 1273948 AG A G
160842 rs6417577 11 1276796 AC A C
165677 rs2151342 11 1199626 GT G T
165678 rs2749240 11 1258025 AG A G
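To try the base R approach without the original data, the four sample rows can be rebuilt by hand. This is only a sketch: it assumes the leading numbers in the printout are row names rather than a column, and the names proto and result are illustrative.

```r
# Rebuild the four sample rows from the question.
# Assumption: 160841, 160842, ... are row names, not a data column.
df <- data.frame(
  id = c("rs2237282", "rs6417577", "rs2151342", "rs2749240"),
  crm = c(11, 11, 11, 11),
  pos = c(1273948, 1276796, 1199626, 1258025),
  allele = c("AG", "AC", "GT", "AG"),
  row.names = c("160841", "160842", "165677", "165678"),
  stringsAsFactors = FALSE
)

# Prototype giving the names and types of the captured columns.
proto <- data.frame(A1 = character(), A2 = character(),
                    stringsAsFactors = FALSE)

# Each "(.)" group captures one character of the two-letter code.
result <- cbind(df, strcapture("(.)(.)", df$allele, proto))
print(result)
```

strcapture() has been part of the utils package since R 3.4.0, so this should run on the R 3.4.3 setup mentioned in the question without extra packages.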
In separate, the sep argument can be numeric and denotes the character positions at which to split, so:
separate(df, allele, into = c("allele1", "allele2"), sep = 1)
giving:
id crm pos allele1 allele2
160841 rs2237282 11 1273948 A G
160842 rs6417577 11 1276796 A C
165677 rs2151342 11 1199626 G T
165678 rs2749240 11 1258025 A G
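As a self-contained illustration of positional splitting, the sketch below rebuilds the sample rows (assuming the leading numbers in the printout are row names) and splits after the first character. The second data frame, codes, is hypothetical and only shows that sep can hold several positions, one fewer than the number of new columns.

```r
library(tidyr)

# Sample rows from the question; row names are an assumption.
df <- data.frame(
  id = c("rs2237282", "rs6417577", "rs2151342", "rs2749240"),
  crm = c(11, 11, 11, 11),
  pos = c(1273948, 1276796, 1199626, 1258025),
  allele = c("AG", "AC", "GT", "AG"),
  row.names = c("160841", "160842", "165677", "165678"),
  stringsAsFactors = FALSE
)

# sep = 1 splits each string after its first character.
split_df <- separate(df, allele, into = c("allele1", "allele2"), sep = 1)
print(split_df)

# Hypothetical three-letter codes: two split positions give three columns.
codes <- data.frame(code = c("AGT", "CCA"), stringsAsFactors = FALSE)
split_codes <- separate(codes, code,
                        into = c("first", "second", "third"),
                        sep = c(1, 2))
print(split_codes)
```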
library(tidyverse)
df %>%
  mutate(allele2 = substr(allele, 2, 2)) %>%
  mutate(allele = substr(allele, 1, 1))
In addition to separate, extract is another option from the tidyr package. This can be achieved by specifying the capturing groups in the regex argument.
library(tidyr)
df %>%
  extract(allele, into = c("allele1", "allele2"), regex = "([ATCG])([ATCG])")
# id crm pos allele1 allele2
# 160841 rs2237282 11 1273948 A G
# 160842 rs6417577 11 1276796 A C
# 165677 rs2151342 11 1199626 G T
# 165678 rs2749240 11 1258025 A G
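One practical difference from a purely positional split is that the regex restricts which letters are accepted. The sketch below uses a small hypothetical data frame, calls, with one malformed code to show how extract fills the new columns with NA when the pattern does not match. The function is called as tidyr::extract to avoid clashes with other packages that export an extract function.

```r
library(tidyr)

# Hypothetical input: the second code contains a character outside A/T/C/G.
calls <- data.frame(
  id = c("s1", "s2", "s3"),
  allele = c("AG", "A-", "GT"),
  stringsAsFactors = FALSE
)

# Rows whose value does not match both capture groups get NA in the new columns.
out <- tidyr::extract(calls, allele,
                      into = c("allele1", "allele2"),
                      regex = "([ATCG])([ATCG])")
print(out)
```

Checking the result with is.na(out$allele1) is one way to spot values that did not fit the expected two-letter form.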