
Split colon- and comma-separated strings in a column into different columns in R

data <- data.frame(col1 = c('0/1:60,4,4,4,4:0.044:4:0:1.00:2352,160:32:28', '0/1:58,4,4,4:0.041:4:0:1.00:2304,150:28:30', '0/1:25,2,1:0.095:1:1:0.500:908,78:9:16'))

data
                                          col1
1 0/1:60,4,4,4,4:0.044:4:0:1.00:2352,160:32:28
2   0/1:58,4,4,4:0.041:4:0:1.00:2304,150:28:30
3       0/1:25,2,1:0.095:1:1:0.500:908,78:9:16

Between the first and second colons, there are several numbers separated by commas. I want to know how to separate the first number and the rest into two columns.

data
                                          col1       col2       col3
1 0/1:60,4,4,4,4:0.044:4:0:1.00:2352,160:32:28         60    4,4,4,4
2   0/1:58,4,4,4:0.041:4:0:1.00:2304,150:28:30         58      4,4,4
3       0/1:25,2,1:0.095:1:1:0.500:908,78:9:16         25        2,1 

We can use extract from tidyr. The pattern matches one or more characters that are not a : ( [^:]+ ) from the start ( ^ ) of the string, followed by a : . It then captures the digits ( (\\d+) ), matches a , , and captures a second group of characters that are not a : ( ([^:]+) ), followed by a : and the rest of the characters ( .* ). With convert = TRUE, the captured columns are converted to suitable types, so col2 becomes an integer column.

library(dplyr)
library(tidyr)
data %>%
   extract(col1, into = c('col2', 'col3'),
        '^[^:]+:(\\d+),([^:]+):.*', remove = FALSE, convert = TRUE)

-output

#                                          col1 col2    col3
#1 0/1:60,4,4,4,4:0.044:4:0:1.00:2352,160:32:28   60 4,4,4,4
#2   0/1:58,4,4,4:0.041:4:0:1.00:2304,150:28:30   58   4,4,4
#3       0/1:25,2,1:0.095:1:1:0.500:908,78:9:16   25     2,1

The same regex can be used in base R as well with sub and read.table

data[c('col2', 'col3')] <-  read.table(text =
   sub("^[^:]+:(\\d+),([^:]+):.*", "\\1:\\2", data$col1), header = FALSE, sep=":")
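To make the two steps easier to follow, here is a minimal sketch that runs them separately on the sample data from the question. The names tmp and parts are only illustrative, and V1 and V2 are the default column names that read.table assigns.

```r
# Sample data from the question
data <- data.frame(col1 = c('0/1:60,4,4,4,4:0.044:4:0:1.00:2352,160:32:28',
                            '0/1:58,4,4,4:0.041:4:0:1.00:2304,150:28:30',
                            '0/1:25,2,1:0.095:1:1:0.500:908,78:9:16'))

# Step 1: sub() keeps only the two captured groups, joined by a single ":"
tmp <- sub("^[^:]+:(\\d+),([^:]+):.*", "\\1:\\2", data$col1)
tmp
# [1] "60:4,4,4,4" "58:4,4,4"   "25:2,1"

# Step 2: read.table() splits each string on ":" into two columns (V1, V2)
parts <- read.table(text = tmp, header = FALSE, sep = ":")
str(parts)  # inspect the column types before assigning them to data
```

Because the intermediate strings contain exactly one colon, read.table always returns two columns here.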

Or use strcapture from base R

cbind(data, strcapture("^[^:]+:(\\d+),([^:]+):.*", data$col1, 
      data.frame(col2 = numeric(), col3 = character())))
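One thing worth checking is how these approaches behave when a value does not fit the pattern. The sketch below assumes a hypothetical second value with only one number between the first and second colons (so there is no comma); it is not part of the question's data. sub returns such a string unchanged, while strcapture returns NA for it.

```r
# First value follows the question's layout; the second is a hypothetical
# value with no comma in the second field, so the pattern cannot match it
x <- c('0/1:60,4,4,4,4:0.044:4', '0/1:60:0.044:4')
pat <- "^[^:]+:(\\d+),([^:]+):.*"

# sub() leaves a non-matching string as it is
replaced <- sub(pat, "\\1:\\2", x)
replaced
# [1] "60:4,4,4,4"     "0/1:60:0.044:4"

# strcapture() fills the non-matching row with NA
captured <- strcapture(pat, x,
                       data.frame(col2 = numeric(), col3 = character()))
captured
```

If the real data can contain such values, the strcapture version (or extract) may be the safer choice, since the unchanged string from sub would be split on every colon by read.table.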

Base R option using sub :

transform(data, col2 = sub('\\d+/\\d+:(\\d+),.*', '\\1', col1),
                col3 = sub('\\d+/\\d+:\\d+,(.*?):.*', '\\1', col1))

#                                          col1 col2    col3
#1 0/1:60,4,4,4,4:0.044:4:0:1.00:2352,160:32:28   60 4,4,4,4
#2   0/1:58,4,4,4:0.041:4:0:1.00:2304,150:28:30   58   4,4,4
#3       0/1:25,2,1:0.095:1:1:0.500:908,78:9:16   25     2,1
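Note that sub always returns character strings, so col2 is not numeric with this approach even though it prints like a number. A minimal sketch of one way to handle that, wrapping the call in as.integer (the name res is only illustrative):

```r
# Sample data from the question
data <- data.frame(col1 = c('0/1:60,4,4,4,4:0.044:4:0:1.00:2352,160:32:28',
                            '0/1:58,4,4,4:0.041:4:0:1.00:2304,150:28:30',
                            '0/1:25,2,1:0.095:1:1:0.500:908,78:9:16'))

# sub() on its own gives a character vector
is.character(sub('\\d+/\\d+:(\\d+),.*', '\\1', data$col1))
# [1] TRUE

# Convert inside transform() so that col2 is stored as integer
res <- transform(data,
                 col2 = as.integer(sub('\\d+/\\d+:(\\d+),.*', '\\1', col1)))
res$col2
# [1] 60 58 25
```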

 