简体   繁体   中英

Split a string into combinations of 2 characters and expand into data frame in R

I'm looking for a clean way to take a row from a table and expand it to multiple rows with near identical information except for one of the columns.

Here's an example where I start with this:

    sex cat         status      pairs
1   F       6,10    Cancer      6,10
2   F       8,10    Cancer      8,10
3   F      12,13    NoCancer    12,13
4   F   3,4,5,10    Cancer      
5   F       7,10    Cancer      7,10
6   F        4,8    NoCancer    4,8

And would like to end up with this:

    sex cat         status      pairs
1   F       6,10    Cancer      6,10
2   F       8,10    Cancer      8,10
3   F      12,13    NoCancer    12,13
4   F   3,4,5,10    Cancer      3,4
4   F   3,4,5,10    Cancer      3,5
4   F   3,4,5,10    Cancer      3,10
4   F   3,4,5,10    Cancer      4,5
4   F   3,4,5,10    Cancer      4,10
4   F   3,4,5,10    Cancer      5,10
5   F       7,10    Cancer      7,10
6   F        4,8    NoCancer    4,8

Now, I know that I can take a string and easily split it apart and then find all possible combinations of size m.

Something like this:

combn(x,2, simplify=F, function(x){ paste(x, collapse=",")} )

While I've done something similar to this in which I break down a string into individual elements and then use plyr (as suggested by the talented @recology_ via this gist )

In my previous example (as can be seen in the gist) the solution ended up being something similar to the following:

df <- data.frame(id =c(11,32,37),
                 name=c("rick","tom","joe"),
                 stringsAsFactors = FALSE)
library(plyr)
foo <- function(x){
  strsplit(x, "")[[1]]
}
ddply(df, .(id, name), summarise, letters=foo(name))

I've been unsuccessful in incorporating the combn() function into this pattern. Any suggestions would be highly appreciated.

Here's a way using data.tables

library(data.table)
DT <- as.data.table(df)
result <- DT[,combn(unlist(strsplit(cat,",")),2,paste,collapse=","),
             by=list(sex,cat,status)]
setnames(result,"V1","pairs")
result
#     sex      cat   status pairs
#  1:   F     6,10   Cancer  6,10
#  2:   F     8,10   Cancer  8,10
#  3:   F    12,13 NoCancer 12,13
#  4:   F 3,4,5,10   Cancer   3,4
#  5:   F 3,4,5,10   Cancer   3,5
#  6:   F 3,4,5,10   Cancer  3,10
#  7:   F 3,4,5,10   Cancer   4,5
#  8:   F 3,4,5,10   Cancer  4,10
#  9:   F 3,4,5,10   Cancer  5,10
# 10:   F     7,10   Cancer  7,10
# 11:   F      4,8 NoCancer   4,8

Note that I imported df using stringsAsFacctors=F , and the F for Female was interpreted as FALSE , so I needed df$sex <- "F" , but this shouldn't affect you.

I tried to edit this into @jlhoward's answer, but it got too long. So writing it separately. This answer basically builds over his nice and compact solution (+1), to address possible speed enhancements.

Firstly, strsplit is vectorised. Therefore, we could avoid splitting on each row by splitting them up-front first by taking advantage of the fact that data.table also allows easy creation and operation of columns of type list :

DT[, splits := strsplit(cat, ",", fixed=TRUE)]

Secondly, if the length of the splits is <= 2L, then we don't have to use combn - as nothing will change. This should result in more speed-ups proportional to the number of such columns.

DT[, { tmp = splits[[1L]]; 
       if (length(tmp) <= 2L) 
           list(pairs=pairs) 
       else 
           list(pairs=as.vector(combn(tmp, 2L, paste, collapse=","))) 
     }, 
by=list(sex, cat, status)]

Here are some benchmarks:

Prepare functions first:

## data.table solution from @jlhoward's
f1 <- function(DT) {
    result <- DT[,combn(unlist(strsplit(cat,",")),2,paste,collapse=","),
                 by=list(sex,cat,status)]
    setnames(result,"V1","pairs")
}

## slightly more efficient in terms of speed
f2 <- function(DT) {
    DT[, splits := strsplit(cat, ",", fixed=TRUE)]
    ans <- DT[, { tmp = splits[[1L]]; 
                 if (length(tmp) <= 2L) 
                   list(pairs=cat) 
                 else 
                   list(pairs=as.vector(combn(tmp, 2L, paste, collapse=","))) 
                },   
           by=list(sex, cat, status)]
}

The dplyr solution also splits for each group. In addition, the do.call(rbind, .) and data.frame(.) calls on each group would be really inefficient. I've simplified it to remove some function calls including do.call(rbind, .) .

The data.frame(.) call however can't be avoided, IIUC, as do(.) requires it.. Anyhow, adding the simplified version to the benchmarks as well:

f3 <- function(df) {
    twosplit <- function(df,varname = "cat"){
       strsplit(df[[varname]],split = ",")[[1L]] %>% 
       combn(2, paste, collapse=",") %>%
       data.frame(pairs = .)
    }
    df %>% group_by(sex, cat, status) %>% do(twosplit(.))
    # the results are not in the same order.. 
}

Update: (added @MatthewPlourde's solution as well)

f4 <- function(d) {
    pairs <- lapply(strsplit(d$cat, ','), function(x) apply(combn(x, 2), 2, paste, collapse=','))
    new.rows <- mapply(function(row, ps) as.data.frame(c(as.list(row), list(pairs=ps))), 
                   row=split(d, 1:nrow(d)), ps=pairs, SIMPLIFY=FALSE)
    do.call(rbind, new.rows)
}

Prepare data:

DT <- rbindlist(replicate(1e4L, df, simplify=FALSE))[, status := 1:nrow(DT)]
DF <- as.data.frame(DT)

Timings:

system.time(ans2 <- f2(DT)) ## 1.3s
system.time(ans1 <- f1(DT)) ## 4.9s
system.time(ans3 <- f3(DF)) ## 212s!
system.time(ans4 <- f4(DF)) ## stopped after 8 mins.

On a final note: You could avoid using combn here (which is really slow), if you're always needing just nC2 , with your own custom function, which I'll leave it to you.

Here is an approach via dplyr , heir to the throne of plyr :

library(dplyr)

twosplit <- function(df,varname = "V2"){
  strsplit(df[[varname]],split = ",") %>%
    unlist %>%
    combn(2, simplify=FALSE, function(x){ paste(x, collapse=",")} ) %>%
    do.call(rbind,.) %>%
    unname %>%
    data.frame(unname(df),pairs = .)
}

df %>%
  group_by(V2) %>%
  do(twosplit(.))

         V2    X1       X2       X3    X4 pairs
1     12,13 FALSE    12,13 NoCancer 12,13 12,13
2  3,4,5,10 FALSE 3,4,5,10   Cancer    NA   3,4
3  3,4,5,10 FALSE 3,4,5,10   Cancer    NA   3,5
4  3,4,5,10 FALSE 3,4,5,10   Cancer    NA  3,10
5  3,4,5,10 FALSE 3,4,5,10   Cancer    NA   4,5
6  3,4,5,10 FALSE 3,4,5,10   Cancer    NA  4,10
7  3,4,5,10 FALSE 3,4,5,10   Cancer    NA  5,10
8       4,8 FALSE      4,8 NoCancer   4,8   4,8
9      6,10 FALSE     6,10   Cancer  6,10  6,10
10     7,10 FALSE     7,10   Cancer  7,10  7,10
11     8,10 FALSE     8,10   Cancer  8,10  8,10

Here's a base R solution:

# define sample data
d <- read.table(text="    sex cat         status      pairs
1   F       6,10    Cancer      6,10
2   F       8,10    Cancer      8,10
3   F      12,13    NoCancer    12,13
4   F   3,4,5,10    Cancer      ''
5   F       7,10    Cancer      7,10
6   F        4,8    NoCancer    4,8", as.is=TRUE)


# add pairs column
pairs <- lapply(strsplit(d$cat, ','), function(x) apply(combn(x, 2), 2, paste, collapse=','))
new.rows <- mapply(function(row, ps) as.data.frame(c(as.list(row), list(pairs=ps))), 
                   row=split(d, 1:nrow(d)), ps=pairs, SIMPLIFY=FALSE)
do.call(rbind, new.rows)
#       sex      cat   status pairs pairs.1
# 1   FALSE     6,10   Cancer  6,10    6,10
# 2   FALSE     8,10   Cancer  8,10    8,10
# 3   FALSE    12,13 NoCancer 12,13   12,13
# 4.1 FALSE 3,4,5,10   Cancer           3,4
# 4.2 FALSE 3,4,5,10   Cancer           3,5
# 4.3 FALSE 3,4,5,10   Cancer          3,10
# 4.4 FALSE 3,4,5,10   Cancer           4,5
# 4.5 FALSE 3,4,5,10   Cancer          4,10
# 4.6 FALSE 3,4,5,10   Cancer          5,10
# 5   FALSE     7,10   Cancer  7,10    7,10
# 6   FALSE      4,8 NoCancer   4,8     4,8

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM