Split a string into 2-element combinations and expand into a data frame in R

Question

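I am looking for a clean way to take one row of a table and expand it into multiple rows that carry nearly the same information, except for one column.

Here is an example of what I am starting with: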

    sex cat         status      pairs
1   F       6,10    Cancer      6,10
2   F       8,10    Cancer      8,10
3   F      12,13    NoCancer    12,13
4   F   3,4,5,10    Cancer      
5   F       7,10    Cancer      7,10
6   F        4,8    NoCancer    4,8

And I would like to end up with this:

    sex cat         status      pairs
1   F       6,10    Cancer      6,10
2   F       8,10    Cancer      8,10
3   F      12,13    NoCancer    12,13
4   F   3,4,5,10    Cancer      3,4
4   F   3,4,5,10    Cancer      3,5
4   F   3,4,5,10    Cancer      3,10
4   F   3,4,5,10    Cancer      4,5
4   F   3,4,5,10    Cancer      4,10
4   F   3,4,5,10    Cancer      5,10
5   F       7,10    Cancer      7,10
6   F        4,8    NoCancer    4,8

Now, I know I can take a string, split it apart easily, and then find all possible combinations of size m.

Like so:

combn(x,2, simplify=F, function(x){ paste(x, collapse=",")} )

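I have done something similar before, where I broke a string into its individual elements and then used plyr (via this gist, suggested by the brilliant @recology_).

In my previous example (as you can see from the gist), the solution ended up looking something like this: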

df <- data.frame(id =c(11,32,37),
                 name=c("rick","tom","joe"),
                 stringsAsFactors = FALSE)
library(plyr)
foo <- function(x){
  strsplit(x, "")[[1]]
}
ddply(df, .(id, name), summarise, letters=foo(name))

I have not managed to work the combn() function into this pattern. Any suggestions would be greatly appreciated.

Answer 1

Here is an approach using data.table:

library(data.table)
DT <- as.data.table(df)
result <- DT[,combn(unlist(strsplit(cat,",")),2,paste,collapse=","),
             by=list(sex,cat,status)]
setnames(result,"V1","pairs")
result
#     sex      cat   status pairs
#  1:   F     6,10   Cancer  6,10
#  2:   F     8,10   Cancer  8,10
#  3:   F    12,13 NoCancer 12,13
#  4:   F 3,4,5,10   Cancer   3,4
#  5:   F 3,4,5,10   Cancer   3,5
#  6:   F 3,4,5,10   Cancer  3,10
#  7:   F 3,4,5,10   Cancer   4,5
#  8:   F 3,4,5,10   Cancer  4,10
#  9:   F 3,4,5,10   Cancer  5,10
# 10:   F     7,10   Cancer  7,10
# 11:   F      4,8 NoCancer   4,8

Note that I imported df with stringsAsFactors=F, and the F (for Female) in the sex column was read as FALSE, so I needed df$sex <- "F"; this should not affect you.
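
The F-to-FALSE conversion mentioned above comes from read.table guessing column types. A minimal sketch of one way to sidestep it, assuming the data is read from text (the variable names txt, guessed and as_text are made up for illustration):

```r
# Illustrative data only: a cut-down version of the question's table
txt <- "sex cat status
F 6,10 Cancer
F 3,4,5,10 Cancer"

# Default type guessing turns a column containing only "F" into logical FALSE
guessed <- read.table(text = txt, header = TRUE, as.is = TRUE)
is.logical(guessed$sex)   # TRUE

# Forcing every column to character keeps "F" as a string
as_text <- read.table(text = txt, header = TRUE, colClasses = "character")
as_text$sex               # "F" "F"
```

With colClasses = "character" the manual df$sex <- "F" repair should no longer be needed.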

Answer 2

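I tried to edit this into @jlhoward's answer, but it got too long, so I am writing it up separately. This answer basically builds on his neat solution (+1) to address possible speed improvements.

First, strsplit is vectorised. Since data.table also makes it easy to create and operate on columns of type list, we can do the split once for the whole column instead of once per row: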

DT[, splits := strsplit(cat, ",", fixed=TRUE)]
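
To illustrate the vectorisation on its own, here is a small base R sketch (the vector cats is made-up sample input taken from the question's cat column). A single strsplit call returns one list element per input string:

```r
# Sample input: three values from the question's cat column
cats <- c("6,10", "3,4,5,10", "4,8")

# One call splits every string; the result is a list of character vectors
splits <- strsplit(cats, ",", fixed = TRUE)

lengths(splits)   # 2 4 2
splits[[2]]       # "3" "4" "5" "10"
```

This list is what gets stored in the splits column, one element per row.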

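Second, if the length of a split is <= 2L, we do not have to use combn at all, because nothing is going to change. This should give a further speedup, proportional to the number of such rows.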

DT[, { tmp = splits[[1L]]; 
       if (length(tmp) <= 2L) 
           list(pairs=pairs) 
       else 
           list(pairs=as.vector(combn(tmp, 2L, paste, collapse=","))) 
     }, 
by=list(sex, cat, status)]

Here are some benchmarks:

First, prepare the functions:

## data.table solution from @jlhoward's
f1 <- function(DT) {
    result <- DT[,combn(unlist(strsplit(cat,",")),2,paste,collapse=","),
                 by=list(sex,cat,status)]
    setnames(result,"V1","pairs")
}

## slightly more efficient in terms of speed
f2 <- function(DT) {
    DT[, splits := strsplit(cat, ",", fixed=TRUE)]
    ans <- DT[, { tmp = splits[[1L]]; 
                 if (length(tmp) <= 2L) 
                   list(pairs=cat) 
                 else 
                   list(pairs=as.vector(combn(tmp, 2L, paste, collapse=","))) 
                },   
           by=list(sex, cat, status)]
}

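The dplyr solution also calls strsplit for each group. In addition, the do.call(rbind, .) and data.frame(.) calls on each group are really quite inefficient. I have simplified it a bit to remove some function calls, including do.call(rbind, .).

However, the call to data.frame(.) cannot be avoided, IIUC, since do(.) requires it. In any case, I have added the simplified version to the benchmarks as well: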

f3 <- function(df) {
    twosplit <- function(df,varname = "cat"){
       strsplit(df[[varname]],split = ",")[[1L]] %>% 
       combn(2, paste, collapse=",") %>%
       data.frame(pairs = .)
    }
    df %>% group_by(sex, cat, status) %>% do(twosplit(.))
    # the results are not in the same order.. 
}

Update: (also added @MatthewPlourde's solution)

f4 <- function(d) {
    pairs <- lapply(strsplit(d$cat, ','), function(x) apply(combn(x, 2), 2, paste, collapse=','))
    new.rows <- mapply(function(row, ps) as.data.frame(c(as.list(row), list(pairs=ps))), 
                   row=split(d, 1:nrow(d)), ps=pairs, SIMPLIFY=FALSE)
    do.call(rbind, new.rows)
}

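Prepare the data:

# seq_len(.N) instead of 1:nrow(DT): DT does not exist yet when the right-hand side is evaluated
DT <- rbindlist(replicate(1e4L, df, simplify=FALSE))[, status := seq_len(.N)]
DF <- as.data.frame(DT)

Timings: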

system.time(ans2 <- f2(DT)) ## 1.3s
system.time(ans1 <- f1(DT)) ## 4.9s
system.time(ans3 <- f3(DF)) ## 212s!
system.time(ans4 <- f4(DF)) ## stopped after 8 mins.

A final note: if you only ever need nC2, you can avoid combn here (which is really slow) by writing your own custom function. I will leave that to you.
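
As a rough sketch of what such a custom function might look like (the name pairs_of is hypothetical and does not come from any of the answers), all 2-element combinations can be built with vectorised indexing instead of combn. It assumes the input has already been split into a character vector, and no timing claim is made for it:

```r
# Hypothetical helper: all 2-element combinations of x, pasted with `sep`,
# in the same order that combn(x, 2) produces them.
pairs_of <- function(x, sep = ",") {
  n <- length(x)
  if (n < 2L) return(character(0))
  counts <- (n - 1L):1L                        # partners left for each first element
  i <- rep(seq_len(n - 1L), times = counts)    # index of the first element of each pair
  j <- i + sequence(counts)                    # index of the second element
  paste(x[i], x[j], sep = sep)
}

pairs_of(c("3", "4", "5", "10"))
# "3,4" "3,5" "3,10" "4,5" "4,10" "5,10"
```

In f2 this could stand in for the combn branch, for example list(pairs = pairs_of(tmp)). Whether it is actually faster on your data would need to be benchmarked.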

Answer 3

Here is one via dplyr (the successor to plyr):

library(dplyr)

twosplit <- function(df,varname = "V2"){
  strsplit(df[[varname]],split = ",") %>%
    unlist %>%
    combn(2, simplify=FALSE, function(x){ paste(x, collapse=",")} ) %>%
    do.call(rbind,.) %>%
    unname %>%
    data.frame(unname(df),pairs = .)
}

df %>%
  group_by(V2) %>%
  do(twosplit(.))

         V2    X1       X2       X3    X4 pairs
1     12,13 FALSE    12,13 NoCancer 12,13 12,13
2  3,4,5,10 FALSE 3,4,5,10   Cancer    NA   3,4
3  3,4,5,10 FALSE 3,4,5,10   Cancer    NA   3,5
4  3,4,5,10 FALSE 3,4,5,10   Cancer    NA  3,10
5  3,4,5,10 FALSE 3,4,5,10   Cancer    NA   4,5
6  3,4,5,10 FALSE 3,4,5,10   Cancer    NA  4,10
7  3,4,5,10 FALSE 3,4,5,10   Cancer    NA  5,10
8       4,8 FALSE      4,8 NoCancer   4,8   4,8
9      6,10 FALSE     6,10   Cancer  6,10  6,10
10     7,10 FALSE     7,10   Cancer  7,10  7,10
11     8,10 FALSE     8,10   Cancer  8,10  8,10

Answer 4

Here is a base R solution:

# define sample data
d <- read.table(text="    sex cat         status      pairs
1   F       6,10    Cancer      6,10
2   F       8,10    Cancer      8,10
3   F      12,13    NoCancer    12,13
4   F   3,4,5,10    Cancer      ''
5   F       7,10    Cancer      7,10
6   F        4,8    NoCancer    4,8", as.is=TRUE)


# add pairs column
pairs <- lapply(strsplit(d$cat, ','), function(x) apply(combn(x, 2), 2, paste, collapse=','))
new.rows <- mapply(function(row, ps) as.data.frame(c(as.list(row), list(pairs=ps))), 
                   row=split(d, 1:nrow(d)), ps=pairs, SIMPLIFY=FALSE)
do.call(rbind, new.rows)
#       sex      cat   status pairs pairs.1
# 1   FALSE     6,10   Cancer  6,10    6,10
# 2   FALSE     8,10   Cancer  8,10    8,10
# 3   FALSE    12,13 NoCancer 12,13   12,13
# 4.1 FALSE 3,4,5,10   Cancer           3,4
# 4.2 FALSE 3,4,5,10   Cancer           3,5
# 4.3 FALSE 3,4,5,10   Cancer          3,10
# 4.4 FALSE 3,4,5,10   Cancer           4,5
# 4.5 FALSE 3,4,5,10   Cancer          4,10
# 4.6 FALSE 3,4,5,10   Cancer          5,10
# 5   FALSE     7,10   Cancer  7,10    7,10
# 6   FALSE      4,8 NoCancer   4,8     4,8
