I'm looking for a clean way to take a row from a table and expand it to multiple rows with near identical information except for one of the columns.
Here's an example where I start with this:
sex cat status pairs
1 F 6,10 Cancer 6,10
2 F 8,10 Cancer 8,10
3 F 12,13 NoCancer 12,13
4 F 3,4,5,10 Cancer
5 F 7,10 Cancer 7,10
6 F 4,8 NoCancer 4,8
And would like to end up with this:
sex cat status pairs
1 F 6,10 Cancer 6,10
2 F 8,10 Cancer 8,10
3 F 12,13 NoCancer 12,13
4 F 3,4,5,10 Cancer 3,4
4 F 3,4,5,10 Cancer 3,5
4 F 3,4,5,10 Cancer 3,10
4 F 3,4,5,10 Cancer 4,5
4 F 3,4,5,10 Cancer 4,10
4 F 3,4,5,10 Cancer 5,10
5 F 7,10 Cancer 7,10
6 F 4,8 NoCancer 4,8
Now, I know that I can take a string and easily split it apart and then find all possible combinations of size m.
Something like this:
combn(x,2, simplify=F, function(x){ paste(x, collapse=",")} )
While I've done something similar to this in which I break down a string into individual elements and then use plyr
(as suggested by the talented @recology_ via this gist )
In my previous example (as can be seen in the gist) the solution ended up being something similar to the following:
df <- data.frame(id =c(11,32,37),
name=c("rick","tom","joe"),
stringsAsFactors = FALSE)
library(plyr)
foo <- function(x){
strsplit(x, "")[[1]]
}
ddply(df, .(id, name), summarise, letters=foo(name))
I've been unsuccessful in incorporating the combn() function into this pattern. Any suggestions would be highly appreciated.
Here's a way using data.tables
library(data.table)
DT <- as.data.table(df)
result <- DT[,combn(unlist(strsplit(cat,",")),2,paste,collapse=","),
by=list(sex,cat,status)]
setnames(result,"V1","pairs")
result
# sex cat status pairs
# 1: F 6,10 Cancer 6,10
# 2: F 8,10 Cancer 8,10
# 3: F 12,13 NoCancer 12,13
# 4: F 3,4,5,10 Cancer 3,4
# 5: F 3,4,5,10 Cancer 3,5
# 6: F 3,4,5,10 Cancer 3,10
# 7: F 3,4,5,10 Cancer 4,5
# 8: F 3,4,5,10 Cancer 4,10
# 9: F 3,4,5,10 Cancer 5,10
# 10: F 7,10 Cancer 7,10
# 11: F 4,8 NoCancer 4,8
Note that I imported df
using stringsAsFacctors=F
, and the F
for Female
was interpreted as FALSE
, so I needed df$sex <- "F"
, but this shouldn't affect you.
I tried to edit this into @jlhoward's answer, but it got too long. So writing it separately. This answer basically builds over his nice and compact solution (+1), to address possible speed enhancements.
Firstly, strsplit
is vectorised. Therefore, we could avoid splitting on each row by splitting them up-front first by taking advantage of the fact that data.table
also allows easy creation and operation of columns of type list
:
DT[, splits := strsplit(cat, ",", fixed=TRUE)]
Secondly, if the length of the splits is <= 2L, then we don't have to use combn
- as nothing will change. This should result in more speed-ups proportional to the number of such columns.
DT[, { tmp = splits[[1L]];
if (length(tmp) <= 2L)
list(pairs=pairs)
else
list(pairs=as.vector(combn(tmp, 2L, paste, collapse=",")))
},
by=list(sex, cat, status)]
Here are some benchmarks:
## data.table solution from @jlhoward's
f1 <- function(DT) {
result <- DT[,combn(unlist(strsplit(cat,",")),2,paste,collapse=","),
by=list(sex,cat,status)]
setnames(result,"V1","pairs")
}
## slightly more efficient in terms of speed
f2 <- function(DT) {
DT[, splits := strsplit(cat, ",", fixed=TRUE)]
ans <- DT[, { tmp = splits[[1L]];
if (length(tmp) <= 2L)
list(pairs=cat)
else
list(pairs=as.vector(combn(tmp, 2L, paste, collapse=",")))
},
by=list(sex, cat, status)]
}
The dplyr
solution also splits for each group. In addition, the do.call(rbind, .)
and data.frame(.)
calls on each group would be really inefficient. I've simplified it to remove some function calls including do.call(rbind, .)
.
The data.frame(.)
call however can't be avoided, IIUC, as do(.)
requires it.. Anyhow, adding the simplified version to the benchmarks as well:
f3 <- function(df) {
twosplit <- function(df,varname = "cat"){
strsplit(df[[varname]],split = ",")[[1L]] %>%
combn(2, paste, collapse=",") %>%
data.frame(pairs = .)
}
df %>% group_by(sex, cat, status) %>% do(twosplit(.))
# the results are not in the same order..
}
f4 <- function(d) {
pairs <- lapply(strsplit(d$cat, ','), function(x) apply(combn(x, 2), 2, paste, collapse=','))
new.rows <- mapply(function(row, ps) as.data.frame(c(as.list(row), list(pairs=ps))),
row=split(d, 1:nrow(d)), ps=pairs, SIMPLIFY=FALSE)
do.call(rbind, new.rows)
}
DT <- rbindlist(replicate(1e4L, df, simplify=FALSE))[, status := 1:nrow(DT)]
DF <- as.data.frame(DT)
system.time(ans2 <- f2(DT)) ## 1.3s
system.time(ans1 <- f1(DT)) ## 4.9s
system.time(ans3 <- f3(DF)) ## 212s!
system.time(ans4 <- f4(DF)) ## stopped after 8 mins.
On a final note: You could avoid using combn
here (which is really slow), if you're always needing just nC2
, with your own custom function, which I'll leave it to you.
Here is an approach via dplyr
, heir to the throne of plyr
:
library(dplyr)
twosplit <- function(df,varname = "V2"){
strsplit(df[[varname]],split = ",") %>%
unlist %>%
combn(2, simplify=FALSE, function(x){ paste(x, collapse=",")} ) %>%
do.call(rbind,.) %>%
unname %>%
data.frame(unname(df),pairs = .)
}
df %>%
group_by(V2) %>%
do(twosplit(.))
V2 X1 X2 X3 X4 pairs
1 12,13 FALSE 12,13 NoCancer 12,13 12,13
2 3,4,5,10 FALSE 3,4,5,10 Cancer NA 3,4
3 3,4,5,10 FALSE 3,4,5,10 Cancer NA 3,5
4 3,4,5,10 FALSE 3,4,5,10 Cancer NA 3,10
5 3,4,5,10 FALSE 3,4,5,10 Cancer NA 4,5
6 3,4,5,10 FALSE 3,4,5,10 Cancer NA 4,10
7 3,4,5,10 FALSE 3,4,5,10 Cancer NA 5,10
8 4,8 FALSE 4,8 NoCancer 4,8 4,8
9 6,10 FALSE 6,10 Cancer 6,10 6,10
10 7,10 FALSE 7,10 Cancer 7,10 7,10
11 8,10 FALSE 8,10 Cancer 8,10 8,10
Here's a base R solution:
# define sample data
d <- read.table(text=" sex cat status pairs
1 F 6,10 Cancer 6,10
2 F 8,10 Cancer 8,10
3 F 12,13 NoCancer 12,13
4 F 3,4,5,10 Cancer ''
5 F 7,10 Cancer 7,10
6 F 4,8 NoCancer 4,8", as.is=TRUE)
# add pairs column
pairs <- lapply(strsplit(d$cat, ','), function(x) apply(combn(x, 2), 2, paste, collapse=','))
new.rows <- mapply(function(row, ps) as.data.frame(c(as.list(row), list(pairs=ps))),
row=split(d, 1:nrow(d)), ps=pairs, SIMPLIFY=FALSE)
do.call(rbind, new.rows)
# sex cat status pairs pairs.1
# 1 FALSE 6,10 Cancer 6,10 6,10
# 2 FALSE 8,10 Cancer 8,10 8,10
# 3 FALSE 12,13 NoCancer 12,13 12,13
# 4.1 FALSE 3,4,5,10 Cancer 3,4
# 4.2 FALSE 3,4,5,10 Cancer 3,5
# 4.3 FALSE 3,4,5,10 Cancer 3,10
# 4.4 FALSE 3,4,5,10 Cancer 4,5
# 4.5 FALSE 3,4,5,10 Cancer 4,10
# 4.6 FALSE 3,4,5,10 Cancer 5,10
# 5 FALSE 7,10 Cancer 7,10 7,10
# 6 FALSE 4,8 NoCancer 4,8 4,8
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.