I have a data.frame that looks like this:
repex <- structure(list(cat = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("x",
"y", "z"), class = "factor"), year = c(1980, 1980, 1982, 1982,
1990, 1991, 1991, 1991, 1993, 1981, 1981, 1983, 1990, 1996, 1996,
1996, 1996, 1999, 2002, 1994), org = structure(c(2L, 3L, 4L,
2L, 5L, 6L, 7L, 8L, 9L, 2L, 3L, 5L, 3L, 10L, 11L, 4L, 9L, 10L,
3L, 9L), .Label = c("709340", "a", "b", "c", "d", "f", "j", "k",
"e", "h", "m"), class = "factor")), .Names = c("cat", "year",
"org"), row.names = c(NA, 20L), class = "data.frame")
I want to create a new object (ideally a data.table or data.frame) in which the elements of org are laid out horizontally after each cat, year combination.
I tried to run the following:
repex <- data.table(repex)
setkey(repex,cat,year)
repex[, list(org), by="cat,year"] #OR
repex[, paste(org,sep="_"), by="cat,year"] # OR
with(repex, tapply(org,paste(cat,year,sep="_"),paste))
The first two data.table options merely return a copy of the entire data.table. The tapply option (applied to repex as either a data.table or a data.frame) works for a small dataset, but it creates a list object, which is inconvenient because I need to add the output to another data.frame that is based on the cat_year combination. Additionally, for a long dataset (nrow > 100,000) it takes forever, especially as in some cases it needs to paste more than 100 org variants.
My desired output would be a data.table that looks something like this:
x 1980 a b
x 1982 a c # org would ideally be rearranged
x 1990 d
x 1991 f j k
...
y 1996 c e h m
...
z 2002 b
One of your actual problems is using the incorrect argument to paste. You are looking for collapse, not sep. Another problem is using "data.table" syntax incorrectly.
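To make the difference concrete, here is a minimal sketch in base R. The vector orgs is a made-up stand-in for the org values of one cat, year group, not data from the question.

```r
# Hypothetical stand-in for the org values of a single group
orgs <- c("a", "b", "c")

# sep goes between the *arguments* of paste(). With a single vector
# argument there is nothing to separate, so the vector comes back as is.
with_sep <- paste(orgs, sep = "_")

# collapse joins the *elements* of the result into one string.
with_collapse <- paste(orgs, collapse = "_")

print(with_sep)       # still three separate strings
print(with_collapse)  # a single string
```

This is why the attempts with sep appear to just copy the input: each row keeps its own org value.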
Considering the comments to this answer, I would suggest something like this instead:
library(data.table)
library(reshape2)
DT <- as.data.table(repex)
setkey(DT, cat, year, org) ## Sorts everything
## Creates a column "var" with the sequence of values ("V1", "V2", and so on)
DT[, var := paste("V", sequence(.N), sep = ""), by = list(cat, year)]
head(DT)
# cat year org var
# 1: x 1980 a V1
# 2: x 1980 b V2
# 3: x 1982 a V1
# 4: x 1982 c V2
# 5: x 1990 d V1
# 6: x 1991 f V1
Convert that to a "wide" format:
dcast.data.table(DT, cat + year ~ var, value.var="org")
# cat year V1 V2 V3 V4
# 1: x 1980 a b NA NA
# 2: x 1982 a c NA NA
# 3: x 1990 d NA NA NA
# 4: x 1991 f j k NA
# 5: x 1993 e NA NA NA
# 6: y 1981 a b NA NA
# 7: y 1983 d NA NA NA
# 8: y 1990 b NA NA NA
# 9: y 1996 c e h m
# 10: z 1994 e NA NA NA
# 11: z 1999 h NA NA NA
# 12: z 2002 b NA NA NA
This is a pretty straightforward aggregate problem:
aggregate(org ~ cat + year, repex, function(x) paste(sort(x), collapse = " "))
# cat year org
# 1 x 1980 a b
# 2 y 1981 a b
# 3 x 1982 a c
# 4 y 1983 d
# 5 x 1990 d
# 6 y 1990 b
# 7 x 1991 f j k
# 8 x 1993 e
# 9 z 1994 e
# 10 y 1996 c e h m
# 11 z 1999 h
# 12 z 2002 b
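Since the question mentions adding the result to another data.frame based on the cat, year combination, here is a rough sketch of how the aggregated output could be attached with base R's merge. The data frames df and other, and the column value, are hypothetical stand-ins, not objects from the question.

```r
# Hypothetical miniature of repex (same column names, made-up rows)
df <- data.frame(cat  = c("x", "x", "y"),
                 year = c(1980, 1980, 1981),
                 org  = c("b", "a", "a"),
                 stringsAsFactors = FALSE)

# One row per cat/year, orgs sorted and collapsed into a single string
collapsed <- aggregate(org ~ cat + year, df,
                       function(x) paste(sort(x), collapse = " "))

# Hypothetical target table that is keyed on the same two columns
other <- data.frame(cat   = c("x", "y"),
                    year  = c(1980, 1981),
                    value = c(10, 20),
                    stringsAsFactors = FALSE)

# Left join: keep every row of other, add org where a match exists
merged <- merge(other, collapsed, by = c("cat", "year"), all.x = TRUE)
print(merged)
```

Because the result of aggregate is an ordinary data.frame rather than a list, no pasted cat_year key is needed for the join.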
A "data.table" approach:
library(data.table)
DT <- as.data.table(repex)
DT[, list(org = paste(sort(org), collapse = " ")), by = list(cat, year)]
And, to round things out, a "dplyr" approach:
library(dplyr)
## %.% was the pipe in early dplyr releases; current versions use %>%
repex %>% group_by(cat, year) %>% summarise(org = paste(sort(org), collapse = " "))
@Anandaaaaaaaaaaaaaaaa,
Here's my inelegant way of solving the problem myself. I am sure there is an easier way that takes your advice but just thought I'd share as well.
Step 1: Paste all the org into a list
tmp1 <- with(repex, tapply(org,paste(cat,year,sep="_"), paste))
Step 2: Find the longest length of the list (very inelegantly)
## tmp1 is the list from Step 1 (the original post referred to it as fy_ids here)
x <- integer(length(tmp1))
for (i in seq_along(tmp1)) {
  x[i] <- length(tmp1[[i]])
}
max(x)
Step 3: Using the maximum of x, construct a data.frame in which each organization occurs in a new cell (with special thanks to @agstudy for a previous answer)
tmp <- do.call(rbind,lapply(tmp1,
function(y)
if(length(y)>0)c(y,rep(NA, max(x)-length(y)))
else c(y,rep(NA,max(x)))))
Step 4: Turn tmp into a data.frame
tmp <- data.frame(tmp)
I know it's pretty cumbersome, but it has the advantage of making the search for a specific org a lot easier, as each org appears in a different cell.
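For what it's worth, the four steps above can be sketched more compactly in base R, along with the kind of search that the one-org-per-cell layout makes easy. This is only a sketch under assumptions: df is a made-up miniature of repex, and the names groups, n_max, wide and hits are illustrative. It also needs R 3.2.0 or later for lengths().

```r
# Hypothetical miniature of repex (same column names, made-up rows)
df <- data.frame(cat  = c("x", "x", "x", "y"),
                 year = c(1980, 1980, 1982, 1981),
                 org  = c("a", "b", "c", "a"),
                 stringsAsFactors = FALSE)

# Step 1: one character vector of orgs per cat_year key
groups <- split(df$org, paste(df$cat, df$year, sep = "_"))

# Step 2: length of the longest group, without an explicit loop
n_max <- max(lengths(groups))

# Step 3: pad every group with NA up to n_max and stack the rows
wide <- do.call(rbind, lapply(groups, function(y) {
  length(y) <- n_max  # extending a vector fills the new slots with NA
  y
}))

# Step 4: turn the matrix into a data.frame (row names are the keys)
wide <- data.frame(wide, stringsAsFactors = FALSE)

# Search: the cat_year keys in which org "a" appears in any cell
hits <- rownames(wide)[apply(wide == "a", 1, any, na.rm = TRUE)]
print(hits)
```

Assigning to length() does the NA padding that the rep() calls handle in Step 3 of the original.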