Combine cells from different rows of a data.table or data.frame based on a simple condition

I have a data.frame that looks like this

repex <- structure(list(cat = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("x", 
"y", "z"), class = "factor"), year = c(1980, 1980, 1982, 1982, 
1990, 1991, 1991, 1991, 1993, 1981, 1981, 1983, 1990, 1996, 1996, 
1996, 1996, 1999, 2002, 1994), org = structure(c(2L, 3L, 4L, 
2L, 5L, 6L, 7L, 8L, 9L, 2L, 3L, 5L, 3L, 10L, 11L, 4L, 9L, 10L, 
3L, 9L), .Label = c("709340", "a", "b", "c", "d", "f", "j", "k", 
"e", "h", "m"), class = "factor")), .Names = c("cat", "year", 
"org"), row.names = c(NA, 20L), class = "data.frame")

I want to create a new object (ideally a data.table or data.frame) in which the elements of org are listed horizontally after each specific cat, year combination.

I tried to run the following:

repex <- data.table(repex)
setkey(repex,cat,year)
repex[, list(org), by="cat,year"]  #OR
repex[, paste(org,sep="_"), by="cat,year"] # OR
with(repex, tapply(org,paste(cat,year,sep="_"),paste))

The first two data.table options merely copy the entire data.table. The tapply option (applied to repex as either a data.table or a data.frame) works for a small dataset, but it creates a list object, which is inconvenient because I need to add the output to another data.frame that is based on the cat_year combination. Additionally, for a long dataset (nrow > 100,000) it takes forever, especially as in some cases it needs to paste more than 100 org variants.

My desired output would be a data.table that looks something like this:

x 1980 a b
x 1982 a c # org would ideally be rearranged
x 1990 d
x 1991 f j k 
...
y 1996 c e h m
...
z 2002 b

One of your actual problems is passing the wrong argument to paste: you are looking for collapse, not sep. Another problem is that the "data.table" syntax is used incorrectly.
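To make the difference between the two arguments concrete, here is a minimal sketch. The vector `org` below is a made-up stand-in for the values of one cat/year group, not part of the original data:

```r
# Hypothetical values for a single cat/year group
org <- c("a", "b", "c")

# `sep` only separates *different arguments* that are pasted element-wise.
# With a single vector there is nothing to separate, so the result is
# still a vector of length 3.
sep_result <- paste(org, sep = "_")

# `collapse` joins the elements of the result into one string.
collapse_result <- paste(org, collapse = "_")

print(sep_result)       # "a" "b" "c"
print(collapse_result)  # "a_b_c"
```

This is why the `paste(org, sep = "_")` attempt returns one row per original row instead of one row per group.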


Update

Considering the comments to this answer, I would suggest something like this instead:

library(data.table)
library(reshape2)
DT <- as.data.table(repex)

setkey(DT, cat, year, org) ## Sorts everything

## Creates a column "var" with the sequence of values ("V1", "V2", and so on)
DT[, var := paste("V", sequence(.N), sep = ""), by = list(cat, year)]
head(DT)
#    cat year org var
# 1:   x 1980   a  V1
# 2:   x 1980   b  V2
# 3:   x 1982   a  V1
# 4:   x 1982   c  V2
# 5:   x 1990   d  V1
# 6:   x 1991   f  V1

Then convert that to a "wide" format:

dcast.data.table(DT, cat + year ~ var, value.var="org")
#     cat year V1 V2 V3 V4
#  1:   x 1980  a  b NA NA
#  2:   x 1982  a  c NA NA
#  3:   x 1990  d NA NA NA
#  4:   x 1991  f  j  k NA
#  5:   x 1993  e NA NA NA
#  6:   y 1981  a  b NA NA
#  7:   y 1983  d NA NA NA
#  8:   y 1990  b NA NA NA
#  9:   y 1996  c  e  h  m
# 10:   z 1994  e NA NA NA
# 11:   z 1999  h NA NA NA
# 12:   z 2002  b NA NA NA
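In more recent versions of "data.table" the same reshaping can probably be done without the helper column and without "reshape2". The following is only a sketch under that assumption: it uses `dcast()` and `rowid()` from "data.table" itself, and it runs on a shortened stand-in for `repex` (eight rows taken from the question's data) so that it is self-contained:

```r
library(data.table)

# Shortened stand-in for the question's data (a subset of its rows)
DT <- data.table(
  cat  = c("x", "x", "x", "x", "y", "y", "y", "y"),
  year = c(1980, 1980, 1982, 1982, 1996, 1996, 1996, 1996),
  org  = c("a", "b", "c", "a", "h", "m", "c", "e")
)

# Sort so that org appears in order within each cat/year group
setorder(DT, cat, year, org)

# rowid() numbers the rows within each cat/year group; the prefix turns
# those numbers into the column names V1, V2, ...
wide <- dcast(DT, cat + year ~ rowid(cat, year, prefix = "V"),
              value.var = "org")
print(wide)
```

One thing to watch with either version: with ten or more values per group, names such as "V10" are likely to sort before "V2", so the columns may need reordering afterwards.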

Original answer

This is a pretty straightforward aggregate problem:

aggregate(org ~ cat + year, repex, function(x) paste(sort(x), collapse = " "))
#    cat year     org
# 1    x 1980     a b
# 2    y 1981     a b
# 3    x 1982     a c
# 4    y 1983       d
# 5    x 1990       d
# 6    y 1990       b
# 7    x 1991   f j k
# 8    x 1993       e
# 9    z 1994       e
# 10   y 1996 c e h m
# 11   z 1999       h
# 12   z 2002       b

A "data.table" approach:

library(data.table)
DT <- as.data.table(repex)
DT[, list(org = paste(sort(org), collapse = " ")), by = list(cat, year)]

And, to round things out, a "dplyr" approach:

library(dplyr)
repex %>% group_by(cat, year) %>% summarise(org = paste(sort(org), collapse = " "))
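Since the question mentions adding the result to another table that is based on the cat/year combination, here is a rough sketch of how the collapsed result could be merged onto such a table. The table `other` and its `value` column are invented purely for illustration, and `repex` is replaced by a shortened stand-in (a subset of the question's rows) so the example runs on its own:

```r
library(data.table)

# Shortened stand-in for the question's data (a subset of its rows)
repex <- data.table(
  cat  = c("x", "x", "x", "x", "y", "y", "y", "y"),
  year = c(1980, 1980, 1982, 1982, 1996, 1996, 1996, 1996),
  org  = c("a", "b", "c", "a", "h", "m", "c", "e")
)

# One row per cat/year group, with org collapsed into a single string
collapsed <- repex[, list(org = paste(sort(org), collapse = " ")),
                   by = list(cat, year)]

# Hypothetical second table keyed on the same cat/year combination
other <- data.table(cat = c("x", "y"), year = c(1980, 1996),
                    value = c(10, 20))

# Left join: keep every row of `other`, attach the collapsed org string
merged <- merge(other, collapsed, by = c("cat", "year"), all.x = TRUE)
print(merged)
```

Because the result is an ordinary data.table with one row per group, it avoids the list object that tapply produces.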

@Anandaaaaaaaaaaaaaaaa,

Here's my inelegant way of solving the problem myself. I am sure there is an easier way that takes your advice but just thought I'd share as well.

Step 1: Paste all the org into a list

tmp1 <- with(repex, tapply(org,paste(cat,year,sep="_"), paste))

Step 2: Find the longest length of the list (very inelegantly)

x <- integer(length(tmp1))
for (i in seq_along(tmp1)) {
  # number of org values in the i-th cat_year group
  x[i] <- length(tmp1[[i]])
}
max(x)

Step 3: Using the maximum of x, construct a data.frame in which each organization occurs in a new cell (with special thanks to @agstudy for a previous answer):

tmp <- do.call(rbind,lapply(tmp1,
               function(y)
                 if(length(y)>0)c(y,rep(NA, max(x)-length(y)))
                             else c(y,rep(NA,max(x)))))

Step 4: Turn tmp into a data.frame

tmp <- data.frame(tmp)

I know it's pretty cumbersome, but it has the advantage of making it a lot easier to search for a specific org, as each org appears in a different cell.
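For what it's worth, the four steps can probably be condensed in base R. This is only a sketch of one possible shortcut, not a replacement for the answers above: it uses `split()` instead of `tapply()`, `lengths()` instead of the loop, and `length<-` to pad each group with NA. It runs on a shortened stand-in for `repex` (a subset of the question's rows):

```r
# Shortened stand-in for the question's data (a subset of its rows)
repex <- data.frame(
  cat  = c("x", "x", "x", "x", "y", "y", "y", "y"),
  year = c(1980, 1980, 1982, 1982, 1996, 1996, 1996, 1996),
  org  = c("a", "b", "c", "a", "h", "m", "c", "e")
)

# Step 1: one character vector of org values per cat_year group
groups <- split(as.character(repex$org),
                paste(repex$cat, repex$year, sep = "_"))

# Step 2: size of the largest group
width <- max(lengths(groups))

# Steps 3 and 4: `length<-` extends each vector to `width`, filling with NA;
# rbind stacks the padded vectors into a matrix with one row per group
wide <- as.data.frame(do.call(rbind, lapply(groups, `length<-`, width)),
                      stringsAsFactors = FALSE)
print(wide)
```

The row names of `wide` are the cat_year labels, and the org values keep their original order within each group (they are not sorted here).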
