Create sequence with data.table

I have a data.table in the following format:

id | pet   | name  
2011-01-01 | "dog" | "a"  
2011-01-02 | "dog" | "b"  
2011-01-03 | "cat" | "c"  
2011-01-04 | "dog" | "a"  
2011-01-05 | "dog" | "some"   
2011-01-06 | "cat" | "thing"

I want to perform an aggregation that concatenates all the dog names that appear before each cat occurs, e.g.:

id | pet   | name   | prior  
2011-01-01 | "dog" | "a"     |  
2011-01-02 | "dog" | "b"     |  
2011-01-03 | "cat" | "c"     |  "a b"  
2011-01-04 | "dog" | "a"     |  
2011-01-05 | "dog" | "some"  |  
2011-01-06 | "cat" | "thing" | "a some"  

Try:

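 library(data.table)  # v1.9.5+
 # name[-.N] drops the last row of each group (the cat row); unlike
 # name[1:(.N-1)], it also behaves correctly when a group has a single row
 setDT(df1)[, prior := paste(name[-.N], collapse = ' '),
    .(group = cumsum(c(0, diff(pet == 'cat')) < 0))][pet != 'cat', prior := '']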
 #            id pet  name  prior
 #1: 2011-01-01  dog     a       
 #2: 2011-01-02  dog     b       
 #3: 2011-01-03  cat     c    a b
 #4: 2011-01-04  dog     a       
 #5: 2011-01-05  dog  some       
 #6: 2011-01-06  cat thing a some

Or a possible solution with shift (introduced in the devel version, i.e. v1.9.5), inspired by @David Arenburg's post. Instructions to install the devel version are here.

 setDT(df1)[, prior := paste(name[-.N], collapse= ' '), 
    .(group=cumsum(shift(pet, fill='cat')=='cat'))][pet!='cat', prior := '']
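
Both versions above depend on the grouping vector passed as `by`. As a minimal sketch, here is what the two expressions evaluate to on the `pet` column of the example data (the names `grp_diff` and `grp_shift` are only for illustration and do not appear in the answers):

```r
library(data.table)  # provides shift()

pet <- c("dog", "dog", "cat", "dog", "dog", "cat")

# diff-based grouping: the counter increases on the row right after a "cat"
grp_diff <- cumsum(c(0, diff(pet == "cat")) < 0)

# shift-based grouping: lag pet by one row, treating the position
# before the first row as if it were a "cat"
grp_shift <- cumsum(shift(pet, fill = "cat") == "cat")

print(grp_diff)   # 0 0 0 1 1 1
print(grp_shift)  # 1 1 1 2 2 2
```

In both cases each group ends with its cat row, so `name[-.N]` is the set of dog names that precede that cat. The two groupings agree on this example, but they can differ when two "cat" rows are adjacent, because `diff()` only registers a change from cat to non-cat.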

data

df1 <- structure(list(id = c("2011-01-01 ", "2011-01-02 ", "2011-01-03 ", 
 "2011-01-04 ", "2011-01-05 ", "2011-01-06 "), pet = c("dog", 
"dog", "cat", "dog", "dog", "cat"), name = c("a", "b", "c", "a", 
"some", "thing")), .Names = c("id", "pet", "name"), row.names = c(NA, 
-6L), class = "data.frame")

Here's another option:

indx <- setDT(DT)[, list(.I[.N], paste(name[-.N], collapse = ' ')), 
                    by = list(c(0L, cumsum(pet == "cat")[-nrow(DT)]))]
DT[indx$V1, prior := indx$V2]
DT
#            id pet  name  prior
# 1: 2011-01-01 dog     a     NA
# 2: 2011-01-02 dog     b     NA
# 3: 2011-01-03 cat     c    a b
# 4: 2011-01-04 dog     a     NA
# 5: 2011-01-05 dog  some     NA
# 6: 2011-01-06 cat thing a some
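
To see what the intermediate `indx` table holds, the option above can be run step by step. This is a sketch on a fresh copy of the example data (built here with `data.table()` directly, which is an assumption about how `DT` was created); the last line, which replaces `NA` with an empty string to match the question's expected output, is an optional addition and not part of the answer:

```r
library(data.table)

DT <- data.table(
  id   = sprintf("2011-01-0%d", 1:6),
  pet  = c("dog", "dog", "cat", "dog", "dog", "cat"),
  name = c("a", "b", "c", "a", "some", "thing")
)

# One row per group: V1 is the row number of the group's last row (.I[.N]),
# V2 is the concatenation of every name in the group except the last
indx <- DT[, list(.I[.N], paste(name[-.N], collapse = " ")),
           by = list(c(0L, cumsum(pet == "cat")[-nrow(DT)]))]

print(indx$V1)  # 3 6
print(indx$V2)  # "a b" "a some"

# Assign by row number; rows that are not listed in V1 stay NA
DT[indx$V1, prior := indx$V2]

# Optional: turn the NAs into empty strings
DT[is.na(prior), prior := ""]
print(DT)
```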

I ran each solution on my data set and compared the run times with rbenchmark.

I cannot share the data set, but here is some basic info:

dim(event_source_causal_parts)
[1] 311127      4
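
Since the data itself is not available, a synthetic table with the same number of rows could stand in for anyone who wants to rerun the timings. The sketch below is only that: the share of "Warranty" rows, the part-number values, and the names `synth` and `shift_version` are all made up for illustration, so the timings it gives will not match the ones reported here.

```r
library(data.table)

set.seed(1)
n <- 311127L  # same row count as the original table

# Hypothetical stand-in data: only the two columns the solutions use
synth <- data.frame(
  Source = sample(c("Warranty", "Other"), n, replace = TRUE, prob = c(0.2, 0.8)),
  Causal_Part_Number = sample(sprintf("P%03d", 1:200), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# The shift-based solution, wrapped so that it works on a copy of its input
shift_version <- function(d) {
  d <- as.data.table(d)  # as.data.table() copies, so synth is left untouched
  d[, prior := paste(Causal_Part_Number[-.N], collapse = " "),
    by = .(group = cumsum(shift(Source, fill = "Warranty") == "Warranty"))]
  d[Source != "Warranty", prior := ""]
  d[]
}

# A single timed run; wrap in rbenchmark::benchmark() for repeated runs
print(system.time(res <- shift_version(synth)))
```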

The code for the comparison:

require(rbenchmark)
benchmark({
  event_source_causal_parts <- augmented_data_no_software[, list(PROD_ID, Source, Event_Date, Causal_Part_Number)] 
  setDT(event_source_causal_parts)[, prior := paste(Causal_Part_Number[-.N], collapse = ' '), .(group=cumsum(c(0,diff(Source == "Warranty")) < 0))][Source != 'Warranty', prior := '']
 })

benchmark({
  event_source_causal_parts <- augmented_data_no_software[, list(PROD_ID, Source, Event_Date, Causal_Part_Number)] 
  setDT(event_source_causal_parts)[, prior := paste(Causal_Part_Number[-.N], collapse = ' '), .(group=cumsum(shift(Source, fill="Warranty") == "Warranty"))][Source != 'Warranty', prior := ''] 
  })


benchmark({
  event_source_causal_parts <- augmented_data_no_software[, list(PROD_ID, Source, Event_Date, Causal_Part_Number)] 
  indx <- setDT(event_source_causal_parts)[, list(.I[.N], paste(Causal_Part_Number[-.N], collapse = " ")),
                                       by = list(c(0L, cumsum(Source == "Warranty")[-nrow(event_source_causal_parts)]))]
})

The results are as follows:

  replications elapsed relative user.self sys.self user.child sys.child
1          100   12.91        1     12.76     0.05         NA        NA

  replications elapsed relative user.self sys.self user.child sys.child
1          100    12.7        1     12.66     0.05         NA        NA

  replications elapsed relative user.self sys.self user.child sys.child
1          100   61.97        1     61.65        0         NA        NA

My environment:

R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rbenchmark_1.0.0 stringr_0.6.2    data.table_1.9.5 vimcom_1.2-6    

loaded via a namespace (and not attached):
[1] chron_2.3-45    grid_3.1.2      lattice_0.20-30 tools_3.1.2     zoo_1.7-11 

R used the Intel MKL math libraries.

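Based on these results, I think that @akrun's second solution (the shift version, 12.70 s for 100 replications) is the fastest, although only marginally ahead of the first (12.91 s). The third option is far slower (61.97 s).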

I ran the test again after recompiling data.table with -O3 and updating R to 3.2.0. The results are very different:

  replications elapsed relative user.self sys.self user.child sys.child
1          100   21.22        1     20.73     0.48         NA        NA

  replications elapsed relative user.self sys.self user.child sys.child
1          100   11.31        1     10.39     0.92         NA        NA

  replications elapsed relative user.self sys.self user.child sys.child
1          100   35.77        1     35.53     0.25         NA        NA

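So the best solution (the shift version) is even faster under the new R with -O3 (11.31 s vs. 12.70 s), but the second-best solution (the diff version) is much slower (21.22 s vs. 12.91 s). The third option also improved (35.77 s vs. 61.97 s) but remains the slowest.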
