How to conditionally summarize on other entries in the group - R

In my dataset I have Cartesian coordinates of different items over time, identified by an EventID, event_type, ID number, x position, y position, identity type, broad category, and frame id number. For each combination of EventID, event_type, and frame id number, I need to go through each ID number and find which other ID number with a different broad category has the minimum distance from the current row. I would like to avoid for loops because the dataset is several million lines long.

I tried formulating this as a group_by and summarize call using dplyr, but couldn't quite wrap my head around how to call a function on the current row's x and y against all other x and y values and then choose the conditional minimum.

two_dim_euclid = function(x1, x2, y1, y2){
  a <- sqrt((x1 - x2)^2 + (y1 - y2)^2)
  return(a)
}


# Example Data
df <- data.frame(stringsAsFactors = FALSE,
                 EventID = c(1003, 1003, 1003, 1003),
                 event_type = c(893, 893, 893, 893),
                 ID_number = c(80427, 2346, 24954, 27765),
                 x = c(86.07, 72.4, 43.08, 80.13),
                 y = c(35.58, 26.43, 34.8, 34.79),
                 identity_type = c("A", "C", "B", "B"),
                 broad_category = c("set1", "set1", "set2", "set2"),
                 frame_id = c(1, 1, 1, 1))
df
#  EventID event_type ID_number x     y     identity_type broad_category frame_id
#1 1003    893        80427     86.07 35.58 A             set1           1
#2 1003    893        2346      72.40 26.43 C             set1           1
#3 1003    893        24954     43.08 34.80 B             set2           1
#4 1003    893        27765     80.13 34.79 B             set2           1

The expected result for row 1 is 5.992303: it looks at all entries not belonging to set1 that share the same EventID, event_type, and frame_id, and returns the minimum Euclidean distance among them.

Also, I want to do this for every entry with identity type A. However, identity_type and broad_category are not always tied together: A can belong to either set1 or set2.

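Here's a base R way that relies on dist(). Note that it only excludes each row's zero distance to itself; it does not filter on the grouping columns or on broad_category.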

res <- as.matrix(dist(cbind(df$x, df$y)))
res[res == 0] <- Inf

apply(res, 1, min)

        1         2         3         4 
 5.992303 11.386066 30.491299  5.992303 

# or potentially  more performant
res[cbind(seq_len(nrow(res)), max.col(-res))]

[1]  5.992303 11.386066 30.491299  5.992303
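The dist() idea can be extended to respect the question's conditions by working one group at a time and masking same-category pairs. The following is a minimal sketch, not part of the original answer: the helper name `min_dist_other_category` and the column `min_dist` are made up for illustration, and it assumes each EventID / event_type / frame_id group is small enough that its full distance matrix fits in memory.

```r
# Example data from the question (only the columns needed here)
df <- data.frame(
  EventID = c(1003, 1003, 1003, 1003),
  event_type = c(893, 893, 893, 893),
  x = c(86.07, 72.4, 43.08, 80.13),
  y = c(35.58, 26.43, 34.8, 34.79),
  broad_category = c("set1", "set1", "set2", "set2"),
  frame_id = c(1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each point in one group, the minimum distance to
# a point whose category differs (Inf if no such point exists)
min_dist_other_category <- function(x, y, category) {
  d <- as.matrix(dist(cbind(x, y)))
  # Mask pairs that share a category; this also masks the diagonal
  d[outer(category, category, "==")] <- Inf
  apply(d, 1, min)
}

# Row indices per EventID / event_type / frame_id combination
groups <- split(seq_len(nrow(df)),
                list(df$EventID, df$event_type, df$frame_id),
                drop = TRUE)

# The loop runs once per group, not once per row
df$min_dist <- NA_real_
for (idx in groups) {
  df$min_dist[idx] <- min_dist_other_category(
    df$x[idx], df$y[idx], df$broad_category[idx]
  )
}
df$min_dist
```

On the example data this gives the same four values as above, because every row's nearest neighbour happens to lie in the other broad category.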

A potential way with data.table would be to do a Cartesian join, but it would need a lot of memory and would likely be slower:

library(data.table)
dt <- as.data.table(df)
dt[, ID := .I]

CJ.dt = function(X,Y) {
  stopifnot(is.data.table(X),is.data.table(Y))
  k = NULL
  X = X[, c(k=1, .SD)]
  setkey(X, k)
  Y = Y[, c(k=1, .SD)]
  setkey(Y, NULL)
  X[Y, allow.cartesian=TRUE][, k := NULL][]
}
CJ.dt(dt, dt)[ID != i.ID, min(sqrt((x - i.x)^2 + (y-i.y)^2)), by = i.ID]

   i.ID        V1
1:    1  5.992303
2:    2 11.386066
3:    3 30.491299
4:    4  5.992303

For the data.table Cartesian join, see here: R: data.table cross-join not working
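The full cross join pairs every row with every other row. A lighter variant is to self-join only on the grouping columns and then keep pairs whose broad_category differs. This is a sketch under assumptions rather than the answer's own method: `row_id`, `pairs`, and `min_dist` are illustrative names, and it still materialises every within-group pair, so memory grows with the square of the group size.

```r
library(data.table)

# Example data from the question
dt <- data.table(
  EventID = c(1003, 1003, 1003, 1003),
  event_type = c(893, 893, 893, 893),
  ID_number = c(80427, 2346, 24954, 27765),
  x = c(86.07, 72.4, 43.08, 80.13),
  y = c(35.58, 26.43, 34.8, 34.79),
  broad_category = c("set1", "set1", "set2", "set2"),
  frame_id = c(1, 1, 1, 1)
)
dt[, row_id := .I]

# Self-join on the grouping columns only. Columns coming from the second
# copy (the `i` table) that clash by name get the "i." prefix.
pairs <- dt[dt, on = .(EventID, event_type, frame_id), allow.cartesian = TRUE]

# Keep pairs from different broad categories, then take the minimum
# distance for each row of the second copy
res <- pairs[broad_category != i.broad_category,
             .(min_dist = min(sqrt((x - i.x)^2 + (y - i.y)^2))),
             by = .(row_id = i.row_id)]
setorder(res, row_id)
res
```

Rows that have no partner from another broad category in their group do not appear in `res`; a join back onto `dt` by `row_id` would restore them with NA.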

While I'm not sure about your criteria, it seems that you MUST use for loops in some way if you want to iterate. I'm sure others can provide you with Rcpp solutions that are very quick. In the meantime, here is one possible way with base R.

# In the future, please provide the code to create your example data
dat <- structure(list(EventID = c(1003L, 1003L, 1003L, 1003L), 
                  event_type = c(893L, 893L, 893L, 893L), 
                  ID_number = c(80427L, 2346L, 24954L, 27765L), 
                  x = c(86.07, 72.4, 43.08, 80.13), 
                  y = c(35.58, 26.43, 34.8, 34.79), 
                  identity_type = structure(c(1L, 3L, 2L, 2L), 
                                            .Label = c("A", "B", "C"), 
                                            class = "factor"), 
                  broad_category = structure(c(1L,  1L, 2L, 2L), 
                                             .Label = c("set1", "set2"), 
                                             class = "factor"), 
                  frame_id = c(1L,  1L, 1L, 1L)), 
             .Names = c("EventID", "event_type", "ID_number","x", "y", 
                        "identity_type", "broad_category", "frame_id"), 
             class = "data.frame", row.names = c("1", "2", "3", "4"))

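# Define your criteria here: rows are only compared within the same
# EventID / event_type / frame_id (the separator avoids ambiguous keys)
dat$groupID <- paste(dat$EventID, dat$event_type, dat$frame_id, sep = "_")
# made your function take two 2 dim vectors instead since that's simpler for passing in
two_dim_euclid = function(a, b) sqrt((a[1] - b[1])^2 + (a[2] - b[2])^2)

xy <- as.matrix(dat[, c("x", "y")])  # numeric matrix, so rows are plain vectors
n <- nrow(dat)
vec <- numeric(n)
for(i in 1:n){
  # candidates: same group, different broad category
  other <- dat$groupID == dat$groupID[i] & dat$broad_category != dat$broad_category[i]
  # minimum distance to any candidate, NA if there is none
  vec[i] = if(any(other)) min(apply(xy[other, , drop = FALSE], 1,
                                    function(r) two_dim_euclid(xy[i, ], r))) else NA_real_
  if(i%%10000 == 0) cat(i,"completed...\n") # Progress check since >1mil rows
}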
dat$result <- vec
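The question mentioned trying group_by with dplyr. A grouped mutate can express the same per-row minimum without an explicit loop over rows. The sketch below is one possible formulation, not the answerer's code: the helper `nearest_other` and the object `a_rows` are hypothetical names, and it assumes groups are small enough for an n-by-n distance matrix per group.

```r
library(dplyr)

# Example data from the question
dat <- data.frame(
  EventID = c(1003L, 1003L, 1003L, 1003L),
  event_type = c(893L, 893L, 893L, 893L),
  ID_number = c(80427L, 2346L, 24954L, 27765L),
  x = c(86.07, 72.4, 43.08, 80.13),
  y = c(35.58, 26.43, 34.8, 34.79),
  identity_type = c("A", "C", "B", "B"),
  broad_category = c("set1", "set1", "set2", "set2"),
  frame_id = c(1L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each point, the smallest distance to a point in
# the same group whose category differs (Inf when there is none)
nearest_other <- function(x, y, category) {
  d <- sqrt(outer(x, x, "-")^2 + outer(y, y, "-")^2)
  d[outer(category, category, "==")] <- Inf
  apply(d, 1, min)
}

a_rows <- dat %>%
  group_by(EventID, event_type, frame_id) %>%
  mutate(result = nearest_other(x, y, broad_category)) %>%
  ungroup() %>%
  filter(identity_type == "A")  # the question only needs identity type A
a_rows
```

Filtering after the mutate matters: rows of other identity types must stay in the group while distances are computed, since they are the candidates that A rows are measured against.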
