
Filter groups using a condition for each group in R

I have the following two data frames of user events:

data.favorite (user favorited item at time):

           user   item   time   event
1             1      A      2     fav
2             1      B      6     fav
3             2      D      9     fav
4             3      A      5     fav

data.view (user viewed item at time):

           user   item   time   event
1             1      A      1    view
2             1      A      3    view
3             1      B      4    view
4             1      B      5    view
5             1      B      7    view
6             1      C      8    view
7             3      A      2    view
8             3      A      9    view

I now want to keep only those events in data.view that occurred after that user favorited that item. E.g., row 1 of data.view would be removed, because user 1 favorited item A at time 2. The view event at time 3, however, would remain, because the user had already favorited the item at that point. So the result for this example should look like this:

           user   item   time   event
1             1      A      3    view
2             1      B      7    view
3             3      A      9    view

My current approach is way too slow. I apply a custom function to data.view:

wasFav = function(u, i, t) {
  favs = data.favorite %>% filter(user == u, item == i, time < t)
  return(nrow(favs) > 0)
}

Any ideas for a faster approach?

Using match with data frames called data.view and data.fav:

#Find indices of matching users&items
Indices <- match(paste(data.view$user, data.view$item), paste(data.fav$user, data.fav$item))

#add corresponding fav time to data.view:    
data.view$favtime <- data.fav$time[Indices] 

#only keep rows in which time is greater than fav.time:
data.view <- data.view[data.view$time>data.view$favtime & !is.na(data.view$favtime),] 
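
Note that match returns only the first matching position, so the code above relies on each user-item pair appearing at most once in the favorites data. As a minimal sketch of how that assumption could be relaxed, the block below first reduces the favorites to the earliest time per pair with aggregate and then applies the same match idea. The sample data is the one from the question; the names first_fav, key_view, key_fav, fav_time and result are illustrative and not part of the original answer.

```r
# Sample data from the question
data.favorite <- data.frame(
  user  = c(1, 1, 2, 3),
  item  = c("A", "B", "D", "A"),
  time  = c(2, 6, 9, 5),
  event = "fav"
)
data.view <- data.frame(
  user  = c(1, 1, 1, 1, 1, 1, 3, 3),
  item  = c("A", "A", "B", "B", "B", "C", "A", "A"),
  time  = c(1, 3, 4, 5, 7, 8, 2, 9),
  event = "view"
)

# Earliest favorite time per user-item pair (guards against repeated favorites)
first_fav <- aggregate(time ~ user + item, data = data.favorite, FUN = min)

# Build one lookup key per row and find each view's favorite time
key_view <- paste(data.view$user, data.view$item)
key_fav  <- paste(first_fav$user, first_fav$item)
fav_time <- first_fav$time[match(key_view, key_fav)]

# Keep views that happened strictly after the first favorite;
# pairs that were never favorited have fav_time NA and are dropped
result <- data.view[!is.na(fav_time) & data.view$time > fav_time, ]
print(result)
```

On the question's data this keeps the views at times 3, 7 and 9, matching the expected result.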

We can combine the two data frames, group by user and item, and then keep only the data.view event rows that occur after a fav. We use cumsum to count instances of fav and select all rows from the first instance of fav onward.

The first set of code is for illustration, so you can see what the method is doing. The second set of code does the filtering directly.

library(tidyverse)

data.favorite %>% bind_rows(data.view) %>%
  arrange(user, item, time) %>%
  group_by(user, item) %>%
  mutate(sequence = cumsum(event=="fav")) 

   user item time event sequence
1     1    A    1  view        0
2     1    A    2   fav        1
3     1    A    3  view        1
4     1    B    4  view        0
5     1    B    5  view        0
6     1    B    6   fav        1
7     1    B    7  view        1
8     1    C    8  view        0
9     2    D    9   fav        1
10    3    A    2  view        0
11    3    A    5   fav        1
12    3    A    9  view        1

data.favorite %>% bind_rows(data.view) %>%
  arrange(user, item, time) %>%
  group_by(user, item) %>%
  filter(cumsum(event=="fav") >= 1, event=="view")

  user item time event
1    1    A    3  view
2    1    B    7  view
3    3    A    9  view

I would join by user and item, assuming that every user-item pair occurs only once in data.favorite. You can then directly compare the view time with the time the item was favorited and discard all rows where the view is not later than the favorite (time_view <= time_fav):

data.view %>%
  left_join(data.favorite, by=c("user", "item"), suffix=c("_view","_fav")) %>%
  filter(time_view > time_fav)

ETA: that was before I learned about the 'non-equi joins' @Henrik mentions in the comments above. Those sound cool.
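
The comment itself is not reproduced here, so the following is only a sketch of what a non-equi join could look like for this problem, not necessarily what @Henrik proposed. It assumes dplyr 1.1.0 or later, which provides join_by() with inequality conditions. The helper names favs, fav_time and result are illustrative. As with the equi-join above, it assumes each user-item pair occurs at most once in data.favorite; otherwise a view could be returned more than once.

```r
library(dplyr)  # assumption: dplyr >= 1.1.0 for join_by()

# Sample data from the question
data.favorite <- data.frame(
  user  = c(1, 1, 2, 3),
  item  = c("A", "B", "D", "A"),
  time  = c(2, 6, 9, 5),
  event = "fav"
)
data.view <- data.frame(
  user  = c(1, 1, 1, 1, 1, 1, 3, 3),
  item  = c("A", "A", "B", "B", "B", "C", "A", "A"),
  time  = c(1, 3, 4, 5, 7, 8, 2, 9),
  event = "view"
)

# Rename the favorite time so the inequality reads unambiguously
favs <- data.favorite %>% select(user, item, fav_time = time)

# Match on user and item, and require the view to be later than the favorite
result <- data.view %>%
  inner_join(favs, by = join_by(user, item, time > fav_time)) %>%
  select(-fav_time)

print(result)
```

The join condition does the filtering, so no separate filter step is needed, and views of items that were never favorited drop out because inner_join keeps only matched rows.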
