

How to filter a column based on a condition from another column in R

I have a huge data table with millions of rows and dozens of columns, so performance is crucial for me. The data describes visits to a content site. I want to compute the ContentID of the earliest hit (i.e. the one with the minimum hit time) of each visit. What I did is:

dt[,.(FirstContentOfVisit=ContentID[ContentID != ""][which.min(HitTime)]), by=VisitId,.SDcols=c("ContentID","HitTime")]

The problem is that I don't know whether which.min computes the minimum over the whole HitTime vector (which I don't want!) or only over the filtered HitTime vector (the one corresponding to the non-empty ContentIDs).

In addition, once I have computed it, how can I get the minimal HitTime of the ContentIDs that differ from the first one (i.e. the earliest hit time of a non-first content ID)?

When I tried to do both steps with user-defined functions (first sort the sub data table, then extract the desired value), it took ages (and actually never finished), even though I have a very powerful virtual machine with 180 GB of RAM. So I'm looking for an inline solution.
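On the which.min question: in the expression above, HitTime is not filtered, so which.min(HitTime) returns a position in the full per-visit vector, and that position is then applied to the already shortened ContentID vector. The two can point at different hits. A minimal data.table sketch of one way around this is to filter in `i` before grouping. The toy data and the names first_hit, second_hit and FirstOtherHitTime are made up for illustration; only the column names come from the question.

```r
library(data.table)

# Hypothetical toy data; column names follow the question.
dt <- data.table(
  VisitId   = c(1, 1, 1, 1, 2, 2, 2),
  ContentID = c("", "a", "b", "a", "x", "", "x"),
  HitTime   = c(1, 5, 3, 7, 2, 1, 4)
)

# Drop empty ContentIDs in `i` first, so which.min() only sees the HitTime
# values of the rows that remain in each visit.
first_hit <- dt[ContentID != "",
                .(FirstContentOfVisit = ContentID[which.min(HitTime)]),
                by = VisitId]

# Join the first content back onto the non-empty hits, keep the hits whose
# ContentID differs from it, and take their earliest HitTime per visit.
second_hit <- dt[ContentID != ""][first_hit, on = "VisitId"][
  ContentID != FirstContentOfVisit,
  .(FirstOtherHitTime = min(HitTime)),
  by = VisitId]
```

Visits that only ever show one distinct ContentID have no row in second_hit. This has not been timed on data of the size described in the question.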

dplyr makes this much easier. You didn't share a sample of your data, but I assume the variables of interest look something like this.

library(dplyr)

# Simulated hits: 1000 distinct dates spread at random over 100 sessions
web <- tibble(
  HitTime = sample(seq(as.Date('2010/01/01'), as.Date('2017/02/23'), by="day"), 1000),
  ContentID = 1:1000,
  SessionID = sample(1:100, 1000, replace = TRUE)
)

Then you can just use group_by and summarise to find the earliest value of HitTime for each SessionID.

web %>%
  group_by(SessionID) %>%
  summarise(HitTime = min(HitTime))
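The summary above returns the earliest HitTime per session, not the ContentID that the question asks for. A hedged sketch of one way to get the ContentID itself, ignoring empty IDs, follows. It assumes dplyr 1.0.0 or later for slice_min; the toy data and the names visits and first_content are invented for illustration.

```r
library(dplyr)

# Hypothetical toy data using the question's column names.
visits <- tibble(
  VisitId   = c(1, 1, 1, 2, 2),
  ContentID = c("", "a", "b", "x", ""),
  HitTime   = as.Date(c("2017-01-01", "2017-01-05", "2017-01-03",
                        "2017-01-02", "2017-01-01"))
)

first_content <- visits %>%
  filter(ContentID != "") %>%                        # drop hits without a ContentID
  group_by(VisitId) %>%
  slice_min(HitTime, n = 1, with_ties = FALSE) %>%   # earliest remaining hit per visit
  ungroup() %>%
  select(VisitId, FirstContentOfVisit = ContentID)
```

Filtering before grouping keeps the HitTime and ContentID columns aligned, which avoids the indexing ambiguity raised in the question.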
