How to filter a column based on a condition on another column in R

Question

I have a huge data table with millions of rows and dozens of columns, so performance is crucial for me. The data describes visits to a content site. I want to compute the ContentID of the earliest hit (i.e. the hit with the minimum HitTime) of each visit. What I did is: dt[,.(FirstContentOfVisit=ContentID[ContentID != ""][which.min(HitTime)]), by=VisitId,.SDcols=c("ContentID","HitTime")]

The problem is that I don't know whether which.min computes the minimum over the whole HitTime vector (which I don't want!) or only over the filtered HitTime vector (the entries corresponding to the non-empty ContentID values).

In addition, once I have computed it, how can I get the minimal HitTime of the ContentIDs that differ from the first one (i.e. the earliest hit time of the non-first content ID)?

When I tried to do both steps with user-defined functions (first sort the sub data table, then extract the desired value), it took ages (and actually never finished), even though I have a very strong (virtual) machine with 180 GB of RAM. So I'm looking for an inline solution.

Answer 1

dplyr makes this much easier. You didn't share a sample of your data, but I assume the variables of interest look something like this.

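library(dplyr)  # provides tibble(), %>%, group_by() and summarise()

# Simulated data: 1000 hits spread over 100 sessions
web <- tibble(
  HitTime = sample(seq(as.Date('2010/01/01'), as.Date('2017/02/23'), by="day"), 1000),
  ContentID = 1:1000,
  SessionID = sample(1:100, 1000, replace = TRUE)
)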

Then you can just use group_by and summarise to find the earliest value of HitTime for each SessionID.

web %>%
  group_by(SessionID) %>%
  summarise(HitTime = min(HitTime))

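The summary above returns the earliest HitTime, not the ContentID the question asks for. On the which.min point: in an expression like ContentID[ContentID != ""][which.min(HitTime)], which.min sees the whole unfiltered HitTime vector, so its index can point at the wrong element of the filtered ContentID vector. Below is a minimal sketch of one way to handle both parts of the question. It drops the empty IDs first so that both vectors stay aligned. The toy table `visits` and the output column names FirstHitTime and NextHitTime are made up for illustration, and the speed on millions of rows is untested.

```r
library(dplyr)

# Base R check: which.min() uses the unfiltered HitTime, so the index is off.
ContentID <- c("", "a", "b")
HitTime   <- c(1, 3, 2)
naive <- ContentID[ContentID != ""][which.min(HitTime)]  # index 1 of c("a", "b")

# Hypothetical toy data: two visits, one hit with an empty ContentID.
visits <- tibble(
  VisitId   = c(1, 1, 1, 2, 2),
  ContentID = c("", "a", "b", "c", "d"),
  HitTime   = c(1, 3, 2, 5, 4)
)

first_content <- visits %>%
  filter(ContentID != "") %>%   # filter rows first, so both columns stay aligned
  group_by(VisitId) %>%
  summarise(
    # ContentID of the earliest remaining hit in the visit
    FirstContentOfVisit = ContentID[which.min(HitTime)],
    FirstHitTime        = min(HitTime),
    # earliest hit time among content IDs other than the first one
    # (min() of an empty vector gives Inf with a warning if there is none)
    NextHitTime         = min(HitTime[ContentID != FirstContentOfVisit])
  )

print(naive)
print(first_content)
```

The same filter-first idea should carry over to data.table by putting the condition in the row argument, along the lines of dt[ContentID != "", .(FirstContentOfVisit = ContentID[which.min(HitTime)]), by = VisitId], which avoids user-defined functions and per-group sorting.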