Fast way to match data between two data frames [R]

Question

I have two data frames: df_workingFile and df_groupIDs

df_workingFile:

ID | GroupID | Sales | Date
v  | a1      |  1    |  2011
w  | a1      |  3    |  2010
x  | b1      |  8    |  2007
y  | b1      |  3    |  2006
z  | c3      |  2    |  2006

df_groupIDs:

GroupID | numIDs  | MaxSales 
a1      | 2       |  3       
b1      | 2       |  8       
c3      | 1       |  2

For each row of df_groupIDs, I want to get the ID and Date of the event with the maximum sales in that group. For example, group "a1" has 2 events in df_workingFile, "v" and "w". I want to identify that event "w" has the maximum Sales value and bring its information into df_groupIDs. The final output should look like this:

GroupID | numIDs  | MaxSales | ID | Date
a1      | 2       |  3       | w  | 2010
b1      | 2       |  8       | x  | 2007
c3      | 1       |  2       | z  | 2006

Now here's the problem. I wrote code that does this, but it's very inefficient and takes forever to process when I deal with datasets of 50-100K rows. I need help figuring out how to rewrite my code to be more efficient. Here's what I currently have:

i = 1
for (groupID in df_groupIDs$GroupID) {

    # all events belonging to the current group
    groupEvents <- subset(df_workingFile, df_workingFile$GroupID == groupID)
    # position of the first event whose Sales equals the group's MaxSales
    index <- match(df_groupIDs$MaxSales[i], groupEvents$Sales)
    df_groupIDs$ID[i] = groupEvents$ID[index]
    df_groupIDs$Date[i] = groupEvents$Date[index]

    i = i+1
}

Answer 1

Using dplyr:

library(dplyr)

df_workingFile %>% 
  group_by(GroupID) %>%                         # for each group id
  arrange(desc(Sales)) %>%                      # sort by Sales (descending)
  slice(1) %>%                                  # keep the top row
  inner_join(df_groupIDs, by = "GroupID") %>%   # join to df_groupIDs
  select(GroupID, numIDs, MaxSales, ID, Date)   # keep the columns you want in the order you want
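
For readers who prefer not to depend on dplyr, the same sort-then-keep-first idea can be sketched in base R. This is only an illustration, not part of the original answer: the data frames are re-created inline from the question's sample tables, and the intermediate names sorted, top and result are made up for the example.

```r
# Sample data re-created from the question (assumed column types)
df_workingFile <- data.frame(
  ID      = c("v", "w", "x", "y", "z"),
  GroupID = c("a1", "a1", "b1", "b1", "c3"),
  Sales   = c(1, 3, 8, 3, 2),
  Date    = c(2011, 2010, 2007, 2006, 2006),
  stringsAsFactors = FALSE
)
df_groupIDs <- data.frame(
  GroupID  = c("a1", "b1", "c3"),
  numIDs   = c(2, 2, 1),
  MaxSales = c(3, 8, 2),
  stringsAsFactors = FALSE
)

# Sort by group, then by Sales descending, so each group's top row comes first
sorted <- df_workingFile[order(df_workingFile$GroupID, -df_workingFile$Sales), ]

# Keep only the first row seen for each GroupID (hypothetical helper name: top)
top <- sorted[!duplicated(sorted$GroupID), c("GroupID", "ID", "Date")]

# Attach ID and Date to the group table in a single vectorized join
result <- merge(df_groupIDs, top, by = "GroupID")
print(result)
```

The whole table is sorted and joined once, instead of running subset() over the full data frame for every group as the loop in the question does.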

A simpler alternative, if the Sales values are integers (so that equality testing against the MaxSales column is reliable):

inner_join(df_groupIDs, df_workingFile,
           by = c("GroupID" = "GroupID", "MaxSales" = "Sales"))
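
One thing to keep in mind with a join on the maximum value: if two events in the same group tie for the maximum, the join returns both of them. The sketch below illustrates this on made-up data (the events and groups data frames are hypothetical, with "x" and "y" tied in group b1). It uses base R's merge() as a stand-in for inner_join(), on the assumption that both perform the same equi-join here.

```r
# Hypothetical data: x and y tie for the maximum in group b1
events <- data.frame(
  ID      = c("v", "w", "x", "y"),
  GroupID = c("a1", "a1", "b1", "b1"),
  Sales   = c(1, 3, 8, 8),
  Date    = c(2011, 2010, 2007, 2006),
  stringsAsFactors = FALSE
)
groups <- data.frame(
  GroupID  = c("a1", "b1"),
  numIDs   = c(2, 2),
  MaxSales = c(3, 8),
  stringsAsFactors = FALSE
)

# Equi-join on GroupID and MaxSales == Sales
joined <- merge(groups, events,
                by.x = c("GroupID", "MaxSales"),
                by.y = c("GroupID", "Sales"))
print(joined)  # b1 appears once per tied event

# One possible way to force a single row per group: keep the first match
one_per_group <- joined[!duplicated(joined$GroupID), ]
print(one_per_group)
```

Whether keeping the first match is acceptable depends on the data; with ties, some explicit rule (earliest Date, for instance) is probably a better choice than relying on row order.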

Answer 2

This relies on a SQLite feature: when max() is the only aggregate in a query, the other (non-aggregated) columns in the select list take their values from the row in which the maximum was found.

library(sqldf)

sqldf("select g.GroupID, g.numIDs, max(w.Sales) MaxSales, w.ID, w.Date 
       from df_groupIDs g left join df_workingFile w using(GroupID) 
       group by GroupID")

giving:

  GroupID numIDs MaxSales ID Date
1      a1      2        3  w 2010
2      b1      2        8  x 2007
3      c3      1        2  z 2006

Note: The two input data frames in reproducible form are:

Lines1 <- "
ID | GroupID | Sales | Date
v  | a1      |  1    |  2011
w  | a1      |  3    |  2010
x  | b1      |  8    |  2007
y  | b1      |  3    |  2006
z  | c3      |  2    |  2006"
df_workingFile <- read.table(text = Lines1, header = TRUE, sep = "|", strip.white = TRUE)

Lines2 <- "
GroupID | numIDs  | MaxSales 
a1      | 2       |  3       
b1      | 2       |  8       
c3      | 1       |  2"      

df_groupIDs <- read.table(text = Lines2, header = TRUE, sep = "|", strip.white = TRUE)

Answer 1: score 4, accepted, 2017-08-04 23:48:58
Answer 2: score 1, 2017-08-05 00:02:46
