Quick way of matching data between two dataframes [R]

I have two dataframes: df_workingFile and df_groupIDs

df_workingFile:

ID | GroupID | Sales | Date
v  | a1      |  1    |  2011
w  | a1      |  3    |  2010
x  | b1      |  8    |  2007
y  | b1      |  3    |  2006
z  | c3      |  2    |  2006

df_groupIDs:

GroupID | numIDs  | MaxSales 
a1      | 2       |  3       
b1      | 2       |  8       
c3      | 1       |  2      

For df_groupIDs, I want to get the ID and Date of the event with the max sales in that group. So group "a1" has 2 events in df_workingFile, "v" and "w". I want to identify that event "w" has the max sales value and bring its information into df_groupIDs. The final output should look like this:

GroupID | numIDs  | MaxSales | ID | Date
a1      | 2       |  3       | w  | 2010
b1      | 2       |  8       | x  | 2007
c3      | 1       |  2       | z  | 2006

Now here's the problem. I wrote code that does this, but it's very inefficient and takes forever to process when I deal with datasets of 50-100K rows. I need help figuring out how to rewrite my code to be more efficient. Here's what I currently have:

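# Slow: df_workingFile is re-subset once per group
i = 1
for (groupID in df_groupIDs$GroupID) {

    # rows of the working file that belong to the current group
    groupEvents <- subset(df_workingFile, df_workingFile$GroupID == groupID)
    # position of the first event whose Sales equals the group's MaxSales
    index <- match(df_groupIDs$MaxSales[i], groupEvents$Sales)
    df_groupIDs$ID[i] = groupEvents$ID[index]
    df_groupIDs$Date[i] = groupEvents$Date[index]

    i = i+1
}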

Using dplyr:

library(dplyr)

df_workingFile %>%
  group_by(GroupID) %>%                          # for each group id
  arrange(desc(Sales)) %>%                       # sort by Sales (descending)
  slice(1) %>%                                   # keep the top row of each group
  ungroup() %>%                                  # drop the grouping before joining
  inner_join(df_groupIDs, by = "GroupID") %>%    # join to df_groupIDs
  select(GroupID, numIDs, MaxSales, ID, Date)    # keep the columns you want, in the order you want
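
For readers without dplyr, the same "sort descending, keep the first row per group, then join" idea can be sketched in base R. This is only an illustrative sketch, not part of the original answer: it rebuilds the example data from the question with data.frame(), and the intermediate names sorted, top and result are made up for the example. If two events tie for the maximum, it keeps whichever comes first after sorting.

```r
# Example data copied from the question
df_workingFile <- data.frame(
  ID      = c("v", "w", "x", "y", "z"),
  GroupID = c("a1", "a1", "b1", "b1", "c3"),
  Sales   = c(1, 3, 8, 3, 2),
  Date    = c(2011, 2010, 2007, 2006, 2006),
  stringsAsFactors = FALSE
)
df_groupIDs <- data.frame(
  GroupID  = c("a1", "b1", "c3"),
  numIDs   = c(2, 2, 1),
  MaxSales = c(3, 8, 2),
  stringsAsFactors = FALSE
)

# Sort by group, then by Sales descending within each group
sorted <- df_workingFile[order(df_workingFile$GroupID, -df_workingFile$Sales), ]

# duplicated() flags every row after the first of each GroupID,
# so negating it keeps only the top-selling row per group
top <- sorted[!duplicated(sorted$GroupID), ]

# Attach ID and Date of the top row to the group table
result <- merge(df_groupIDs, top[, c("GroupID", "ID", "Date")], by = "GroupID")
print(result)
```

Because the working file is sorted once rather than subset once per group, this avoids the repeated scans in the original loop.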

Another, simpler method works if the Sales are integers (so they can be reliably compared for equality with the MaxSales column):

inner_join(df_groupIDs, df_workingFile,
           by = c("GroupID" = "GroupID", "MaxSales" = "Sales"))
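
The same two-key match can also be sketched with base R's merge(), shown here as a self-contained example. It assumes the column names and sample data from the question; result is just an illustrative name. As with the join above, a group in which several events share the maximum Sales value would produce one output row per tied event.

```r
# Example data copied from the question
df_workingFile <- data.frame(
  ID      = c("v", "w", "x", "y", "z"),
  GroupID = c("a1", "a1", "b1", "b1", "c3"),
  Sales   = c(1, 3, 8, 3, 2),
  Date    = c(2011, 2010, 2007, 2006, 2006),
  stringsAsFactors = FALSE
)
df_groupIDs <- data.frame(
  GroupID  = c("a1", "b1", "c3"),
  numIDs   = c(2, 2, 1),
  MaxSales = c(3, 8, 2),
  stringsAsFactors = FALSE
)

# Match on GroupID and on MaxSales (group table) == Sales (working file)
result <- merge(df_groupIDs, df_workingFile,
                by.x = c("GroupID", "MaxSales"),
                by.y = c("GroupID", "Sales"))

# merge() places the key columns first, so restore the layout from the question
result <- result[, c("GroupID", "numIDs", "MaxSales", "ID", "Date")]
print(result)
```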

This relies on an SQLite feature: when a query uses a single max() aggregate, the other selected columns are automatically taken from the row that the maximum came from.

library(sqldf)

sqldf("select g.GroupID, g.numIDs, max(w.Sales) MaxSales, w.ID, w.Date 
       from df_groupIDs g left join df_workingFile w using(GroupID) 
       group by GroupID")

giving:

  GroupID numIDs MaxSales ID Date
1      a1      2        3  w 2010
2      b1      2        8  x 2007
3      c3      1        2  z 2006

Note: The two input data frames, in reproducible form, are:

Lines1 <- "
ID | GroupID | Sales | Date
v  | a1      |  1    |  2011
w  | a1      |  3    |  2010
x  | b1      |  8    |  2007
y  | b1      |  3    |  2006
z  | c3      |  2    |  2006"
df_workingFile <- read.table(text = Lines1, header = TRUE, sep = "|", strip.white = TRUE)

Lines2 <- "
GroupID | numIDs  | MaxSales 
a1      | 2       |  3       
b1      | 2       |  8       
c3      | 1       |  2"      

df_groupIDs <- read.table(text = Lines2, header = TRUE, sep = "|", strip.white = TRUE)

