Quick way of matching data between two dataframes [R]
I have two dataframes: df_workingFile and df_groupIDs.
df_workingFile:
ID | GroupID | Sales | Date
v | a1 | 1 | 2011
w | a1 | 3 | 2010
x | b1 | 8 | 2007
y | b1 | 3 | 2006
z | c3 | 2 | 2006
df_groupIDs:
GroupID | numIDs | MaxSales
a1 | 2 | 3
b1 | 2 | 8
c3 | 1 | 2
For df_groupIDs, I want to get the ID and Date of the event with the max sales in that group. So group "a1" has 2 events in df_workingFile, "v" and "w". I want to identify that event "w" has the max Sales value and bring its information into df_groupIDs. The final output should look like this:
GroupID | numIDs | MaxSales | ID | Date
a1 | 2 | 3 | w | 2010
b1 | 2 | 8 | x | 2007
c3 | 1 | 2 | z | 2006
Now here's the problem. I wrote code that does this, but it's very inefficient and takes forever to process when I deal with datasets of 50-100K rows. I need help figuring out how to rewrite my code to be more efficient. Here's what I currently have:
i = 1
for (groupID in df_groupIDs$GroupID) {
  # all events belonging to the current group
  groupEvents <- subset(df_workingFile, df_workingFile$GroupID == groupID)
  # position of the first event whose Sales equals the group's MaxSales
  index <- match(df_groupIDs$MaxSales[i], groupEvents$Sales)
  df_groupIDs$ID[i] = groupEvents$ID[index]
  df_groupIDs$Date[i] = groupEvents$Date[index]
  i = i+1
}
Using dplyr:
library(dplyr)
df_workingFile %>%
  group_by(GroupID) %>%                        # for each group id
  arrange(desc(Sales)) %>%                     # sort by Sales (descending)
  slice(1) %>%                                 # keep the top row
  inner_join(df_groupIDs, by = "GroupID") %>%  # join to df_groupIDs
  select(GroupID, numIDs, MaxSales, ID, Date)  # keep the columns you want in the order you want
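As a variation on the pipeline above, newer dplyr versions (1.0.0 and later) provide slice_max(), which picks the top row per group without a separate sort. The following is a minimal sketch, not the original answer's code; it rebuilds the question's sample data so it runs on its own, and the name result is only illustrative. with_ties = FALSE is an assumption that one row per group is wanted even when several events share the maximum.

```r
library(dplyr)

# Sample data copied from the question
df_workingFile <- data.frame(
  ID      = c("v", "w", "x", "y", "z"),
  GroupID = c("a1", "a1", "b1", "b1", "c3"),
  Sales   = c(1, 3, 8, 3, 2),
  Date    = c(2011, 2010, 2007, 2006, 2006),
  stringsAsFactors = FALSE
)
df_groupIDs <- data.frame(
  GroupID  = c("a1", "b1", "c3"),
  numIDs   = c(2, 2, 1),
  MaxSales = c(3, 8, 2),
  stringsAsFactors = FALSE
)

result <- df_workingFile %>%
  group_by(GroupID) %>%
  slice_max(Sales, n = 1, with_ties = FALSE) %>%  # one top-Sales row per group
  ungroup() %>%
  inner_join(df_groupIDs, by = "GroupID") %>%     # bring in numIDs and MaxSales
  select(GroupID, numIDs, MaxSales, ID, Date)

print(result)
```

Setting with_ties = TRUE instead would keep every event that shares the group maximum.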
Another, simpler method, if the Sales values are integers (and can thus be relied on for equality testing against the MaxSales column):
inner_join(df_groupIDs, df_workingFile,
by = c("GroupID" = "GroupID", "MaxSales" = "Sales"))
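The same two-key match can also be sketched in base R with merge(), which may be handy when dplyr is not available. This is an illustration under the same assumption as above (Sales and MaxSales compare exactly equal); the sample data are copied from the question and result is just an illustrative name.

```r
# Sample data copied from the question
df_workingFile <- data.frame(
  ID      = c("v", "w", "x", "y", "z"),
  GroupID = c("a1", "a1", "b1", "b1", "c3"),
  Sales   = c(1, 3, 8, 3, 2),
  Date    = c(2011, 2010, 2007, 2006, 2006),
  stringsAsFactors = FALSE
)
df_groupIDs <- data.frame(
  GroupID  = c("a1", "b1", "c3"),
  numIDs   = c(2, 2, 1),
  MaxSales = c(3, 8, 2),
  stringsAsFactors = FALSE
)

# Match each group row to the event whose Sales equals the group's MaxSales
result <- merge(df_groupIDs, df_workingFile,
                by.x = c("GroupID", "MaxSales"),
                by.y = c("GroupID", "Sales"))

# Put the columns in the order shown in the question
result <- result[, c("GroupID", "numIDs", "MaxSales", "ID", "Date")]
print(result)
```

Note that, like the inner_join version, this returns one row per matching event, so a group in which two events tie for the maximum would appear twice.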
This makes use of a SQLite feature: when max() is used in an aggregate query, the other selected columns are automatically taken from the row that the maximum came from.
library(sqldf)
sqldf("select g.GroupID, g.numIDs, max(w.Sales) MaxSales, w.ID, w.Date
from df_groupIDs g left join df_workingFile w using(GroupID)
group by GroupID")
giving:
GroupID numIDs MaxSales ID Date
1 a1 2 3 w 2010
2 b1 2 8 x 2007
3 c3 1 2 z 2006
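For comparison, the idea of "carry along the rest of the row that holds the maximum" can be approximated in base R by sorting and keeping the first row of each group. This is only a sketch of an alternative, not part of the sqldf answer; the names ordered, top and result are illustrative, and the sample data are copied from the question.

```r
# Sample data copied from the question
df_workingFile <- data.frame(
  ID      = c("v", "w", "x", "y", "z"),
  GroupID = c("a1", "a1", "b1", "b1", "c3"),
  Sales   = c(1, 3, 8, 3, 2),
  Date    = c(2011, 2010, 2007, 2006, 2006),
  stringsAsFactors = FALSE
)
df_groupIDs <- data.frame(
  GroupID  = c("a1", "b1", "c3"),
  numIDs   = c(2, 2, 1),
  MaxSales = c(3, 8, 2),
  stringsAsFactors = FALSE
)

# Sort by group, then by Sales descending, so each group's best event comes first
ordered <- df_workingFile[order(df_workingFile$GroupID, -df_workingFile$Sales), ]

# Keep only the first (highest-Sales) row of each group
top <- ordered[!duplicated(ordered$GroupID), c("GroupID", "ID", "Date")]

# Attach ID and Date to the group table
result <- merge(df_groupIDs, top, by = "GroupID")
print(result)
```

Because the whole table is sorted once rather than subset once per group, this avoids the per-group loop from the question.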
Note: The two input data frames, shown reproducibly, are:
Lines1 <- "
ID | GroupID | Sales | Date
v | a1 | 1 | 2011
w | a1 | 3 | 2010
x | b1 | 8 | 2007
y | b1 | 3 | 2006
z | c3 | 2 | 2006"
df_workingFile <- read.table(text = Lines1, header = TRUE, sep = "|", strip.white = TRUE)
Lines2 <- "
GroupID | numIDs | MaxSales
a1 | 2 | 3
b1 | 2 | 8
c3 | 1 | 2"
df_groupIDs <- read.table(text = Lines2, header = TRUE, sep = "|", strip.white = TRUE)