I am trying to port my data manipulation code from dplyr to data.table for speed reasons. I am almost there but missing the final step.
I have some sample data to replicate my problem.
library(data.table)
c_dt = data.table(u_id=rep(c("u1", "u2"), each=5),
                  p_id=c("p1", "p1", "p1", "p2", "p2", "p1", "p2", "p2", "p2", "p2"),
                  c_dt=c("2015-12-01", "2015-12-02", "2015-12-03", "2015-12-02",
                         "2015-12-05", "2015-12-02", "2015-12-03", "2015-12-04",
                         "2015-12-05", "2015-12-06"))
I wish to identify the rows where the combination of u_id and p_id is duplicated, and keep only the row with the minimum c_dt (essentially the first instance). I use the following dplyr code for this:
library(dplyr)
c_df <- as.data.frame(c_dt)
cdedup_df <- c_df %>% group_by(p_id, u_id) %>% filter(c_dt == min(c_dt))
This gives the output below:
> cdedup_df
Source: local data frame [4 x 3]
Groups: p_id, u_id
u_id p_id c_dt
1 u1 p1 2015-12-01
2 u1 p2 2015-12-02
3 u2 p1 2015-12-02
4 u2 p2 2015-12-03
I have the following data.table code that correctly identifies the required rows, but I am unable to figure out how to filter on that result and keep the matching rows as they are.
cdedup_dt <- c_dt[,c_dt == min(c_dt),by = list(u_id, p_id)]
cdedup_dt
u_id p_id V1
1: u1 p1 TRUE
2: u1 p1 FALSE
3: u1 p1 FALSE
4: u1 p2 TRUE
5: u1 p2 FALSE
6: u2 p1 TRUE
7: u2 p2 TRUE
8: u2 p2 FALSE
9: u2 p2 FALSE
10: u2 p2 FALSE
Something like this should do the trick:
c_dt[, list(c_dt=min(c_dt)), by=list(u_id, p_id)]
## u_id p_id c_dt
## 1: u1 p1 2015-12-01
## 2: u1 p2 2015-12-02
## 3: u2 p1 2015-12-02
## 4: u2 p2 2015-12-03
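One caveat with aggregating via min: the result contains only the grouping columns and c_dt, so any other columns in the table would be dropped. Below is a minimal sketch of one way to keep whole rows, using data.table's .I row-index symbol. The val column and the names dt, idx and res are hypothetical, added only to show that extra columns survive.

```r
library(data.table)

dt <- data.table(
  u_id = rep(c("u1", "u2"), each = 5),
  p_id = c("p1", "p1", "p1", "p2", "p2", "p1", "p2", "p2", "p2", "p2"),
  c_dt = c("2015-12-01", "2015-12-02", "2015-12-03", "2015-12-02",
           "2015-12-05", "2015-12-02", "2015-12-03", "2015-12-04",
           "2015-12-05", "2015-12-06"),
  val  = 1:10  # hypothetical extra column, not in the original data
)

# .I holds the row numbers of the current group within dt;
# take the first row number whose c_dt equals the group minimum
idx <- dt[, .I[c_dt == min(c_dt)][1L], by = .(u_id, p_id)]$V1

# subset the full table by those row numbers, keeping every column
res <- dt[idx]
res
```

Because the subset happens once on the whole table rather than once per group, this form is often suggested when the table has many columns.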
Below is my approach. I would expect it to scale better for big datasets, as there is no min by group, just a single sort, which data.table makes very efficient, followed by taking the first row of each group.
setorderv(c_dt, "c_dt")[, .SD[1L], .(u_id, p_id)]
# in data.table 1.9.7+ you can also use `head`
setorderv(c_dt, "c_dt")[, head(.SD, 1L), .(u_id, p_id)]
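Note that setorderv reorders its input by reference, so the table stays sorted by c_dt afterwards. If that side effect is unwanted, here is a sketch of two alternatives that leave the input untouched. The names dt, before, first_sd and first_unique are illustrative only.

```r
library(data.table)

dt <- data.table(
  u_id = rep(c("u1", "u2"), each = 5),
  p_id = c("p1", "p1", "p1", "p2", "p2", "p1", "p2", "p2", "p2", "p2"),
  c_dt = c("2015-12-01", "2015-12-02", "2015-12-03", "2015-12-02",
           "2015-12-05", "2015-12-02", "2015-12-03", "2015-12-04",
           "2015-12-05", "2015-12-06")
)
before <- copy(dt)  # snapshot to confirm dt is not modified

# order() in i reorders only the rows fed into the grouping step;
# dt itself keeps its original row order
first_sd <- dt[order(c_dt), .SD[1L], by = .(u_id, p_id)]

# unique() with by= keeps the first row of each (u_id, p_id) combination
first_unique <- unique(dt[order(c_dt)], by = c("u_id", "p_id"))

all.equal(dt, before)
```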
The code below validates this against the other current answers.
If the OP provides a big dataset, I can add benchmarks.
library(data.table)
c_dt = data.table(u_id=rep(c("u1", "u2"),each=5), p_id=c("p1", "p1", "p1", "p2","p2", "p1", "p2", "p2", "p2", "p2" ), c_dt=c("2015-12-01", "2015-12-02", "2015-12-03", "2015-12-02", "2015-12-05", "2015-12-02", "2015-12-03", "2015-12-04", "2015-12-05", "2015-12-06"))
zero = c_dt[, list(c_dt=min(c_dt)), by=list(u_id, p_id)]
ananda = c_dt[, list(c_dt = c_dt[c_dt == min(c_dt)]), by = .(u_id, p_id)]
tal = c_dt[, .SD[rank(c_dt, ties.method = c("first")) == 1],by = .(u_id, p_id)]
all.equal(zero, ananda)
#[1] TRUE
all.equal(ananda, tal)
#[1] TRUE
jan = setorderv(c_dt, "c_dt")[, .SD[1L], .(u_id, p_id)]
all.equal(tal, jan)
#[1] TRUE
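Until a real dataset is available, a rough timing comparison can be sketched on synthetic data. Everything about this data (row count, id formats, date range) is an assumption for illustration, and timings vary by machine and data.table version, so no results are shown here.

```r
library(data.table)

set.seed(1)
n <- 1e5  # assumed size, adjust as needed

# synthetic data: 100 users x 50 products, dates within 2015
big <- data.table(
  u_id = sample(sprintf("u%03d", 1:100), n, replace = TRUE),
  p_id = sample(sprintf("p%02d", 1:50), n, replace = TRUE),
  c_dt = as.character(as.Date("2015-01-01") + sample(0:364, n, replace = TRUE))
)

# min by group
t_min <- system.time(
  a <- big[, list(c_dt = min(c_dt)), by = list(u_id, p_id)]
)

# single sort, then first row by group (on a copy, so big is not reordered)
big_copy <- copy(big)
t_sort <- system.time(
  b <- setorderv(big_copy, "c_dt")[, .SD[1L], .(u_id, p_id)]
)

# group order differs between the two results, so key both before comparing
setkey(a, u_id, p_id)
setkey(b, u_id, p_id)
same <- all.equal(a, b)

print(t_min)
print(t_sort)
print(same)
```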
So indeed you are very close. All you were missing is to pass .SD in the j argument. Let's see how it works:
library(data.table)
c_dt = data.table(u_id=rep(c("u1", "u2"),each=5),
p_id=c("p1", "p1", "p1", "p2","p2", "p1", "p2", "p2", "p2", "p2" ),
c_dt=c("2015-12-01", "2015-12-02",
"2015-12-03", "2015-12-02", "2015-12-05",
"2015-12-02", "2015-12-03", "2015-12-04",
"2015-12-05", "2015-12-06"))
c_dt
u_id p_id c_dt
1: u1 p1 2015-12-01
2: u1 p1 2015-12-02
3: u1 p1 2015-12-03
4: u1 p2 2015-12-02
5: u1 p2 2015-12-05
6: u2 p1 2015-12-02
7: u2 p2 2015-12-03
8: u2 p2 2015-12-04
9: u2 p2 2015-12-05
10: u2 p2 2015-12-06
Now we will group by u_id and p_id and filter by the minimum value of c_dt:
cdedup_dt <- c_dt[ , .SD[c_dt == min(c_dt)], by = .(u_id, p_id)]
cdedup_dt
u_id p_id c_dt
1: u1 p1 2015-12-01
2: u1 p2 2015-12-02
3: u2 p1 2015-12-02
4: u2 p2 2015-12-03
Note that .(u_id, p_id) is equivalent to list(u_id, p_id), and .SD refers to the Subset of the Data.table for each group. That .SD was all you were missing.
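To make the role of .SD more concrete, here is a small sketch that reports, for each group, how many rows .SD holds and which columns it contains. The names dt, info, n and cols are illustrative only.

```r
library(data.table)

dt <- data.table(
  u_id = rep(c("u1", "u2"), each = 5),
  p_id = c("p1", "p1", "p1", "p2", "p2", "p1", "p2", "p2", "p2", "p2"),
  c_dt = c("2015-12-01", "2015-12-02", "2015-12-03", "2015-12-02",
           "2015-12-05", "2015-12-02", "2015-12-03", "2015-12-04",
           "2015-12-05", "2015-12-06")
)

# For every (u_id, p_id) group, .SD is a data.table of that group's rows;
# by default it holds all columns except the grouping columns
info <- dt[, .(n = nrow(.SD), cols = paste(names(.SD), collapse = ",")),
           by = .(u_id, p_id)]
info
```

So .SD[c_dt == min(c_dt)] is an ordinary row subset applied to each of these per-group tables in turn.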
As mentioned by @zero323, filtering on min will keep ties: if several rows in a group share the minimum c_dt, all of them are returned. If you only wish to keep one record for each group, a safer bet is to use the rank function:
cdedup_dt <- c_dt[, .SD[rank(c_dt, ties.method = c("first")) == 1],by = .(u_id, p_id)]
cdedup_dt
u_id p_id c_dt
1: u1 p1 2015-12-01
2: u1 p2 2015-12-02
3: u2 p1 2015-12-02
4: u2 p2 2015-12-03
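The sample data has no tied minimum, so both versions return the same four rows. A minimal sketch with a made-up tie (the data and the names tied, with_ties and one_row are hypothetical) shows where they differ:

```r
library(data.table)

# hypothetical data: the minimum date appears twice in the same group
tied <- data.table(
  u_id = c("u1", "u1", "u1"),
  p_id = c("p1", "p1", "p1"),
  c_dt = c("2015-12-01", "2015-12-01", "2015-12-02")
)

# equality with min() keeps every row that matches the minimum
with_ties <- tied[, .SD[c_dt == min(c_dt)], by = .(u_id, p_id)]

# rank() with ties.method = "first" gives rank 1 to exactly one row
one_row <- tied[, .SD[rank(c_dt, ties.method = "first") == 1], by = .(u_id, p_id)]

nrow(with_ties)
nrow(one_row)
```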