Select rows with the per-group minimum using the data.table package

Question

I have the following (simplified) dataset:

df <- data.frame(a=c("A","A","B","B","B"),x=c(1,2,3,3,4))
df
  a x
1 A 1
2 A 2
3 B 3
4 B 3
5 B 4

Since I'm working with large datasets, I use the data.table package.

Is there a way to get the rows of df where x is minimal within each group of a? In this case, I want to select rows 1, 3 and 4.

Something like

as.data.table(df)[, min(x), by=a]

But that doesn't give me the rows I want; it just shows me the minimum value of x for each group of a.

Any suggestions?

Answer 1

library(data.table)
dt <- data.table(a=c("A","A","B","B","B"), x=c(1,2,3,3,4))

These return only one row per group (the first minimum, so ties are dropped):

dt[, .SD[which.min(x)], by=a]

Or alternatively:

setkeyv(dt, c("a","x"))
dt[unique(dt[,a]), mult="first"]
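
To see why ties are dropped here: which.min returns only the position of the first minimum, so each group contributes exactly one row. A small check on the example data (the name first_min is just for illustration):

```r
library(data.table)
dt <- data.table(a = c("A", "A", "B", "B", "B"), x = c(1, 2, 3, 3, 4))

# which.min() returns the position of the first minimum only
which.min(c(3, 3, 4))   # 1

# So each group yields a single row, even though group B has a tied minimum
first_min <- dt[, .SD[which.min(x)], by = a]
nrow(first_min)         # 2
```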

Since you want to have all ties:

dt[,.SD[x==min(x)], by=a]

You could also use:

setkeyv(dt,c("a","x"))
dt[dt[unique(dt[,a]), mult="first"]]

This could be more efficient if you have very large groups.
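
If you'd rather not set a key (which also reorders dt), a variation on the same join idea is to compute the per-group minima and join them back. This is only a sketch: it assumes a data.table version that supports the on argument, and mins and ties are illustrative names.

```r
library(data.table)
dt <- data.table(a = c("A", "A", "B", "B", "B"), x = c(1, 2, 3, 3, 4))

# Step 1: one row per group holding the minimum of x
mins <- dt[, list(x = min(x)), by = a]

# Step 2: join back on both columns; every row of dt that matches
# a group's minimum is kept, so ties survive
ties <- dt[mins, on = c("a", "x")]
nrow(ties)   # 3
```

The join keeps dt unkeyed and in its original order, which may matter if later code relies on row positions.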

Answer 2

Here you go

R) dt <- data.table(a=c("A","A","B","B","B"),x=c(1,2,3,3,4))
R) dt[dt[,list(IDX=.I[x==min(x)]),by=a]$IDX]
   a x
1: A 1
2: B 3
3: B 3

That should be quicker if you want ties (which, as I understood it, you do), since it only collects row numbers with .I rather than subsetting .SD for each group.
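
Broken into steps, the idea is roughly the following. .I holds row numbers within the whole table, so the grouped call only collects indices and the table is subset once at the end. The names idx and res are illustrative, and V1 is the default column name data.table gives an unnamed j result.

```r
library(data.table)
dt <- data.table(a = c("A", "A", "B", "B", "B"), x = c(1, 2, 3, 3, 4))

# For each group, keep the table-wide row numbers where x equals the group minimum
idx <- dt[, .I[x == min(x)], by = a]$V1
idx          # 1 3 4

# Subset the table once with the collected row numbers
res <- dt[idx]
nrow(res)    # 3
```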