I'm using R's ff package and I've got some ffdf objects (dimensions around 1.5M x 80) that I need to work with. I'm having some trouble getting my head around the efficient slicing/dicing operations, though.
For instance, I've got two integer columns named "YEAR" and "AGE", and I want to make a table of AGE for a single YEAR (the code below uses 1999).
One approach is this:
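ffwhich <- function(x, expr) {
  # One bit per row, set where expr evaluates to TRUE
  b <- bit(nrow(x))
  # Evaluate expr chunk by chunk so only one chunk is in RAM at a time
  for (i in chunk(x)) b[i] <- eval(substitute(expr), x[i, ])
  b
}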
bw <- ffwhich(a.fdf, YEAR==1999)
answer <- table(a.fdf[bw, "AGE"])
The table() operation is fast, but building the bit vector is quite slow. Does anyone have any recommendations for doing this better?
The package ffbase provides many base functions for ff / ffdf objects, including subset.ff. With a bit of limited testing, it seems that subset.ff is relatively fast. Try loading ffbase and then using the simpler code you suggested in a previous comment ( with(subset(a.fdf, YEAR==1999) ).
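A minimal sketch of what that could look like, assuming both ff and ffbase are installed. The five-row data frame is a made-up stand-in for a.fdf, and the variable names sub and answer are only illustrative; this is not a benchmark.

```r
library(ff)
library(ffbase)  # assumption: installed; supplies subset() for ffdf objects

# Toy stand-in for a.fdf with the two integer columns from the question
df <- data.frame(YEAR = c(1999L, 1999L, 2005L, 1999L, 2005L),
                 AGE  = c(30L, 41L, 30L, 30L, 52L))
a.fdf <- as.ffdf(df)

# Keep only the rows for one year; the result is again an ffdf
sub <- subset(a.fdf, YEAR == 1999)

# Pull the single AGE column into RAM and tabulate it
answer <- table(sub[, "AGE"])
print(answer)
```

Only the filtered AGE column is read into memory at the end, which is the part that has to fit in RAM.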
My approach would be something like this:
system.time({
  index <- as.ff(which(a.fdf[, 'Location'] == 'exonic'))
  table(a.fdf[index, ][, 'Function'])
})
   user  system elapsed
  1.128   0.172   1.317
Seems to be significantly faster than:
system.time({
  bw <- ffwhich(a.fdf, Location == "exonic")
  table(a.fdf[bw, 'Function'])
})
   user  system elapsed
 24.901   0.208  25.150
YMMV: these columns are factors, not character vectors, and my ffdf is ~4.3M x 42. Both approaches return the same table:
identical(table(a.fdf[bw, 'Function']), table(a.fdf[index, ][, 'Function']))
[1] TRUE
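Note that a.fdf[, 'Location'] reads the whole column into RAM before which() runs. If that is a problem, one possible variation is to filter chunk by chunk and collect only the matching values. The sketch below assumes only the ff package; the toy data uses the YEAR and AGE columns from the question, and the names block and ages are hypothetical. It has not been timed against the versions above.

```r
library(ff)

# Toy stand-in for a.fdf with the two integer columns from the question
df <- data.frame(YEAR = c(1999L, 1999L, 2005L, 1999L, 2005L),
                 AGE  = c(30L, 41L, 30L, 30L, 52L))
a.fdf <- as.ffdf(df)

ages <- integer(0)
for (i in chunk(a.fdf)) {
  # Read just the two needed columns for this chunk as a plain data.frame
  block <- a.fdf[i, c("YEAR", "AGE")]
  # Keep the AGE values of the rows that match the filter
  ages <- c(ages, block$AGE[block$YEAR == 1999])
}

answer <- table(ages)
print(answer)
```

Memory use is bounded by one chunk plus the matching values, at the cost of a loop in R.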
I'm not familiar with manipulating ff objects, but the problem you describe sounds like a classic tapply() task:
answer <- tapply(a.fdf$YEAR[a.fdf$YEAR == 1995], a.fdf$AGE[a.fdf$YEAR == 1995], length)
I would assume something like that would move faster than the two-step solution you give above, but maybe I'm misunderstanding how ff data structures work?
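For what it's worth, on an ordinary in-RAM data.frame the tapply(..., length) form gives the same counts as table() on the filtered column. The sketch below uses a made-up five-row data.frame and computes the filter once instead of twice; it says nothing about whether == can be applied directly to ff columns.

```r
# Toy in-RAM data; column names follow the question
df <- data.frame(YEAR = c(1995L, 1995L, 2005L, 1995L, 2005L),
                 AGE  = c(30L, 41L, 30L, 30L, 52L))

keep <- df$YEAR == 1995          # compute the filter once
ages <- df$AGE[keep]

by_tapply <- tapply(ages, ages, length)  # count per distinct AGE
by_table  <- table(ages)                 # same counts via table()

print(by_tapply)
print(by_table)
```

Both results are indexed by the distinct AGE values, so table() is the shorter way to write the same count.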