Most efficient way of subsetting dataframes
Can anyone suggest a more efficient way of subsetting a data frame without using the SQL, indexing, or data.table options?
I looked for similar questions, and this one suggests the indexing option.
Here are the ways to subset, with timings:
#Dummy data
dat <- data.frame(x = runif(1000000, 1, 1000), y=runif(1000000, 1, 1000))
#Subset and time
system.time(x <- dat[dat$x > 500, ])
# user system elapsed
# 0.092 0.000 0.090
system.time(x <- dat[which(dat$x > 500), ])
# user system elapsed
# 0.040 0.032 0.070
system.time(x <- subset(dat, x > 500))
# user system elapsed
# 0.108 0.004 0.109
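Before comparing speed, it may be worth checking that the three forms return the same rows. The following is a minimal sketch on a smaller data frame; the names `small`, `by_logical`, `by_which`, and `by_subset` are illustrative and not from the original post. It also shows one behavioural difference: when the column contains `NA`, plain logical indexing returns an extra all-`NA` row, while `which()` and `subset()` drop it.

```r
# Illustrative data, smaller than the original so it runs instantly
set.seed(1)
small <- data.frame(x = runif(1000, 1, 1000), y = runif(1000, 1, 1000))

by_logical <- small[small$x > 500, ]
by_which   <- small[which(small$x > 500), ]
by_subset  <- subset(small, x > 500)

# Without missing values the three results should agree
same_lw <- isTRUE(all.equal(by_logical, by_which))
same_ls <- isTRUE(all.equal(by_logical, by_subset))

# Introduce one missing value and compare row counts
small$x[1] <- NA
n_logical <- nrow(small[small$x > 500, ])         # NA in the index yields an all-NA row
n_which   <- nrow(small[which(small$x > 500), ])  # which() ignores NA
n_subset  <- nrow(subset(small, x > 500))         # subset() treats NA as FALSE
```

So the faster form is only a drop-in replacement when the filtered column has no missing values, or when dropping them is the intended behaviour.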
EDIT: As Roland suggested, I used microbenchmark. It seems that which performs the best.
library("ggplot2")
library("microbenchmark")
#Dummy data
dat <- data.frame(x = runif(1000000, 1, 1000), y=runif(1000000, 1, 1000))
#Benchmark
res <- microbenchmark(dat[dat$x > 500, ],
                      dat[which(dat$x > 500), ],
                      subset(dat, x > 500))
#Plot (autoplot() dispatches to the microbenchmark method once ggplot2 is loaded)
autoplot(res)
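If microbenchmark is not installed, a rough comparison can still be made in base R by repeating each expression and taking the median elapsed time. This is only a sketch: the helper `time_median` and the repetition count of 10 are assumptions made for illustration, and the numbers will be noisier than those from microbenchmark.

```r
# Hypothetical helper: median elapsed seconds over `times` calls of f()
time_median <- function(f, times = 10) {
  elapsed <- vapply(seq_len(times),
                    function(i) system.time(f())[["elapsed"]],
                    numeric(1))
  median(elapsed)
}

# Same dummy data as in the question
dat <- data.frame(x = runif(1000000, 1, 1000), y = runif(1000000, 1, 1000))

# One median timing per subsetting approach
timings <- c(
  logical = time_median(function() dat[dat$x > 500, ]),
  which   = time_median(function() dat[which(dat$x > 500), ]),
  subset  = time_median(function() subset(dat, x > 500))
)
timings
```

The median is used rather than the mean so that a single slow run, for example one interrupted by garbage collection, has less influence on the comparison.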