
R: Using "microbenchmark" and ggplot2 to plot runtimes

I am using the R programming language. I want to learn how to measure and plot the run time of different procedures as the size of the data increases.

I found a previous stackoverflow post that answers a similar question: Plot the run time of three functions

It seems that the "microbenchmark" library in R should be able to accomplish this task.

Suppose I simulate the following data:

#load libraries

library(microbenchmark)
library(dplyr)
library(ggplot2)
library(Rtsne)
library(cluster)
library(dbscan)
library(plotly)

#simulate data

var_1 <- rnorm(1000,1,4)
var_2<-rnorm(1000,10,5)
var_3 <- sample( LETTERS[1:4], 1000, replace=TRUE, prob=c(0.1, 0.2, 0.65, 0.05) )
var_4 <- sample( LETTERS[1:2], 1000, replace=TRUE, prob=c(0.4, 0.6) )


#put them into a data frame called "f"
f <- data.frame(var_1, var_2, var_3, var_4)

#declare var_3 and response_variable as factors
f$var_3 = as.factor(f$var_3)
f$var_4 = as.factor(f$var_4)

#add id
f$ID <- seq_along(f[,1])
Now, I want to measure the run time of 7 different procedures:

#Procedure 1:

gower_dist <- daisy(f[,-5],
                    metric = "gower")

gower_mat <- as.matrix(gower_dist)


#Procedure 2

lof <- lof(gower_dist, k=3)

#Procedure 3

lof <- lof(gower_dist, k=5)

#Procedure 4

tsne_obj <- Rtsne(gower_dist,  is_distance = TRUE)

tsne_data <- tsne_obj$Y %>%
    data.frame() %>%
    setNames(c("X", "Y")) %>%
    mutate(
           name = f$ID)

#Procedure 5

tsne_obj <- Rtsne(gower_dist, perplexity =10,  is_distance = TRUE)

tsne_data <- tsne_obj$Y %>%
    data.frame() %>%
    setNames(c("X", "Y")) %>%
    mutate(
           name = f$ID)

#Procedure 6

plot = ggplot(aes(x = X, y = Y), data = tsne_data) + geom_point(aes())

#Procedure 7

tsne_obj <- Rtsne(gower_dist,  is_distance = TRUE)

tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(
    name = f$ID, 
    lof=lof,
    var1=f$var_1,
    var2=f$var_2,
    var3=f$var_3
    )

p1 <- ggplot(aes(x = X, y = Y, size=lof, key=name, var1=var1, 
  var2=var2, var3=var3), data = tsne_data) + 
  geom_point(shape=1, col="red")+
  theme_minimal()

ggplotly(p1, tooltip = c("lof", "name", "var1", "var2", "var3"))

Using the "microbenchmark" library, I can find out the time of individual functions:

procedure_1_part_1 <- microbenchmark(daisy(f[,-5],
                    metric = "gower"))

procedure_1_part_2 <-  microbenchmark(as.matrix(gower_dist))

I want to make a graph of the run times like this:

https://umap-learn.readthedocs.io/en/latest/benchmarking.html

Question: Can someone please show me how to make this graph and use the microbenchmark statement for multiple functions at once, for different sizes of the data frame "f" (e.g. the first 5, 10, 50, 100, 200, 500, 1000 rows)?

microbenchmark(cbind(gower_dist <- daisy(f[1:5,-5], metric = "gower"), gower_mat <- as.matrix(gower_dist)))

microbenchmark(cbind(gower_dist <- daisy(f[1:10,-5], metric = "gower"), gower_mat <- as.matrix(gower_dist)))

microbenchmark(cbind(gower_dist <- daisy(f[1:50,-5], metric = "gower"), gower_mat <- as.matrix(gower_dist)))

etc.

There does not seem to be a straightforward way to do this in R:

mean(procedure_1_part_1)
[1] NA

Warning message:
In mean.default(procedure_1_part_1) :
  argument is not numeric or logical: returning NA

I could manually run each one of these, copy the results into Excel, and plot them, but this would also take a long time.

 tm <- microbenchmark( daisy(f[,-5],
                        metric = "gower"),
    as.matrix(gower_dist))

 tm
Unit: microseconds
                             expr    min     lq     mean  median      uq    max neval cld
 daisy(f[, -5], metric = "gower") 2071.9 2491.4 3144.921 3563.65 3621.00 4727.8   100   b
            as.matrix(gower_dist)  129.3  147.5  194.709  180.80  232.45  414.2   100  a 

Is there a quicker way to make a graph?

Thanks

Here is a solution that benchmarks the first three procedures from the original post by data frame size and then charts their average run times with ggplot().

Setup

We start the process by executing the code necessary to create the data from the original post.

library(dplyr)
library(ggplot2)
library(Rtsne)
library(cluster)
library(dbscan)
library(plotly)
library(microbenchmark)

#simulate data

var_1 <- rnorm(1000,1,4)
var_2<-rnorm(1000,10,5)
var_3 <- sample( LETTERS[1:4], 1000, replace=TRUE, prob=c(0.1, 0.2, 0.65, 0.05) )
var_4 <- sample( LETTERS[1:2], 1000, replace=TRUE, prob=c(0.4, 0.6) )

#put them into a data frame called "f"
f <- data.frame(var_1, var_2, var_3, var_4,ID=1:1000)

#declare var_3 and response_variable as factors
f$var_3 = as.factor(f$var_3)
f$var_4 = as.factor(f$var_4)

Automation of the benchmarking process by data frame size

First, we create a vector of data frame sizes to drive the benchmarking.

# configure run sizes
sizes <- c(5,10,50,100,200,500,1000)

Next, we take the first procedure and alter it so we can vary the number of observations that are used from the data frame f. Note that since we need to use the outputs from this procedure in subsequent steps, we use assign() to write them to the global environment. We also include the number of observations in the object name so we can retrieve them by size in subsequent steps.

# Procedure 1:
proc1 <- function(size){
    assign(paste0("gower_dist_",size), daisy(f[1:size,-5],
                        metric = "gower"),envir = .GlobalEnv)
        
    assign(paste0("gower_mat_",size),as.matrix(get(paste0("gower_dist_",size),envir = .GlobalEnv)),
           envir = .GlobalEnv)
        
}     

To run the benchmark by data frame size we use the sizes vector with lapply() and an anonymous function that executes proc1() repeatedly. We also assign the number of observations to a column called obs so we can use it in the plot.

proc1List <- lapply(sizes,function(x){
        b <- microbenchmark(proc1(x))
        b$obs <- x
        b
})

At this point we have one data frame per benchmark based on size. We combine the benchmarks into a single data frame with do.call() and rbind().

proc1summary <- do.call(rbind,(proc1List))

Next, we use the same process with procedures 2 and 3. Notice how we use get() with paste0() to retrieve the correct gower_dist objects by size.

#Procedure 2

proc2 <- function(size){
        lof <- lof(get(paste0("gower_dist_",size),envir = .GlobalEnv), k=3)
}
proc2List <- lapply(sizes,function(x){
    b <- microbenchmark(proc2(x))
    b$obs <- x
    b
})
proc2summary <- do.call(rbind,(proc2List))

#Procedure 3

proc3 <- function(size){
    lof <- lof(get(paste0("gower_dist_",size),envir = .GlobalEnv), k=5)
}

Since k must be less than the number of observations, we adjust the sizes vector to start at 10 for procedure 3.

# configure run sizes
sizes <- c(10,50,100,200,500,1000)

proc3List <- lapply(sizes,function(x){
    b <- microbenchmark(proc3(x))
    b$obs <- x
    b
})
proc3summary <- do.call(rbind,(proc3List))

Having generated runtime benchmarks for each of the first three procedures, we bind the summary data, summarize to means with dplyr::summarise(), and plot with ggplot().

do.call(rbind,list(proc1summary,proc2summary,proc3summary)) %>% 
    group_by(expr,obs) %>%
    summarise(.,time_ms = mean(time) * .000001) -> proc_time 

The resulting data frame has all the information we need to produce the chart: the procedure used, the number of observations in the original data frame, and the average time in milliseconds.

> head(proc_time)
# A tibble: 6 x 3
# Groups:   expr [1]
  expr       obs time_ms
  <fct>    <dbl>   <dbl>
1 proc1(x)     5   0.612
2 proc1(x)    10   0.957
3 proc1(x)    50   1.32 
4 proc1(x)   100   2.53 
5 proc1(x)   200   5.78 
6 proc1(x)   500  25.9 

Finally, we use ggplot() to produce an xy chart, grouping the lines by procedure used.

ggplot(proc_time,aes(obs,time_ms,group = expr)) +
    geom_line(aes(group = expr),color = "grey80") + 
    geom_point(aes(color = expr))

...and the output:

[chart: average run time (ms) by number of observations for procedures 1-3]

Since procedures 2 and 3 vary only slightly, k = 3 vs. k = 5, they are almost indistinguishable in the chart.
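One optional tweak, not part of the original chart, that can help separate nearly overlapping lines is to plot the run times on a log scale by adding scale_y_log10() to the same plot:

ggplot(proc_time,aes(obs,time_ms,group = expr)) +
    geom_line(aes(group = expr),color = "grey80") +
    geom_point(aes(color = expr)) +
    scale_y_log10()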

Conclusions

With a combination of wrapper functions and lapply() we can generate the information needed to produce the chart requested in the original post.

The general pattern of modifications is (a sketch applying the pattern to procedure 4 follows the list):

  1. Wrap the original procedure in a function that we can use as the unit of analysis for microbenchmark(), and include a size argument
  2. Modify the procedure to use size as a variable where necessary
  3. Modify the procedure to access objects from previous steps, based on the size argument
  4. Modify the procedure to write its outputs with assign() and size if these are needed for subsequent procedure steps
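For illustration only, here is a rough sketch of how procedure 4 (the first Rtsne step) could be wrapped following this pattern. It assumes the gower_dist_<size> objects were written to the global environment by proc1() above and reuses the same lapply() benchmarking loop; the names proc4, tsne_sizes, and tsne_data_<size> are placeholders, not part of the original post.

#Procedure 4 (sketch)
proc4 <- function(size){
    # retrieve the distance object created by proc1() for this size
    tsne_obj <- Rtsne(get(paste0("gower_dist_", size), envir = .GlobalEnv),
                      is_distance = TRUE)

    tsne_data <- tsne_obj$Y %>%
        data.frame() %>%
        setNames(c("X", "Y")) %>%
        mutate(name = f$ID[1:size])

    # write the output to the global environment so later procedures can retrieve it by size
    assign(paste0("tsne_data_", size), tsne_data, envir = .GlobalEnv)
}

# Rtsne's default perplexity (30) needs roughly 3 * perplexity + 1 observations,
# so keep only the larger sizes, just as sizes was adjusted for k in procedure 3
tsne_sizes <- sizes[sizes >= 100]

proc4List <- lapply(tsne_sizes, function(x){
    b <- microbenchmark(proc4(x))
    b$obs <- x
    b
})
proc4summary <- do.call(rbind, proc4List)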

We leave full automation of benchmarking procedures 4 - 7 by data frame size, and their integration into the plot, as an interesting exercise for the original poster.

My first answer severely misunderstood your question. I hope this can be of some help.

library(tidyverse)
library(broom)

# Benchmark your expressions. The following script assumes you name the benchmarks as function_n, but this can (and should be) improved on.
res = microbenchmark(
  rnorm_100 = rnorm(100),
  runif_100 = runif(100),
  rnorm_1000 = rnorm(1000),
  runif_1000 = runif(1000)
)

# We will be using this gist to tidy the frame
# Source: https://gist.github.com/nutterb/e9e6da4525bacac99899168b5d2f07be
tidy.microbenchmark <- function(x, unit, ...){
  summary(x, unit = unit)
}

# Tidy the frame
res_tidy = tidy(res) %>% 
  mutate(expr = as.character(expr)) %>% 
  separate(expr, c("func","n"), remove = FALSE)

res_tidy
#>         expr  func    n    min      lq     mean  median      uq     max neval
#> 1  rnorm_100 rnorm  100  8.112  9.3420 10.58302 10.2915 10.9755  44.903   100
#> 2  runif_100 runif  100  4.487  5.1180  6.12284  6.1990  6.5925  10.907   100
#> 3 rnorm_1000 rnorm 1000 34.631 36.3155 37.78117 37.2665 38.4510  62.951   100
#> 4 runif_1000 runif 1000 34.668 36.6330 39.48718 37.7995 39.2905 105.325   100

# Plot the runtime for the different expressions by sample number
ggplot(res_tidy, aes(x = n, y = mean, group = func, col = func)) +
  geom_line() +
  geom_point() +
  labs(y = "Runtime", x = "n")

Created on 2020-12-26 by the reprex package (v0.3.0)
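As the comment in the benchmark above notes, naming each expression function_n by hand can be improved on. One possible approach, sketched below under the assumption that microbenchmark()'s list argument accepts a named list of unevaluated expressions (with the names used as the expr labels), is to build the expressions programmatically:

# build named expressions for every function / sample-size combination
ns <- c(100, 1000)
exprs <- unlist(
  lapply(ns, function(n) {
    setNames(
      list(bquote(rnorm(.(n))), bquote(runif(.(n)))),
      paste0(c("rnorm_", "runif_"), n)
    )
  }),
  recursive = FALSE
)

res <- microbenchmark(list = exprs)

The tidying and plotting steps shown above should then work unchanged, since the generated names keep the function_n format that separate() expects.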
