Extract values from data frame in R

Using R, I would like to find out which samples (S1, S2, S3, S4, S5) fulfill the following criterion: they contain at least one value (x, y or z) greater than 4. Thanks, Alex.

 Sample    x    y    z
     S1 -0.3  5.3  2.5
     S2  0.4  0.2 -1.2
     S3  1.2 -0.6  3.2
     S4  4.3  0.7  5.7
     S5  2.4  4.3  2.3

You could try a call to apply, for example:

> apply(dataFrameOfSamples,1,function(x)any(x > 4))
   S1    S2    S3    S4    S5
 TRUE FALSE FALSE  TRUE  TRUE
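
To get the names of the matching samples rather than a logical vector, the result can be used as an index. This is a minimal sketch, assuming the data sit in a data frame with a `Sample` column; the names `samples` and `hit` are chosen here for illustration only.

```r
samples <- data.frame(
  Sample = c("S1", "S2", "S3", "S4", "S5"),
  x = c(-0.3, 0.4, 1.2, 4.3, 2.4),
  y = c( 5.3, 0.2, -0.6, 0.7, 4.3),
  z = c( 2.5, -1.2, 3.2, 5.7, 2.3),
  stringsAsFactors = FALSE
)

# Restrict apply() to the numeric columns: including Sample would coerce
# every row to character and make the comparison with 4 meaningless.
hit <- apply(samples[c("x", "y", "z")], 1, function(v) any(v > 4))

# Use the logical vector to pick out the sample names.
samples$Sample[hit]
# [1] "S1" "S4" "S5"
```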

How does this sound? Copy your data to your clipboard and execute the following commands:

dta <- read.table("clipboard", header = TRUE)
# 1 if the row maximum is greater than 4, otherwise 0
apply(dta[2:4], 1, function(x) ifelse(max(x) > 4, 1, 0))

With many rows this could be more efficient:

do.call(pmax, X[c("x","y","z")]) > 4

On your data:

ex <- data.frame(
  Sample = c("S1", "S2", "S3", "S4", "S5"),
  x = c(-0.3, 0.4, 1.2, 4.3, 2.4),
  y = c( 5.3, 0.2,-0.6, 0.7, 4.3),
  z = c( 2.5,-1.2, 3.2, 5.7, 2.3)
)

do.call(pmax, ex[c("x","y","z")]) > 4
# [1]  TRUE FALSE FALSE  TRUE  TRUE
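
It may help to see what `do.call(pmax, ...)` computes before the comparison with 4. The sketch below rebuilds the same example data and checks that the result agrees with the row-wise `apply()` version; the names `row_max` and `same` are introduced here for illustration.

```r
ex <- data.frame(
  Sample = c("S1", "S2", "S3", "S4", "S5"),
  x = c(-0.3, 0.4, 1.2, 4.3, 2.4),
  y = c( 5.3, 0.2, -0.6, 0.7, 4.3),
  z = c( 2.5, -1.2, 3.2, 5.7, 2.3),
  stringsAsFactors = FALSE
)

# do.call() passes each column as a separate argument, so this is the same
# as pmax(ex$x, ex$y, ex$z): the element-wise maximum, i.e. one maximum per row.
row_max <- do.call(pmax, ex[c("x", "y", "z")])
row_max
# [1] 5.3 0.4 3.2 5.7 4.3

# The vectorised result matches the row-by-row apply() approach.
same <- identical(
  row_max > 4,
  unname(apply(ex[c("x", "y", "z")], 1, function(v) any(v > 4)))
)
same
# [1] TRUE
```

Because `pmax` works on whole columns at once, it avoids calling an R function once per row, which is the likely reason it scales better with the number of rows.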

Benchmark summary: the pmax approach is not merely somewhat more efficient, as @Marek suggests; it is a lot more efficient than the other two. It loses some of its advantage when a data frame has more columns, but it is still the fastest approach.

Benchmark. Being an empiricist, I took the liberty of comparing the three approaches using microbenchmark. The following characteristics were varied:

  • The three approaches, here called "pmax" (by @Marek), "apply.any" (by @nullglob) and "apply.ifelse" (by @roman-luštrik)
  • Size of data frame
    • Number of rows: 10, 500, 2500
    • Number of columns: 4, 500, 2500

The performance of the pmax approach is remarkable: it is a lot faster. For the smallest data frame it has an advantage by a factor of more than 3. For 2500 rows and 4 columns, pmax is over 97 times faster than "apply.ifelse" and 57 times faster than "apply.any".

Figure: Duration required to run the benchmarks

The following image shows the performance of the three approaches relative to pmax. Hence, the value for pmax in each combination of rows and columns is always 1 (i.e. 100%), and the other two approaches are shown in relation to that. It shows that pmax is superior, especially for data frames with fewer columns.

Since pmax seems to lose its advantage with an increasing number of columns, it could be that the other approaches become faster with a large number of columns.

Figure: Benchmark normalized to the pmax approach
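
The normalization can be sketched in a few lines of base R. The durations below are made-up placeholders, not measurements, and the names `timings` and `ref` are hypothetical; the point is only to show each duration being divided by the pmax duration of the same rows/columns combination, without relying on row order.

```r
# Placeholder durations (NOT benchmark results), used only to show the step.
timings <- data.frame(
  Approach = rep(c("pmax", "apply.ifelse", "apply.any"), times = 2),
  Rows     = rep(c(10, 500), each = 3),
  Columns  = 4,
  Duration = c(10, 40, 30, 20, 400, 300),
  stringsAsFactors = FALSE
)

# For every row, look up the pmax duration within the same Rows/Columns group.
ref <- ave(
  ifelse(timings$Approach == "pmax", timings$Duration, NA_real_),
  timings$Rows, timings$Columns,
  FUN = function(d) d[!is.na(d)][1]
)

# Relative duration: pmax is 1 by construction, the others are multiples of it.
timings$Relative <- timings$Duration / ref
timings
```

Looking the reference value up by group, rather than by position, keeps the result correct even if the rows are reordered.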

Code used in this post:

library(microbenchmark)

TotalResult <- list()
for (Width in c(4, 500, 1000)) {
  for (Size in c(10, 500, 2500)) {
    ex <- data.frame(
      Sample = paste0("S", 1:Size),
      x = runif(Size, 0, 6),
      y = runif(Size, 0, 6),
      z = runif(Size, 0, 6)
    )
    if (Width > 4)
      for (i in 5:Width)
        ex[[i]] <- runif(Size, 0, 6)

    result <- microbenchmark(
      pmax = { do.call(pmax, ex[2:Width]) > 4 },
      apply.ifelse = { apply(ex[2:Width], 1, function(x) ifelse(max(x) > 4, TRUE, FALSE)) },
      apply.any = apply(ex[2:Width], 1, function(x) any(x > 4)),
      check = "identical"
    )
    cat("Benchmark: Size =", Size, "// Width =", Width, "\n")
    print(result)
    #boxplot(result)
    TotalResult <- c(TotalResult, list(list(Size=Size, Width=Width, Benchmark=result)))
  }
}


Comparison <- data.frame(Approach = character(),
                         Rows   = numeric(),
                         Columns  = numeric(),
                         Duration = double())
for(test in TotalResult) {
  x <- by(test$Benchmark$time, test$Benchmark$expr, median)
  Comparison <- rbind(Comparison,
                      data.frame(
                        Approach = unlist(attr(x, "dimnames")),
                        Rows = test$Size, Columns = test$Width, 
                        Duration = unclass(x)
                      ))
}
Comparison$Rows <- as.factor(Comparison$Rows)
Comparison$Columns <- as.factor(Comparison$Columns)
Comparison$Approach <- factor(Comparison$Approach, levels = c("pmax", "apply.any", "apply.ifelse"))

library(ggplot2)
ggplot(data=Comparison, aes(x=Rows, y=Duration, fill=Approach)) +
  geom_bar(stat="identity", position=position_dodge()) +
  facet_wrap(~ Columns, strip.position = "bottom") +
  theme(strip.placement = "outside") +
  scale_fill_brewer(palette="Paired") + 
  labs(title="Approach Efficiency", x="Size of Data Frame (top: Cols/ bottom: Rows)", y = "Duration µs")


Comparison$RefValue <- Comparison$Duration[rep(seq(1, 25, 3), each=3)]
Comparison$Relative <- Comparison$Duration / Comparison$RefValue

ggplot(data=Comparison, aes(x=Rows, y=Relative, fill=Approach)) +
  geom_bar(stat="identity", position=position_dodge()) +
  facet_wrap(~ Columns, strip.position = "bottom") +
  theme(strip.placement = "outside") +
  scale_fill_brewer(palette="Paired") +
  labs(title="Relative Efficiency", x="Size of Data Frame (Cols/Rows)", y = "Duration relative to pmax")
