Extract values from data frame in R
Using R, I would like to find out which samples (S1, S2, S3, S4, S5) fulfill the following criterion: they contain at least one value (x, y or z) bigger than 4. Thanks, Alex.
Sample     x     y     z
S1      -0.3   5.3   2.5
S2       0.4   0.2  -1.2
S3       1.2  -0.6   3.2
S4       4.3   0.7   5.7
S5       2.4   4.3   2.3
You could try a call to apply, for example:
> apply(dataFrameOfSamples,1,function(x)any(x > 4))
S1 S2 S3 S4 S5
TRUE FALSE FALSE TRUE TRUE
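The named output above suggests that the sample names are row names rather than a column. That matters because apply() converts a data frame to a matrix first, so a character column would turn every value into text. A minimal sketch, assuming the data frame is built by hand like this (the construction is my assumption, only the name dataFrameOfSamples comes from the call above):

```r
# Assumed construction: sample names stored as row names, not as a column
dataFrameOfSamples <- data.frame(
  x = c(-0.3, 0.4, 1.2, 4.3, 2.4),
  y = c( 5.3, 0.2, -0.6, 0.7, 4.3),
  z = c( 2.5, -1.2, 3.2, 5.7, 2.3),
  row.names = c("S1", "S2", "S3", "S4", "S5")
)

# One logical per row: TRUE if any of x, y, z exceeds 4
hits <- apply(dataFrameOfSamples, 1, function(x) any(x > 4))

# Names of the samples that meet the criterion
names(hits)[hits]
# [1] "S1" "S4" "S5"
```

If the names live in a Sample column instead, select only the numeric columns before calling apply.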
How does this sound? Copy your data into your clipboard and execute the following commands:
dta <- read.table("clipboard", header = TRUE)
# 1 if the row maximum of columns 2 to 4 (x, y, z) is bigger than 4, else 0
apply(dta[2:4], 1, function(x) ifelse(max(x) > 4, 1, 0))
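How the clipboard is reached differs between platforms, so here is a portable sketch of the same idea. It assumes the table from the question is pasted into a string and read through the text argument of read.table; flag is a name introduced here for illustration.

```r
# The table from the question as a literal string (avoids the clipboard)
dta <- read.table(text = "
Sample x y z
S1 -0.3 5.3 2.5
S2 0.4 0.2 -1.2
S3 1.2 -0.6 3.2
S4 4.3 0.7 5.7
S5 2.4 4.3 2.3
", header = TRUE)

# Columns 2 to 4 hold x, y and z; flag rows whose maximum exceeds 4
flag <- apply(dta[2:4], 1, function(x) max(x) > 4)

# Sample names of the flagged rows
dta$Sample[flag]
```

Returning the logical directly, rather than wrapping it in ifelse(), makes the result usable as an index right away.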
With many rows this could be more efficient:
do.call(pmax, X[c("x","y","z")]) > 4
On your data:
ex <- data.frame(
Sample = c("S1", "S2", "S3", "S4", "S5"),
x = c(-0.3, 0.4, 1.2, 4.3, 2.4),
y = c( 5.3, 0.2,-0.6, 0.7, 4.3),
z = c( 2.5,-1.2, 3.2, 5.7, 2.3)
)
do.call(pmax, ex[c("x","y","z")]) > 4
# [1] TRUE FALSE FALSE TRUE TRUE
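To spell out why this works: pmax() takes several vectors and returns their element-wise maximum, and do.call() hands each selected column to it as a separate argument. A short sketch on the same ex data frame (row_max and selected are names introduced here for illustration):

```r
ex <- data.frame(
  Sample = c("S1", "S2", "S3", "S4", "S5"),
  x = c(-0.3, 0.4, 1.2, 4.3, 2.4),
  y = c( 5.3, 0.2, -0.6, 0.7, 4.3),
  z = c( 2.5, -1.2, 3.2, 5.7, 2.3)
)

# Same values as pmax(ex$x, ex$y, ex$z): the largest value in each row
row_max <- do.call(pmax, ex[c("x", "y", "z")])
row_max
# [1] 5.3 0.4 3.2 5.7 4.3

# Use the logical vector to pull out the matching sample names
selected <- ex$Sample[row_max > 4]
```

Because the comparison runs once over whole vectors instead of once per row, there is no per-row function call, which is the likely source of the speed difference.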
Benchmark summary: the pmax approach is not just possibly more efficient, as @MArek suggests; it is a lot more efficient than the other two. It loses some of its advantage when a data frame has more columns, but it is still the fastest approach.
Benchmark. Being an empiricist, I took the liberty of comparing the three approaches using microbenchmark. Two characteristics were varied, as in the code at the end of this post: the number of rows (10, 500, 2500) and the number of columns (4, 500, 1000).
The performance of the pmax approach is remarkable: it is a lot faster. For the smallest data frame it has an advantage by a factor of more than 3. For 2500 rows and 4 columns, pmax is over 97 times faster than "apply.ifelse" and 57 times faster than "apply.any".
The following image shows the performance of the three approaches relative to pmax. Hence, the performance of pmax in each combination of rows and columns is always 1 (i.e. 100%), and the other two approaches are shown in relation to that. It shows that pmax is superior, especially for the data frames with fewer columns.
Since pmax seems to lose its advantage with an increasing number of columns, it could be that the other approaches become faster than pmax once the number of columns is large enough.
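One way to probe that conjecture without rerunning the whole benchmark is to time two of the approaches on a single wide data frame. This is only a rough sketch: the dimensions (500 rows, 2000 columns) and the repetition count are arbitrary assumptions of mine, and system.time() is much coarser than microbenchmark, so the numbers are indicative at best.

```r
set.seed(1)                        # reproducible random data
n_rows <- 500; n_cols <- 2000      # assumed dimensions, wider than in the benchmark

# Data frame of uniform values in [0, 6], as in the benchmark
wide <- as.data.frame(matrix(runif(n_rows * n_cols, 0, 6), nrow = n_rows))

# Both approaches must agree before timing them is meaningful
res_pmax  <- do.call(pmax, wide) > 4
res_apply <- apply(wide, 1, function(x) any(x > 4))
stopifnot(all(res_pmax == res_apply))

# Elapsed seconds for 10 repetitions of each approach
t_pmax  <- system.time(for (i in 1:10) do.call(pmax, wide) > 4)["elapsed"]
t_apply <- system.time(for (i in 1:10) apply(wide, 1, function(x) any(x > 4)))["elapsed"]
c(pmax = unname(t_pmax), apply.any = unname(t_apply))
```

Whichever way the comparison comes out on a given machine, the result depends on these assumed dimensions and should be checked with microbenchmark before drawing conclusions.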
Code used in this post:
library(microbenchmark)

# Run the benchmark for every combination of column count (Width) and row count (Size)
TotalResult <- list()
for (Width in c(4, 500, 1000)) {
  for (Size in c(10, 500, 2500)) {
    ex <- data.frame(
      Sample = paste0("S", 1:Size),
      x = runif(Size, 0, 6),
      y = runif(Size, 0, 6),
      z = runif(Size, 0, 6)
    )
    # Pad the data frame with random columns up to the requested width
    if (Width > 4)
      for (i in 5:Width)
        ex[[i]] <- runif(Size, 0, 6)
    result <- microbenchmark(
      pmax = { do.call(pmax, ex[2:Width]) > 4 },
      apply.ifelse = { apply(ex[2:Width], 1, function(x) ifelse(max(x) > 4, TRUE, FALSE)) },
      apply.any = apply(ex[2:Width], 1, function(x) any(x > 4)),
      check = "identical"
    )
    cat("Benchmark: Size =", Size, "// Width =", Width, "\n")
    print(result)
    #boxplot(result)
    TotalResult <- c(TotalResult, list(list(Size=Size, Width=Width, Benchmark=result)))
  }
}

# Collect the median duration per approach and combination
Comparison <- data.frame(Approach = character(),
                         Rows = numeric(),
                         Columns = numeric(),
                         Duration = double())
for(test in TotalResult) {
  x <- by(test$Benchmark$time, test$Benchmark$expr, median)
  Comparison <- rbind(Comparison,
                      data.frame(
                        Approach = unlist(attr(x, "dimnames")),
                        Rows = test$Size, Columns = test$Width,
                        Duration = unclass(x)
                      ))
}
Comparison$Rows <- as.factor(Comparison$Rows)
Comparison$Columns <- as.factor(Comparison$Columns)
Comparison$Approach <- factor(Comparison$Approach, levels = c("pmax", "apply.any", "apply.ifelse"))

library(ggplot2)
# Absolute durations; microbenchmark stores its timings in nanoseconds
ggplot(data=Comparison, aes(x=Rows, y=Duration, fill=Approach)) +
  geom_bar(stat="identity", position=position_dodge()) +
  facet_wrap(~ Columns, strip.position = "bottom") +
  theme(strip.placement = "outside") +
  scale_fill_brewer(palette="Paired") +
  labs(title="Approach Efficiency", x="Size of Data Frame (top: Cols/ bottom: Rows)", y = "Duration (ns)")

# Durations relative to pmax, which is the first of every three rows
Comparison$RefValue <- Comparison$Duration[rep(seq(1, 25, 3), each=3)]
Comparison$Relative <- Comparison$Duration / Comparison$RefValue
ggplot(data=Comparison, aes(x=Rows, y=Relative, fill=Approach)) +
  geom_bar(stat="identity", position=position_dodge()) +
  facet_wrap(~ Columns, strip.position = "bottom") +
  theme(strip.placement = "outside") +
  scale_fill_brewer(palette="Paired") +
  labs(title="Relative Efficiency", x="Size of Data Frame (Cols/Rows)", y = "Duration relative to pmax")