Normalize data to value depending on multiple fields and conditions
I am quite new to R. I have a table with the header (Value, Benchmark, Suite, Var), and I want to normalize each Value to the mean of the baseline for the matching combination of (Benchmark, Var). In other words, each entry (Value, Benchmark, Suite, Var) should be divided by the mean Value of the baseline rows that have the same Benchmark and Var.
The data represents different benchmark measurements, where Var is the input size. A rough sample of the data:
Value Benchmark Suite Var
500 Benchmark2 baseline 1732
889 Benchmark baseline 1732
500 Benchmark2 baseline 1732
889 Benchmark baseline 1732
300 Benchmark Approach1 1732
100 Benchmark2 Approach1 1732
After the transformation, it would look like this:
Value Benchmark Suite Var RuntimeRatio
500 Benchmark2 baseline 1732 1.00
889 Benchmark baseline 1732 1.00
500 Benchmark2 baseline 1732 1.00
889 Benchmark baseline 1732 1.00
300 Benchmark Approach1 1732 0.34 # 300 compared to mean(889,889) of each (Benchmark,baseline,1732)
100 Benchmark2 Approach1 1732 0.20 # 100 compared to mean(500,500) of each (Benchmark2,baseline,1732)
I currently have something like this, but it does not calculate the right thing:
norm <- ddply(data, Var ~ Benchmark, transform,
RuntimeRatio = Value / mean(Value[Suite == "baseline"]))
I think the best and cleanest way to do it is to do a bit of data manipulation before computing the ratio.
Your data:
df <- tibble::tribble(
~Value, ~Benchmark , ~Suite , ~Var,
500 , "Benchmark2", "baseline" , 1732,
889 , "Benchmark" , "baseline" , 1732,
500 , "Benchmark2", "baseline" , 1732,
889 , "Benchmark" , "baseline" , 1732,
300 , "Benchmark" , "Approach1" , 1732,
100 , "Benchmark2", "Approach1" , 1732
)
With the package dplyr we can easily and intuitively manipulate data.
library(dplyr)
# separate the baseline from the rest
df_baseline <- df %>% filter(Suite == "baseline")
df_compare <- df %>% filter(Suite != "baseline")
# calculate the mean of the baseline value for each Benchmark-Var
df_baseline <- df_baseline %>%
group_by(Benchmark, Var) %>%
summarise(Value_baseline = mean(Value)) %>%
ungroup()
# Join the baseline data to the rest of your data with the approaches
df_compare <- df_compare %>%
left_join(df_baseline, by = c("Benchmark", "Var"))
# Calculate your ratio
df_compare %>%
mutate(RuntimeRatio = Value / Value_baseline)
# # A tibble: 2 x 6
# Value Benchmark Suite Var Value_baseline RuntimeRatio
# <dbl> <chr> <chr> <dbl> <dbl> <dbl>
# 1 300 Benchmark Approach1 1732 889 0.337
# 2 100 Benchmark2 Approach1 1732 500 0.2
This approach gives what I believe you actually need: one ratio per non-baseline measurement.
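One thing to watch with the join: left_join keeps every row of df_compare, so a (Benchmark, Var) combination that has no baseline rows ends up with NA in Value_baseline and in RuntimeRatio. The following is a minimal sketch of how such rows could be listed with anti_join. The data frame df_demo, the "Benchmark3" row, and the names baseline_means and missing_baseline are made up for illustration and are not part of your data.

```r
library(dplyr)

# Hypothetical data: "Benchmark3" has an approach row but no baseline row
df_demo <- tibble::tribble(
  ~Value, ~Benchmark,   ~Suite,      ~Var,
  500,    "Benchmark2", "baseline",  1732,
  100,    "Benchmark2", "Approach1", 1732,
  250,    "Benchmark3", "Approach1", 1732
)

# Baseline mean per Benchmark-Var, computed as in the answer
baseline_means <- df_demo %>%
  filter(Suite == "baseline") %>%
  group_by(Benchmark, Var) %>%
  summarise(Value_baseline = mean(Value)) %>%
  ungroup()

# Non-baseline rows whose Benchmark-Var has no baseline mean to divide by
missing_baseline <- df_demo %>%
  filter(Suite != "baseline") %>%
  anti_join(baseline_means, by = c("Benchmark", "Var"))

print(missing_baseline)
```

If missing_baseline has zero rows, every measurement has a baseline to be compared against.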
If you want exactly the output you asked for, join df_baseline to the original df instead:
df %>%
left_join(df_baseline, by = c("Benchmark", "Var")) %>%
mutate(RuntimeRatio = Value / Value_baseline) %>%
select(-Value_baseline)
# # A tibble: 6 x 5
# Value Benchmark Suite Var RuntimeRatio
# <dbl> <chr> <chr> <dbl> <dbl>
# 1 500 Benchmark2 baseline 1732 1
# 2 889 Benchmark baseline 1732 1
# 3 500 Benchmark2 baseline 1732 1
# 4 889 Benchmark baseline 1732 1
# 5 300 Benchmark Approach1 1732 0.337
# 6 100 Benchmark2 Approach1 1732 0.2
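For comparison, the same table can probably be produced without a separate baseline table by translating the ddply attempt from the question into a grouped mutate. This is only a sketch under the assumption that every (Benchmark, Var) group contains at least one baseline row; for a group without one, mean() of an empty vector is NaN, so the ratio would be NaN. The name norm is taken from the question.

```r
library(dplyr)

df <- tibble::tribble(
  ~Value, ~Benchmark,   ~Suite,      ~Var,
  500,    "Benchmark2", "baseline",  1732,
  889,    "Benchmark",  "baseline",  1732,
  500,    "Benchmark2", "baseline",  1732,
  889,    "Benchmark",  "baseline",  1732,
  300,    "Benchmark",  "Approach1", 1732,
  100,    "Benchmark2", "Approach1", 1732
)

# Within each Benchmark-Var group, divide every Value by the mean of
# that group's baseline rows
norm <- df %>%
  group_by(Benchmark, Var) %>%
  mutate(RuntimeRatio = Value / mean(Value[Suite == "baseline"])) %>%
  ungroup()

print(norm)
```

The join-based version above makes the baseline means visible as an intermediate table, which is easier to inspect; the grouped mutate is shorter but hides that step.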