
Normalize data to value depending on multiple fields and conditions

I am quite new to R. I have a table with the header (Value, Benchmark, Suite, Var), and I want to normalize each Value to the mean of the baseline for the same combination of (Benchmark, Var). That is, each entry (Value, Benchmark, Suite, Var) should be divided by the mean Value of the baseline rows that have the same Benchmark and Var.

The data represents different benchmark measurements, where Var is the input size. The data looks like this draft:

Value   Benchmark  Suite     Var
500     Benchmark2 baseline  1732
889     Benchmark  baseline  1732
500     Benchmark2 baseline  1732
889     Benchmark  baseline  1732
300     Benchmark  Approach1 1732
100     Benchmark2 Approach1 1732

After the transformation, it would look like this:

Value   Benchmark  Suite     Var   RuntimeRatio
500     Benchmark2 baseline  1732  1.00
889     Benchmark  baseline  1732  1.00
500     Benchmark2 baseline  1732  1.00
889     Benchmark  baseline  1732  1.00
300     Benchmark  Approach1 1732  0.34 # 300 compared to mean(889,889) of each (Benchmark,baseline,1732)
100     Benchmark2 Approach1 1732  0.20 # 100 compared to mean(500,500) of each (Benchmark2,baseline,1732)
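
To make the expected numbers concrete, the two non-baseline ratios can be reproduced by hand. This is only an arithmetic check on the sample rows above; the variable names are made up for illustration.

```r
# Mean of the baseline rows for each (Benchmark, Var) combination
baseline_mean_benchmark  <- mean(c(889, 889))  # Benchmark,  Var = 1732
baseline_mean_benchmark2 <- mean(c(500, 500))  # Benchmark2, Var = 1732

# Approach1 values divided by the matching baseline mean
ratio_benchmark  <- 300 / baseline_mean_benchmark
ratio_benchmark2 <- 100 / baseline_mean_benchmark2

# Rounded to two decimals, as in the RuntimeRatio column above
round(c(ratio_benchmark, ratio_benchmark2), 2)
```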

I currently have something like this, but it does not calculate the right thing:

norm <- ddply(data, Var ~ Benchmark, transform,
          RuntimeRatio = Value / mean(Value[Suite == "baseline"]))

I think the best and cleanest way to do it is to do a bit of data manipulation prior to the operation.

Your data:

df <- tibble::tribble(
  
  ~Value, ~Benchmark  ,  ~Suite     , ~Var,
  500   , "Benchmark2", "baseline"  , 1732,
  889   , "Benchmark" , "baseline"  , 1732,
  500   , "Benchmark2", "baseline"  , 1732,
  889   , "Benchmark" , "baseline"  , 1732,
  300   , "Benchmark" , "Approach1" , 1732,
  100   , "Benchmark2", "Approach1" , 1732
  
)

With the package dplyr we can easily and intuitively manipulate data.

library(dplyr)

# separate the baseline from the rest
df_baseline <- df %>% filter(Suite == "baseline")
df_compare  <- df %>% filter(Suite != "baseline")

# calculate the mean of the baseline value for each Benchmark-Var
df_baseline <- df_baseline %>% 
  group_by(Benchmark, Var) %>% 
  summarise(Value_baseline = mean(Value)) %>% 
  ungroup()

# Join the baseline data to the rest of your data with the approaches
df_compare <- df_compare %>%
  left_join(df_baseline, by = c("Benchmark", "Var"))

# Calculate your ratio
df_compare %>%
  mutate(RuntimeRatio = Value / Value_baseline)

# # A tibble: 2 x 6
#   Value Benchmark  Suite       Var Value_baseline RuntimeRatio
#   <dbl> <chr>      <chr>     <dbl>          <dbl>        <dbl>
# 1   300 Benchmark  Approach1  1732            889        0.337
# 2   100 Benchmark2 Approach1  1732            500        0.2  

This approach gives what I believe you actually need: only the non-baseline rows, each compared to its baseline.

But if you want exactly what you asked for, you need to join df_baseline to the original df in this way:

df %>% 
  left_join(df_baseline, by = c("Benchmark", "Var")) %>% 
  mutate(RuntimeRatio = Value / Value_baseline) %>% 
  select(-Value_baseline)

# # A tibble: 6 x 5
#   Value Benchmark  Suite       Var RuntimeRatio
#   <dbl> <chr>      <chr>     <dbl>        <dbl>
# 1   500 Benchmark2 baseline   1732        1    
# 2   889 Benchmark  baseline   1732        1    
# 3   500 Benchmark2 baseline   1732        1    
# 4   889 Benchmark  baseline   1732        1    
# 5   300 Benchmark  Approach1  1732        0.337
# 6   100 Benchmark2 Approach1  1732        0.2  
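
As a possible alternative, the same ratios can be computed in one pipeline without a separate baseline table, by grouping on (Benchmark, Var) and dividing by the group's baseline mean inside mutate(). This is a minimal sketch that assumes every (Benchmark, Var) group has at least one baseline row; for a group without one, mean() of an empty vector is NaN, and so is the ratio. The name df_ratio is only illustrative.

```r
library(dplyr)

# Sample data from the question
df <- tibble::tribble(
  ~Value, ~Benchmark  , ~Suite      , ~Var,
  500   , "Benchmark2", "baseline"  , 1732,
  889   , "Benchmark" , "baseline"  , 1732,
  500   , "Benchmark2", "baseline"  , 1732,
  889   , "Benchmark" , "baseline"  , 1732,
  300   , "Benchmark" , "Approach1" , 1732,
  100   , "Benchmark2", "Approach1" , 1732
)

# Within each (Benchmark, Var) group, divide every Value by the mean
# of the baseline Values of that same group
df_ratio <- df %>%
  group_by(Benchmark, Var) %>%
  mutate(RuntimeRatio = Value / mean(Value[Suite == "baseline"])) %>%
  ungroup()

df_ratio
```

The join-based version keeps the baseline means in their own table, which is handy if they are needed elsewhere; the grouped mutate() is shorter when only the ratio column is wanted.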
