How to create an R function to normalize data to sum to 100%

I have the average percent cover for each functional group by Year, Month, Site, and Treatment (see photo). Within each treatment group (that is, each combination of year, month, and site), these functional-group averages do not sum to 100%, and I would like to normalize them so that they do. I was able to write a formula in Excel (shown at the top of the photo), but that approach is labor intensive. I am not sure how to create an R function that would do this automatically. I started writing one (below), but I know the sum(x) part is wrong: I am not sure how to sum the percent cover of all functional groups for each treatment within a given site, month, and year. Perhaps the aggregate function would help? Any help would be greatly appreciated!

normalize <- function(x, na.rm = TRUE) x*100/sum(x)

[Image: spreadsheet of the snipped data]

Here is a reproducible example, given as dput output.

structure(
 list(
  Year = c(2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2017L, 2017L, 2018L, 2018L, 2017L, 2018L, 2018L, 2018L, 2018L, 2018L, 2017L, 2018L, 2018L, 2018L, 2018L, 2018L),
  Month = structure(
   c(2L, 1L, 2L, 1L, 3L, 1L, 3L, 3L, 3L, 4L, 5L, 1L, 2L, 5L, 1L, 2L, 1L, 2L, 3L, 5L, 1L, 2L, 3L, 1L, 2L),
   .Label = c("1", "2", "3", "10", "11"),
   class = "factor"
   ),
  Site = structure(
   c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L),
   .Label = c("RR", "TMB"),
   class = "factor"
   ),
  Treatment = structure(
   c(6L, 7L, 7L, 5L, 5L, 1L, 1L, 4L, 2L, 3L, 4L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 5L, 5L),
   .Label = c("HLU", "U", "HU", "LU", "HL", "B", "H", "L", "P"),
   class = "factor"
   ), 
  Spp.Name = structure(
   c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L),
   .Label = c("Anemones", "Bare Rock", "Barnacles", "Biofilm", "Bleached Coarsely Branched", "Bleached Crustose", "Bleached Jointed Calcareous", "Bleached Sheet", "Brown Coarsely Branched", "Brown Crustose", "Brown Filamentous", "Brown Sheet", "Green Crustose", "Green Filamentous", "Green Sheet", "Mussels", "Red Coarsely Branched", "Red Crustose", "Red Filamentous", "Red Jointed Calcareous", "Red Sheet"),
   class = "factor"
   ), 
  Functional.Group = structure(
   c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L),
   .Label = c("Biofilm", "Bleached Coarsely Branched", "Bleached Crustose", "Bleached Jointed Calcareous", "Bleached Sheet", "Coarsely Branched", "Crustose", "Filamentous", "Invertebrates", "Jointed Calcareous", "Rock", "Sheet"),
   class = "factor"
   ), 
  Cover.Mean = c(12, 19, 2, 2, 6.66666666666667, 3, 13, 2, 1, 1, 3, 28, 9, 48.5, 5, 13, 39, 24, 5.66666666666667, 66.25, 6.66666666666667, 7, 4, 57.25, 41.25)
 ),
 row.names = c(NA, 25L),
 class = "data.frame"
)

Operations in which a calculation is performed separately for every unique value (or combination of values) in one or more columns are called grouped operations. Several functions can do this.

In base R, you can use ave:

df$Std.Cover <- with(df,  Cover.Mean/ave(Cover.Mean, Year, Month, Site, Treatment, 
                FUN = sum) * 100)

Here, the first argument to ave, Cover.Mean, is the variable to which the function sum is applied, and the sum is computed separately for each combination of Year, Month, Site, and Treatment. ave returns a vector of the same length as Cover.Mean in which every row holds its group's total. We divide Cover.Mean by that group total to get a ratio and multiply by 100 to get a percentage.
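Since the question asks for a reusable function, the ave approach above could be wrapped as follows. This is only a minimal sketch: the name normalize_to_100 and the toy data frame are made up for illustration and do not come from the question's data.

```r
# Hypothetical helper (name is an assumption): rescale `value` so that it
# sums to 100 within each combination of the grouping vectors passed in `...`.
normalize_to_100 <- function(value, ..., na.rm = TRUE) {
  # ave() returns a vector as long as `value`, holding each row's group total
  group_total <- ave(value, ..., FUN = function(v) sum(v, na.rm = na.rm))
  value / group_total * 100
}

# Toy data, invented for this example: two groups (RR/B and TMB/H)
toy <- data.frame(
  Site       = c("RR", "RR", "RR", "TMB", "TMB"),
  Treatment  = c("B", "B", "B", "H", "H"),
  Cover.Mean = c(10, 30, 40, 5, 15)
)

toy$Std.Cover <- with(toy, normalize_to_100(Cover.Mean, Site, Treatment))
toy$Std.Cover
# 12.5 37.5 50.0 25.0 75.0
```

With the question's data, the call would presumably be with(df, normalize_to_100(Cover.Mean, Year, Month, Site, Treatment)). Note that a group whose total is 0 yields NaN, because the division is 0/0.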


We can also use solutions from other packages, such as dplyr:

library(dplyr)

df %>%
  group_by(Year, Month, Site, Treatment) %>%
  mutate(Std.Cover = Cover.Mean/sum(Cover.Mean) * 100)

Or data.table:

library(data.table)
setDT(df)[, Std.Cover := Cover.Mean/sum(Cover.Mean) * 100, 
                        .(Year, Month, Site, Treatment)]

After assigning your reproducible example to the variable df, you should be able to do what you want this way:

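for (i in seq_len(nrow(df))) {
  # Indices of the rows in the same Year/Month/Site/Treatment group as row i
  same_group <- which(
    df$Year == df$Year[i] & df$Month == df$Month[i] &
      df$Site == df$Site[i] & df$Treatment == df$Treatment[i]
  )
  # Row i's share of its group's total cover, as a percentage
  df$Std.Cover.Mean[i] <- df$Cover.Mean[i] * 100 / sum(df$Cover.Mean[same_group])
}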

Essentially, for each row, sum adds up all the Cover.Mean values whose Year, Month, Site, and Treatment match those of the row in question.
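Whichever approach is used, a quick sanity check is to confirm that the normalized values really do sum to 100 within every group. The sketch below uses aggregate, which the question mentions, on a small made-up data frame; the toy values are assumptions for illustration, not the question's data.

```r
# Toy data, invented for this example
toy <- data.frame(
  Site       = c("RR", "RR", "RR", "TMB", "TMB"),
  Treatment  = c("B", "B", "B", "H", "H"),
  Cover.Mean = c(10, 30, 40, 5, 15)
)

# Normalize within each Site/Treatment group
toy$Std.Cover <- with(
  toy,
  Cover.Mean / ave(Cover.Mean, Site, Treatment, FUN = sum) * 100
)

# One row per observed group, holding the sum of the normalized values
group_totals <- aggregate(Std.Cover ~ Site + Treatment, data = toy, FUN = sum)

# Every group should total 100, up to floating-point error
stopifnot(all(abs(group_totals$Std.Cover - 100) < 1e-8))
```

For the question's data, the formula would presumably be Std.Cover ~ Year + Month + Site + Treatment.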
