
How to apply a function to multiple columns to create multiple new columns in R?

I have this list of sequences, aqi_range, and a data frame, df:

aqi_range = list(0:50,51:100,101:250)

df

   PM10_mean PM10_min PM10_max PM2.5_mean PM2.5_min PM2.5_max
 1      85.6        3      264       75.7         3       240
 2     105.         6      243       76.4         3       191
 3      95.8       19      287       48.4         8       134
 4      85.5       50      166       64.8        32       103
 5      55.9       24      117       46.7        19        77
 6      37.5        6      116       31.3         3        87
 7      26          5       69       15.5         3        49
 8      82.3       34      169       49.6        25       120
 9      170        68      272       133         67       201
10      254       189      323       226        173       269

I've created these two fairly simple functions that I want to apply to this data frame to calculate the AQI (Air Quality Index) for each pollutant.

#a = column from a dataframe  **PM10_mean, PM2.5_mean**
#b = list of sequences defined above
min_max_diff <- function(a,b){
        for (i in b){
          if (a %in% i){
           min_val = min(i)
           max_val = max(i)
           return (max_val - min_val)
        }}}

#a = column from a dataframe  **PM10_mean, PM2.5_mean**
#b = list of sequences defined above
c_low <- function(a,b){
      for (i in b){
       if (a %in% i){
        min_val = min(i)
        return(min_val)
          } 
      }}

Basically, the first function, min_max_diff, takes a value from column df$PM10_mean or df$PM2.5_mean, looks it up in the list aqi_range, and returns the difference between the maximum and minimum of the sequence that contains it. Similarly, the second function, c_low, returns just the minimum of that sequence.

I want to apply this kind of manipulation (formula below) to the PM10_mean column to create a new column PM10_AQI, and likewise for PM2.5_mean:

df$PM10_AQI  = min_max_diff(df$PM10_mean,aqi_range) / (df$PM10_max - df$PM10_min) / * (df$PM10_mean -  df$PM10_min) + c_low(df$PM10_mean,aqi_range)

I hope that explains it properly.

If your problem is just how to apply the given transformation to several columns of a data frame, you could write a for loop, build the name of each variable involved in the transformation with string functions (sub() is useful here), and refer to the columns using the [ notation rather than the $ notation, since [ accepts strings to specify columns.
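As a minimal sketch of the [ versus $ point, assuming a toy data frame and a made-up derived column name (PM10_shift) that are not part of your data:

```r
# Hypothetical toy data frame; column names follow the question's naming pattern
toy <- data.frame(PM10_mean = c(85.6, 105.0), PM10_min = c(3, 6))

col <- "PM10_mean"                       # column name held in a variable
new_col <- sub("_mean$", "_shift", col)  # hypothetical derived name: "PM10_shift"

# toy$col looks for a column literally named "col" and returns NULL;
# toy[, col] uses the string stored in `col`
missing_col <- toy$col
toy[, new_col] <- toy[, col] - toy[, "PM10_min"]
```

Because the column name is an ordinary string, it can be built inside a loop, which is what makes the approach work for any number of pollutants.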

Below is an example of such code, using a small sample of 3 observations:

(Note that I changed the definition of the AQI ranges: I now define only the breaks where the range changes, assuming they are all integers. I also collapsed your functions min_max_diff() and c_low() into a single function that returns the min and max of the AQI range in which each value is found. Again, this assumes the AQI range limits are integers.)

# Definition of the AQI ranges (which are assumed to be based on integer values)
# Note that if the number of AQI ranges is k, the number of breaks is k+1
# Each break value defines the minimum of the range
# The maximum of each range is computed as the "minimum of the NEXT range" - 1
# (again this assumes integer values in AQI ranges)
# The values (e.g. PM10_mean) whose AQI range is searched for are assumed
# to NOT be larger than or equal to the largest break value.
aqi_range_breaks = c(0, 51, 101, 251)

# Example data (top 3 rows of the data frame you provided)
df = data.frame(PM10_mean=c(85.6, 105.0, 95.8),
                PM10_min=c(3, 6, 19),
                PM10_max=c(264, 243, 287),
                PM2.5_mean=c(75.7, 76.4, 48.4),
                PM2.5_min=c(3, 3, 8),
                PM2.5_max=c(240, 191, 134))

# Function that returns the minimum and maximum AQI values
# of the AQI range where the given values are found
# `values`: array of values that are searched for in the AQI ranges
# defined by the second parameter.
# `aqi_range_breaks`: breaks defining the minimum values of each AQI range
# plus one last value defining a value never attained by `values`.
# (all values in this parameter defining the AQI ranges are assumed integer values)
find_aqi_range_min_max <- function(values, aqi_range_breaks){
  aqi_range_groups = findInterval(values, aqi_range_breaks)
  return( list(min=aqi_range_breaks[aqi_range_groups],
               max=aqi_range_breaks[aqi_range_groups + 1] - 1))
}

# Run the variable transformation on the selected `_mean` columns
vars_mean = c("PM10_mean", "PM2.5_mean")
for (vmean in vars_mean) {
  vmin = sub("_mean$", "_min", vmean)
  vmax = sub("_mean$", "_max", vmean)
  vaqi = sub("_mean$", "_AQI", vmean)
  aqi_range_min_max = find_aqi_range_min_max(df[,vmean], aqi_range_breaks)
  df[,vaqi] = (aqi_range_min_max$max - aqi_range_min_max$min) / 
              (df[,vmax] - df[,vmin]) / (df[,vmean] -  df[,vmin]) +
              aqi_range_min_max$min
}

Note how findInterval() is used to find the range into which each element of a vector of values falls. That is the key to making your transformation work on a whole data frame column.
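To see what findInterval() returns on its own, here is a small sketch using the same break values and three values taken from the sample data (the variable names breaks, x and idx are just for illustration):

```r
breaks <- c(0, 51, 101, 251)    # same values as aqi_range_breaks
x <- c(85.6, 105.0, 48.4)       # values taken from the sample data

idx <- findInterval(x, breaks)  # index of the interval each value falls in
idx                             # 2 3 1
lower <- breaks[idx]            # lower bounds: 51 101 0
upper <- breaks[idx + 1] - 1    # upper bounds: 100 250 50

# A value at or above the last break gets index 4, so breaks[4 + 1] is NA;
# this is why the values are assumed to stay below the largest break
findInterval(300, breaks)       # 4
```

Unlike a %in% test against an integer sequence such as 51:100, this works for non-integer values like 85.6 and for a whole vector at once.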

The expected output of this process is:

  PM10_mean PM10_min PM10_max PM2.5_mean PM2.5_min PM2.5_max  PM10_AQI    PM2.5_AQI
1      85.6        3      264       75.7         3       240  51.00227 51.002843893
2     105.0        6      243       76.4         3       191 101.00635 51.003550930
3      95.8       19      287       48.4         8       134  51.00238  0.009822411

Please check the formula that computes the AQI, because the one in your question has a syntax error (look for / * , which I replaced with / in the formula in my code).
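In case the / * was meant to be * instead (an assumption on my part, the question does not say), the two readings give quite different results. A sketch for the first PM10 observation, with illustrative variable names:

```r
# First PM10 observation of the sample data
v_mean <- 85.6; v_min <- 3; v_max <- 264
range_min <- 51; range_max <- 100   # AQI range that contains 85.6

# Reading used in the loop above: "/ *" taken as "/"
aqi_div <- (range_max - range_min) / (v_max - v_min) / (v_mean - v_min) + range_min

# Alternative reading (assumption): "/ *" taken as "*"
aqi_mul <- (range_max - range_min) / (v_max - v_min) * (v_mean - v_min) + range_min

aqi_div   # about 51.00227, matching the output shown above
aqi_mul   # about 66.50728
```

If the multiplication is what you intended, only that one operator in the loop needs to change.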

Note that the $ in the regular expression passed to sub() anchors the match to the end of the string, so "_mean" is replaced only when it occurs at the end of the variable name.
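A quick sketch of the difference the anchor makes, using a made-up column name (PM_mean_of_means) that is not in your data:

```r
# Anchored pattern on a real column name from the example
a <- sub("_mean$", "_min", "PM10_mean")         # "PM10_min"

# Anchored pattern: no match, because the name ends in "_means", not "_mean"
b <- sub("_mean$", "_min", "PM_mean_of_means")  # "PM_mean_of_means" (unchanged)

# Unanchored pattern: the first "_mean" is replaced wherever it occurs
c_unanchored <- sub("_mean", "_min", "PM_mean_of_means")  # "PM_min_of_means"
```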
