
How to apply a function to multiple columns to create multiple new columns in R?

I have this list of sequences, aqi_range, and a data frame, df:

aqi_range = list(0:50,51:100,101:250)

df

   PM10_mean PM10_min PM10_max PM2.5_mean PM2.5_min PM2.5_max
 1      85.6        3      264       75.7         3       240
 2     105.         6      243       76.4         3       191
 3      95.8       19      287       48.4         8       134
 4      85.5       50      166       64.8        32       103
 5      55.9       24      117       46.7        19        77
 6      37.5        6      116       31.3         3        87
 7      26          5       69       15.5         3        49
 8      82.3       34      169       49.6        25       120
 9      170        68      272       133         67       201
10      254       189      323       226        173       269

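I have written these two very simple functions, and I want to apply them to this data frame to calculate the AQI (Air Quality Index) for each pollutant: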

# a = column from a data frame (PM10_mean, PM2.5_mean)
# b = list of sequences defined above
min_max_diff <- function(a, b) {
  for (i in b) {
    if (a %in% i) {
      min_val = min(i)
      max_val = max(i)
      return(max_val - min_val)
    }
  }
}

# a = column from a data frame (PM10_mean, PM2.5_mean)
# b = list of sequences defined above
c_low <- function(a, b) {
  for (i in b) {
    if (a %in% i) {
      min_val = min(i)
      return(min_val)
    }
  }
}

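Basically, the first function, "min_max_diff", takes a value from the column df$PM10_mean / df$PM2.5_mean, looks it up in the list "aqi_range", and returns the difference between the maximum and the minimum of the sequence that contains it. Similarly, the second function, "c_low", just returns the minimum of that sequence.

I want to apply the following operation (the formula below) to the PM10_mean column to create a new column, PM10_AQI: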

df$PM10_AQI  = min_max_diff(df$PM10_mean,aqi_range) / (df$PM10_max - df$PM10_min) / * (df$PM10_mean -  df$PM10_min) + c_low(df$PM10_mean,aqi_range)

I hope this explains it properly.

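If your question is simply how to compute a given transformation for several columns of a data frame, you can write a for loop that builds the name of each variable involved in the transformation with a string manipulation function (sub() is useful here) and refers to the columns of the data frame with the [ notation (as opposed to the $ notation, since the [ notation accepts strings to specify columns).

Below is a code example on a small sample of 3 observations:

(Note that I changed the definition of the AQI ranges: I now define only the break points where the range changes, assuming they are all integers. Your functions min_max_diff() and c_low() are also collapsed into a single function that returns the minimum and maximum of the AQI range in which each value is found, again assuming that AQI values are integers.)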

# Definition of the AQI ranges (which are assumed to be based on integer values)
# Note that if the number of AQI ranges is k, the number of breaks is k+1
# Each break value defines the minimum of the range
# The maximum of each range is computed as the "minimum of the NEXT range" - 1
# (again this assumes integer values in AQI ranges)
# The values (e.g. PM10_mean) whose AQI range is searched for are assumed
# to NOT be larger than or equal to the largest break value.
aqi_range_breaks = c(0, 51, 101, 251)

# Example data (top 3 rows of the data frame you provided)
df = data.frame(PM10_mean=c(85.6, 105.0, 95.8),
                PM10_min=c(3, 6, 19),
                PM10_max=c(264, 243, 287),
                PM2.5_mean=c(75.7, 76.4, 48.4),
                PM2.5_min=c(3, 3, 8),
                PM2.5_max=c(240, 191, 134))

# Function that returns the minimum and maximum AQI values
# of the AQI range where the given values are found
# `values`: array of values that are searched for in the AQI ranges
# defined by the second parameter.
# `aqi_range_breaks`: breaks defining the minimum values of each AQI range
# plus one last value defining a value never attained by `values`.
# (all values in this parameter defining the AQI ranges are assumed integer values)
find_aqi_range_min_max <- function(values, aqi_range_breaks){
  aqi_range_groups = findInterval(values, aqi_range_breaks)
  return( list(min=aqi_range_breaks[aqi_range_groups],
               max=aqi_range_breaks[aqi_range_groups + 1] - 1))
}

# Run the variable transformation on the selected `_mean` columns
vars_mean = c("PM10_mean", "PM2.5_mean")
for (vmean in vars_mean) {
  vmin = sub("_mean$", "_min", vmean)
  vmax = sub("_mean$", "_max", vmean)
  vaqi = sub("_mean$", "_AQI", vmean)
  aqi_range_min_max = find_aqi_range_min_max(df[,vmean], aqi_range_breaks)
  df[,vaqi] = (aqi_range_min_max$max - aqi_range_min_max$min) / 
              (df[,vmax] - df[,vmin]) / (df[,vmean] -  df[,vmin]) +
              aqi_range_min_max$min
}

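Note how the findInterval() function is used to find the range in which each element of an array of values falls. This is the key to making your transformation work on data frame columns.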
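
As a minimal sketch of what findInterval() returns, here it is applied to the breaks defined above and to four PM10_mean values taken from the question's full data frame (26, 85.6, 105 and 254). The names values, groups, range_min and range_max are only illustrative and are not part of the code above:

```r
# Breaks as defined in the answer's code
aqi_range_breaks <- c(0, 51, 101, 251)

# findInterval() is vectorised: it returns one interval index per input value
values <- c(26, 85.6, 105, 254)
groups <- findInterval(values, aqi_range_breaks)
groups
# 1 2 3 4

# Minimum and maximum of the range each value falls in
range_min <- aqi_range_breaks[groups]          # 0 51 101 251
range_max <- aqi_range_breaks[groups + 1] - 1  # 50 100 250 NA
```

The last value, 254, is at or above the largest break (251), so its range maximum comes out as NA. This is the case excluded by the assumption stated in the code comments; covering it would presumably need an extra break above the largest expected value.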

The expected output of this process is:

  PM10_mean PM10_min PM10_max PM2.5_mean PM2.5_min PM2.5_max  PM10_AQI    PM2.5_AQI
1      85.6        3      264       75.7         3       240  51.00227 51.002843893
2     105.0        6      243       76.4         3       191 101.00635 51.003550930
3      95.8       19      287       48.4         8       134  51.00238  0.009822411

Please check the formula for computing the AQI, as it contains a syntax error (look for / *, which I replaced with / in the formula in the code).
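
If the intended operator was actually * rather than / (this is only a guess at what the question meant, not something the question confirms), the same steps for one pollutant might look like the sketch below. The names groups, aqi_min, aqi_max and PM10_AQI_mult are illustrative:

```r
aqi_range_breaks <- c(0, 51, 101, 251)

# Same 3-row sample as above, PM10 columns only
df <- data.frame(PM10_mean = c(85.6, 105.0, 95.8),
                 PM10_min  = c(3, 6, 19),
                 PM10_max  = c(264, 243, 287))

groups  <- findInterval(df$PM10_mean, aqi_range_breaks)
aqi_min <- aqi_range_breaks[groups]
aqi_max <- aqi_range_breaks[groups + 1] - 1

# ASSUMPTION: the `/ *` in the question is read as `*`
PM10_AQI_mult <- (aqi_max - aqi_min) / (df$PM10_max - df$PM10_min) *
                 (df$PM10_mean - df$PM10_min) + aqi_min
```

With this reading the results differ from the output table above, which was produced with / in that position.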

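Note that the $ in the regular expression passed to sub() anchors the match to the end of the string, so "_mean" is replaced only when it appears at the end of the variable name.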
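
A small sketch of the difference the anchor makes. The column name "PM10_mean_old" is hypothetical and is used only to show a name where "_mean" is not at the end:

```r
# "_mean" at the end of the name: replaced
anchored_end <- sub("_mean$", "_min", "PM10_mean")      # "PM10_min"

# "_mean" in the middle (hypothetical name): the anchored pattern leaves it alone
anchored_mid <- sub("_mean$", "_min", "PM10_mean_old")  # "PM10_mean_old"

# Without the anchor, the first "_mean" is replaced wherever it occurs
unanchored   <- sub("_mean", "_min", "PM10_mean_old")   # "PM10_min_old"
```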

