繁体   English   中英

如何根据列名将功能应用于特定列?

[英]How to apply function to specific columns based upon column name?

我正在使用类似于以下内容的广泛数据集:

在此处输入图像描述

我正在寻找一个函数,我可以迭代具有相似名称但名称不同的列集。 就函数本身而言,为了简单起见,我将创建一个取两列平均值的函数。

avg <- function(data, scorecol, distcol) {
  ScoreDistanceAvg = (scorecol + distcol)/2
  data$ScoreDistanceAvg <- ScoreDistanceAvg
  return(data)
}

avg(data = dat, scorecol = dat$ScoreGame0, distcol = dat$DistanceGame0)

如何将新函数应用于名称重复但数字不同的列集? 也就是说,如何创建一个取 ScoreGame0 和 DistanceGame0 均值的列,然后创建一个取 ScoreGame5 和 DistanceGame5 均值的列,等等? 这将是最终输出:

在此处输入图像描述

当然,我可以多次运行该函数,但由于我的完整数据集要大得多,我该如何自动化这个过程呢? 我想它涉及应用,但我不确定如何将应用与这样的重复模式一起使用。 此外,我想它可能涉及重写函数以更好地自动化列的命名。

数据:

structure(list(Player = c("Lebron James", "Lebron James", "Lebron James", 
"Lebron James", "Lebron James", "Lebron James", "Lebron James", 
"Lebron James", "Lebron James", "Lebron James", "Lebron James", 
"Lebron James", "Steph Curry", "Steph Curry", "Steph Curry", 
"Steph Curry", "Steph Curry", "Steph Curry", "Steph Curry", "Steph Curry", 
"Steph Curry", "Steph Curry", "Steph Curry", "Steph Curry"), 
    Game = c(0L, 1L, 2L, 3L, 4L, 5L, 0L, 1L, 2L, 3L, 4L, 5L, 
    0L, 1L, 2L, 3L, 4L, 5L, 0L, 1L, 2L, 3L, 4L, 5L), ScoreGame0 = c(32L, 
    32L, 32L, 32L, 32L, 32L, 44L, 44L, 44L, 44L, 44L, 44L, 45L, 
    45L, 45L, 45L, 45L, 45L, 76L, 76L, 76L, 76L, 76L, 76L), ScoreGame5 = c(27L, 
    27L, 27L, 27L, 27L, 27L, 12L, 12L, 12L, 12L, 12L, 12L, 76L, 
    76L, 76L, 76L, 76L, 76L, 32L, 32L, 32L, 32L, 32L, 32L), DistanceGame0 = c(12L, 
    12L, 12L, 12L, 12L, 12L, 79L, 79L, 79L, 79L, 79L, 79L, 18L, 
    18L, 18L, 18L, 18L, 18L, 88L, 88L, 88L, 88L, 88L, 88L), DistanceGame5 = c(13L, 
    13L, 13L, 13L, 13L, 13L, 34L, 34L, 34L, 34L, 34L, 34L, 42L, 
    42L, 42L, 42L, 42L, 42L, 54L, 54L, 54L, 54L, 54L, 54L)), class = "data.frame", row.names = c(NA, 
-24L))

稍微重写你的函数,并通过mapply在列grep使用它。 sort使这更加安全。

avg <- function(scorecol, distcol) {
  (scorecol + distcol)/2
}

mapply(avg, dat[sort(grep('ScoreGame', names(dat)))], dat[sort(grep('DistanceGame', names(dat)))])
#       ScoreGame0 ScoreGame5
#  [1,]       22.0         20
#  [2,]       22.0         20
#  [3,]       22.0         20
#  [4,]       22.0         20
#  [5,]       22.0         20
#  [6,]       22.0         20
#  [7,]       61.5         23
#  [8,]       61.5         23
#  [9,]       61.5         23
# [10,]       61.5         23
# [11,]       61.5         23
# [12,]       61.5         23
# [13,]       31.5         59
# [14,]       31.5         59
# [15,]       31.5         59
# [16,]       31.5         59
# [17,]       31.5         59
# [18,]       31.5         59
# [19,]       82.0         43
# [20,]       82.0         43
# [21,]       82.0         43
# [22,]       82.0         43
# [23,]       82.0         43
# [24,]       82.0         43

看看grep做了什么尝试

grep('DistanceGame', names(dat), value=TRUE)
# [1] "DistanceGame0" "DistanceGame5"

这是一个带有 forloop 和readr的解决方案:

library(readr)

game_num <- names(dat) |> 
  readr::parse_number() |> 
  na.omit()

for(i in unique(game_num)) {
  avg <- paste0("ScoreDistanceAvg", i)
  score <- paste0("ScoreGame", i)
  distance <- paste0("DistanceGame", i)
  dat[[avg]] <- (dat[[score]] + dat[[distance]])/2
}

这使:

         Player Game ScoreGame0 ScoreGame5 DistanceGame0 DistanceGame5 ScoreDistanceAvg0 ScoreDistanceAvg5
1  Lebron James    0         32         27            12            13              22.0                20
2  Lebron James    1         32         27            12            13              22.0                20
3  Lebron James    2         32         27            12            13              22.0                20
4  Lebron James    3         32         27            12            13              22.0                20
5  Lebron James    4         32         27            12            13              22.0                20
6  Lebron James    5         32         27            12            13              22.0                20
7  Lebron James    0         44         12            79            34              61.5                23
8  Lebron James    1         44         12            79            34              61.5                23
9  Lebron James    2         44         12            79            34              61.5                23
10 Lebron James    3         44         12            79            34              61.5                23
11 Lebron James    4         44         12            79            34              61.5                23
12 Lebron James    5         44         12            79            34              61.5                23
13  Steph Curry    0         45         76            18            42              31.5                59

在基础 R 中:

cols_used <- names(df[, -(1:2)])
f <- sub("[^0-9]+", 'ScoreDistance', cols_used)    
data.frame(lapply(split.default(df[cols_used], f), rowMeans))

  ScoreDistance0 ScoreDistance5
1            22.0             20
2            22.0             20
3            22.0             20
4            22.0             20
5            22.0             20
6            22.0             20
7            61.5             23
8            61.5             23
9            61.5             23
10           61.5             23
11           61.5             23
12           61.5             23
13           31.5             59
14           31.5             59
15           31.5             59
16           31.5             59
17           31.5             59
18           31.5             59
19           82.0             43
20           82.0             43
21           82.0             43
22           82.0             43
23           82.0             43
24           82.0             43

使用 tidyverse:

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM