R: How to loop over variables selected by name from a data frame and create, for each one, a new variable containing its column mean?

Question

I have a dataset containing a number of numeric variables whose names all start with "Ranking". For each of these variables, I want to add another variable to the dataset that contains the column mean of that variable.

So the data look something like this:

| Ranking_blah | Ranking_bleh |
| ------------ | ------------ |
| 1            | 0            |
| -1           | 1            |
| NA           | 0.5          |

and what I want is:

| Ranking_blah | Ranking_bleh | Ranking_blah_mean | Ranking_bleh_mean |
| ------------ | ------------ | ----------------- | ----------------- |
| 1            | 0            | 0                 | 0.5               |
| -1           | 1            | 0                 | 0.5               |
| NA           | 0.5          | 0                 | 0.5               |

(I am aware that this way each mean variable has the same value in every row. I need this because the data will be reshaped later.)

What I've tried so far:

# getting a list of all ranking variables I want to create a new mean variable from
ranking_variables = names(data)[grepl("Ranking", names(data))]

# creating a new variable for each base variable in the list and setting it to the mean of the respective base variable
data[paste0(ranking_variables, "_mean")] <- do.call(cbind, lapply(data[ranking_variables], function(x) mean(x, na.rm = TRUE)))

The second part is not working, though; it only yields NA values. What am I doing wrong?

Answer 1

An alternative approach is to use dplyr's across():

library(dplyr)

dat |>
    mutate(across(starts_with("Ranking"), ~ mean(.x, na.rm = TRUE), .names = "{.col}_mean"))

Output:

# A tibble: 3 × 4
  Ranking_blah Ranking_bleh Ranking_blah_mean Ranking_bleh_mean
         <dbl>        <dbl>             <dbl>             <dbl>
1            1          0                   0               0.5
2           -1          1                   0               0.5
3           NA          0.5                 0               0.5

Data:

dat <- tibble(Ranking_blah = c(1, -1, NA), Ranking_bleh = c(0, 1, 0.5))
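
For comparison with the base R attempt in the question, here is a minimal sketch that uses a plain for loop instead of dplyr. It assumes the same example data, stored in a data frame called `data` as in the question, and relies on R recycling a length-one value to every row when a new column is assigned.

```r
# Example data, same values as above (assumed to stand in for the real dataset)
data <- data.frame(Ranking_blah = c(1, -1, NA), Ranking_bleh = c(0, 1, 0.5))

# Names of all columns that start with "Ranking"
ranking_variables <- grep("^Ranking", names(data), value = TRUE)

# For each ranking column, add a "<name>_mean" column holding its mean;
# the single mean value is recycled to every row
for (v in ranking_variables) {
  data[[paste0(v, "_mean")]] <- mean(data[[v]], na.rm = TRUE)
}

print(data)
```

The list of names is computed once before the loop, so the newly created `_mean` columns are not picked up again.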

Answer 2

The across approach is fine; here is another one:

Tidy data is less of a struggle, because R makes it easier to compute across rows than across columns.

Tidy data means that every observation has its own row and every variable its own column. Columns are designed to represent variables. I think the "Ranking…" columns are not distinct variables, but different observations of the variable "type". To fix this, we can use tidyr.
See the tidy data chapter of R for Data Science.

library(tidyverse)

data <- data.frame(Ranking_blah = c(1,-1,NA), Ranking_bleh = c(0,1,0.5))
data$id <- c(1:nrow(data))

pivot_longer(data,1:2,names_to = "type") %>%
  group_by(type) %>%
  mutate(mean = mean(value, na.rm = TRUE)) %>%
  ungroup()

# A tibble: 6 × 4
     id type         value  mean
  <int> <chr>        <dbl> <dbl>
1     1 Ranking_blah   1     0  
2     1 Ranking_bleh   0     0.5
3     2 Ranking_blah  -1     0  
4     2 Ranking_bleh   1     0.5
5     3 Ranking_blah  NA     0  
6     3 Ranking_bleh   0.5   0.5

This data is less human-readable, but more R-friendly.
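
As a follow-up sketch under the same assumptions (the example data and the `type`/`value` column names used above), the long format also makes it easy to get just one mean per type with `summarise()`, in case the repeated mean column is not needed in every row.

```r
library(dplyr)
library(tidyr)

# Same example data as above
data <- data.frame(Ranking_blah = c(1, -1, NA), Ranking_bleh = c(0, 1, 0.5))
data$id <- seq_len(nrow(data))

# Reshape to long format: one row per id and ranking type
long <- pivot_longer(data, starts_with("Ranking"), names_to = "type")

# Collapse to one row per type, holding that type's mean
type_means <- long %>%
  group_by(type) %>%
  summarise(mean = mean(value, na.rm = TRUE))

print(type_means)
```

Selecting the columns with `starts_with("Ranking")` rather than by position (`1:2`) is a design choice here, so the reshape keeps working if the column order changes.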


Answer 1
2 accepted 2022-07-21 11:47:28

Answer 2
0 2022-07-21 13:06:44
