R: How to loop over a name-based selection of variables from a dataframe and for each create a new variable containing the column mean of the first?
I have a dataset containing a number of numeric variables whose names all start with "Ranking". For each of these variables, I want to add another variable to the dataset that contains the column mean of the original variable.
So the data look something like this:
| Ranking_blah | Ranking_bleh |
| -------- | ---------- |
| 1 | 0 |
| -1 | 1 |
| NA | 0.5 |
and what I want is:
| Ranking_blah | Ranking_bleh | Ranking_blah_mean | Ranking_bleh_mean |
| -------- | ---------- |---------------- |----------------|
| 1 | 0 | 0 | 0.5 |
| -1 | 1 | 0 | 0.5 |
| NA | 0.5 | 0 | 0.5 |
(I am aware that this way each mean variable has the same value in every row - I need this because the data will be reshaped later)
What I've tried so far:
#getting a list of all ranking variables I want to create a new mean variable from
ranking_variables = names(data)[grepl("Ranking", names(data))]
#creating a new variable for each base variable in the list and setting it to the mean of the respective base variable
data[paste0(ranking_variables, "_mean")] <- do.call(cbind, lapply(data[ranking_variables], function(x) mean(x, na.rm = TRUE)))
The second part is not working, though: it only yields NA values. What am I doing wrong?
An alternative approach is to use dplyr's across():
library(dplyr)

dat |>
  mutate(across(starts_with("Ranking"), ~ mean(.x, na.rm = TRUE), .names = "{.col}_mean"))
Output:
# A tibble: 3 × 4
  Ranking_blah Ranking_bleh Ranking_blah_mean Ranking_bleh_mean
         <dbl>        <dbl>             <dbl>             <dbl>
1            1          0                   0               0.5
2           -1          1                   0               0.5
3           NA          0.5                 0               0.5
Data:
dat <- tibble(Ranking_blah = c(1, -1, NA), Ranking_bleh = c(0, 1, 0.5))
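For comparison, the same result can probably be reached in base R without dplyr. The following is only a sketch on the example data above, not a diagnosis of the code in the question: it computes one mean per column with colMeans() and then repeats each mean once per row before assigning. The name col_means is introduced here purely for illustration.

```r
# Example data, same values as above but as a plain data.frame
dat <- data.frame(Ranking_blah = c(1, -1, NA), Ranking_bleh = c(0, 1, 0.5))

# Names of all columns that start with "Ranking"
ranking_variables <- grep("^Ranking", names(dat), value = TRUE)

# One mean per selected column, ignoring missing values
col_means <- colMeans(dat[ranking_variables], na.rm = TRUE)

# Repeat each mean once per row and attach the results as new "<name>_mean" columns
dat[paste0(ranking_variables, "_mean")] <- lapply(col_means, rep, times = nrow(dat))
```

Repeating each mean to the full column length before assigning avoids relying on any recycling behaviour of data frame assignment.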
The across() approach is fine; here is another one:
Tidy data is less of a struggle, because R makes it easier to compute across rows than across columns.
Tidy data means that every observation has its own row and every variable its own column. Columns are designed to represent variables. I think the "Ranking…" columns are not distinct variables, but different observations of a single variable, "type". To fix this, we can use tidyr to reshape the data into long format.
See this chapter of R for Data Science.
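library(tidyverse)

data <- data.frame(Ranking_blah = c(1, -1, NA), Ranking_bleh = c(0, 1, 0.5))
data$id <- seq_len(nrow(data))  # row identifier, so each observation stays traceable after reshaping

# Reshape to long format (one row per id and type), then compute the mean per type
pivot_longer(data, starts_with("Ranking"), names_to = "type") %>%
  group_by(type) %>%
  mutate(mean = mean(value, na.rm = TRUE)) %>%
  ungroup()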
# A tibble: 6 × 4
     id type         value  mean
  <int> <chr>        <dbl> <dbl>
1     1 Ranking_blah   1     0
2     1 Ranking_bleh   0     0.5
3     2 Ranking_blah  -1     0
4     2 Ranking_bleh   1     0.5
5     3 Ranking_blah  NA     0
6     3 Ranking_bleh   0.5   0.5
This data is less human readable, but more R friendly.
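To illustrate the "R friendly" point: once the data is in long format, a per-type summary needs no matching on column names at all. This is a minimal sketch on the same example data; the names long, type_summary and n_obs are made up for this illustration.

```r
library(dplyr)
library(tidyr)

data <- data.frame(Ranking_blah = c(1, -1, NA), Ranking_bleh = c(0, 1, 0.5))
data$id <- seq_len(nrow(data))

# Long format: one row per (id, type) pair
long <- pivot_longer(data, starts_with("Ranking"), names_to = "type")

# One row per type, with its mean and its number of non-missing values
type_summary <- long %>%
  group_by(type) %>%
  summarise(mean = mean(value, na.rm = TRUE), n_obs = sum(!is.na(value)))
```

Adding further "Ranking…" columns to the input would not require any change to the summary step, since they simply become additional values of type.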