How do I create a new variable for a group of observations based on another variable specific to that group?
I'm trying to add a new variable that is based on the observation for one level of a factor within each group in my dataset. I've been trying to use various dplyr functions (filter, select, mutate, group_by) but can't figure out how to get them to work together to accomplish my goal.
Here is a sample of my data:
rep rate n mort avg
<fct> <fct> <int> <dbl> <dbl>
1 1 0.747 10 7 0.7
2 1 0.373 10 7 0.7
3 1 0.187 10 6 0.6
4 1 0.0933 10 0 0
5 1 0.00 10 1 0.1
6 2 0.747 10 7 0.7
7 2 0.373 10 5 0.5
8 2 0.187 10 1 0.1
9 2 0.0933 10 4 0.4
10 2 0.00 10 0 0
What I'm hoping to accomplish is to create a new variable called cont that is derived from the avg variable when rate == "0.00". This variable would be the same for each observation within the same rep group. The final product would be a table similar to the one below:
rep rate n mort avg cont
<fct> <fct> <int> <dbl> <dbl> <dbl>
1 1 0.747 10 7 0.7 0.1
2 1 0.373 10 7 0.7 0.1
3 1 0.187 10 6 0.6 0.1
4 1 0.0933 10 0 0 0.1
5 1 0.00 10 1 0.1 0.1
6 2 0.747 10 7 0.7 0
7 2 0.373 10 5 0.5 0
8 2 0.187 10 1 0.1 0
9 2 0.0933 10 4 0.4 0
10 2 0.00 10 0 0 0
I've tried the following code:
data %>% group_by(rep) %>% filter(rate == "0.00") %>% select(avg)
which results in a data frame containing the values that I want to add as the new variable:
rep avg
<fct> <dbl>
1 1 0.1
2 2 0
3 3 0.1
4 4 0.3
5 5 0
6 6 0
7 7 0
8 8 0
My problem now is that I have no idea how to create the new variable for each observation within the rep group. I'm not sure how to use mutate properly in this situation. Thank you in advance for any help!
Assuming there is only one occurrence of rate == "0.00" in each group, we can do:
library(dplyr)
df %>%
group_by(rep) %>%
mutate(cont = avg[rate == "0.00"])
# rep rate n mort avg cont
# <fct> <fct> <int> <dbl> <dbl> <dbl>
# 1 1 0.747 10 7 0.7 0.1
# 2 1 0.373 10 7 0.7 0.1
# 3 1 0.187 10 6 0.6 0.1
# 4 1 0.0933 10 0 0 0.1
# 5 1 0.00 10 1 0.1 0.1
# 6 2 0.747 10 7 0.7 0
# 7 2 0.373 10 5 0.5 0
# 8 2 0.187 10 1 0.1 0
# 9 2 0.0933 10 4 0.4 0
#10 2 0.00 10 0 0 0
If there is more than one occurrence, we can use which.max to select the first one:
df %>% group_by(rep) %>% mutate(cont = avg[which.max(rate == "0.00")])
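To see why the single-occurrence assumption matters, here is a minimal base R sketch on toy vectors standing in for one rep group. The values are made up for illustration and are not from the question's data.

```r
# Toy vectors for a single rep group (hypothetical values, not from the question)
rate <- c("0.747", "0.00", "0.373", "0.00")
avg  <- c(0.7, 0.1, 0.5, 0.3)

is_control <- rate == "0.00"

# Plain logical subsetting returns every match, here two values
all_matches <- avg[is_control]

# which.max() on a logical vector gives the position of the first TRUE
first_match <- avg[which.max(is_control)]

# Caveat: when nothing is TRUE, which.max() still returns 1,
# so the first avg value of the group would be picked silently
no_control <- which.max(c("0.747", "0.373") == "0.00")

print(all_matches)
print(first_match)
print(no_control)
```

If a rep group might have no rate == "0.00" row at all, it is probably worth checking for that case before relying on which.max.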
Using data.table, we can do:
library(data.table)
setDT(df)[, cont := avg[rate == "0.00"], by = rep]
Data:
df <- structure(list(rep = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L), .Label = c("1", "2"), class = "factor"), rate = structure(c(5L,
4L, 3L, 2L, 1L, 5L, 4L, 3L, 2L, 1L), .Label = c("0.00", "0.0933",
"0.187", "0.373", "0.747"), class = "factor"), n = c(10L, 10L,
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L), mort = c(7, 7, 6, 0,
1, 7, 5, 1, 4, 0), avg = c(0.7, 0.7, 0.6, 0, 0.1, 0.7, 0.5, 0.1,
0.4, 0)), row.names = c("1", "2", "3", "4", "5", "6", "7", "8",
"9", "10"), class = "data.frame")
We can use match:
library(dplyr)
df %>%
group_by(rep) %>%
mutate(cont = avg[match("0.00", rate)])
# A tibble: 10 x 6
# Groups: rep [2]
# rep rate n mort avg cont
# <fct> <fct> <int> <dbl> <dbl> <dbl>
# 1 1 0.747 10 7 0.7 0.1
# 2 1 0.373 10 7 0.7 0.1
# 3 1 0.187 10 6 0.6 0.1
# 4 1 0.0933 10 0 0 0.1
# 5 1 0.00 10 1 0.1 0.1
# 6 2 0.747 10 7 0.7 0
# 7 2 0.373 10 5 0.5 0
# 8 2 0.187 10 1 0.1 0
# 9 2 0.0933 10 4 0.4 0
#10 2 0.00 10 0 0 0
Or with data.table:
library(data.table)
setDT(df)[, cont := avg[match("0.00", rate)], rep]
Or using a join, as @thelatemail suggested:
setDT(df)[df[rate=="0.00"], on= .(rep), cont := i.avg]
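The same join idea can be sketched in dplyr, building on the filter/select pipeline from the question. This is one possible approach rather than the answerer's code: the name controls is hypothetical, the data frame is a reduced toy version of the sample data, and it assumes exactly one rate == "0.00" row per rep (with more than one, left_join would duplicate rows).

```r
library(dplyr)

# Reduced toy version of the sample data (three rows per rep)
df <- data.frame(
  rep  = factor(c(1, 1, 1, 2, 2, 2)),
  rate = factor(c("0.747", "0.373", "0.00", "0.747", "0.373", "0.00")),
  avg  = c(0.7, 0.7, 0.1, 0.7, 0.5, 0)
)

# One row per rep holding the avg of the rate == "0.00" observation,
# renamed to cont (the name 'controls' is hypothetical)
controls <- df %>%
  filter(rate == "0.00") %>%
  select(rep, cont = avg)

# Attach that value to every row of the matching rep group
result <- df %>%
  left_join(controls, by = "rep")

print(result)
```

Compared with the grouped mutate, this keeps the lookup table as a separate object, which can be handy if the control values need to be inspected on their own.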
Note: both methods work even if there are duplicate values, as match returns only the index of the first match.
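The behaviour of match described in the note can be checked with a small base R sketch. The vectors below are made-up stand-ins for a single rep group, not the question's data. The second case, a group with no "0.00" level, is an extra assumption added here to show what happens when nothing matches.

```r
# Toy group with two "0.00" rows (hypothetical values)
rate <- factor(c("0.747", "0.00", "0.373", "0.00"))
avg  <- c(0.7, 0.1, 0.5, 0.3)

# match() returns only the position of the first match
idx_dup  <- match("0.00", rate)
cont_dup <- avg[idx_dup]

# Toy group with no "0.00" row: match() returns NA,
# and indexing with NA gives a single NA value
rate_missing <- factor(c("0.747", "0.373"))
avg_missing  <- c(0.7, 0.5)
idx_missing  <- match("0.00", rate_missing)
cont_missing <- avg_missing[idx_missing]

print(idx_dup)
print(cont_dup)
print(idx_missing)
print(cont_missing)
```

So with the match approach, a group lacking a "0.00" row should end up with cont set to NA rather than an arbitrary value.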