I have a data frame with 4 columns (part of it shown below).
The first column shows groups ordered by numbers: 1, 2, ....
I want to generate a new column "value4". For each group, if the group size is at least 3 (>=3) and all the values in column "value1" are greater than 2 (>2) or less than -2 (< -2), then the median of the corresponding values in column "value3" is calculated and put in column "value4" for every row of that group. Otherwise, the values from "value2" are copied to column "value4".
g value1 value2 value3
1 1.1 8 1
1 1.2 8 1
1 1.3 9 1
2 3 10 5
2 4 11 5
2 5 0 4
2 6 1 6
3 -3 2 5
3 -4 3 10
3 -5 4 0
4 -3 1 0
4 -4 1 0
The expected output is:
g value1 value2 value3 value4
1 1.1 8 1 8 # for group "1", all the values in "value1" are <2, so the values from column "value2" are taken
1 1.2 8 1 8
1 1.3 9 1 9
2 3 10 5 5 # for group "2", all the values in "value1" are >2, median of numbers 5,5,4,6 from column "value3" is calculated
2 4 11 5 5
2 5 0 4 5
2 6 1 6 5
3 -3 2 5 5 # for group "3", all the values in "value1" are < -2, median of numbers 5,10,0 from column "value3" is calculated
3 -4 3 10 5
3 -5 4 0 5
4 -3 1 0 1 # group size less than 3, so the values from column "value2" are taken
4 -4 1 0 1
I think I can use aggregate(), but I don't know how to integrate the conditions. I appreciate your time and help.
Based on the condition, we can use an if/else expression that checks the group size (n()) and whether all of value1 is less than -2 or all of it is greater than 2. If both hold, take the median of 'value3'; otherwise return 'value2'.
library(dplyr)
df1 %>%
  group_by(g) %>%
  mutate(value4 = if (n() > 2 && (all(value1 > 2) || all(value1 < -2))) median(value3)
                  else value2)
# A tibble: 12 x 5
# Groups: g [4]
# g value1 value2 value3 value4
# <int> <dbl> <int> <int> <dbl>
# 1 1 1.1 8 1 8
# 2 1 1.2 8 1 8
# 3 1 1.3 9 1 9
# 4 2 3 10 5 5
# 5 2 4 11 5 5
# 6 2 5 0 4 5
# 7 2 6 1 6 5
# 8 3 -3 2 5 5
# 9 3 -4 3 10 5
#10 3 -5 4 0 5
#11 4 -3 1 0 1
#12 4 -4 1 0 1
df1 <- structure(list(g = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
4L, 4L), value1 = c(1.1, 1.2, 1.3, 3, 4, 5, 6, -3, -4, -5, -3,
-4), value2 = c(8L, 8L, 9L, 10L, 11L, 0L, 1L, 2L, 3L, 4L, 1L,
1L), value3 = c(1L, 1L, 1L, 5L, 5L, 4L, 6L, 5L, 10L, 0L, 0L,
0L)), class = "data.frame", row.names = c(NA, -12L))
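For readers who prefer to avoid extra packages, the same logic can be sketched in base R with ave(), which applies a function within each group and returns a vector aligned with the original rows. This is only a sketch that rebuilds the example data inline; the helper names use_median and grp_median are made up for illustration.

```r
# Example data rebuilt inline so the sketch is self-contained
df1 <- data.frame(
  g      = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L),
  value1 = c(1.1, 1.2, 1.3, 3, 4, 5, 6, -3, -4, -5, -3, -4),
  value2 = c(8L, 8L, 9L, 10L, 11L, 0L, 1L, 2L, 3L, 4L, 1L, 1L),
  value3 = c(1L, 1L, 1L, 5L, 5L, 4L, 6L, 5L, 10L, 0L, 0L, 0L)
)

# Per-group flag, repeated for every row of the group (ave() returns 1/0 here)
use_median <- ave(df1$value1, df1$g, FUN = function(v) {
  length(v) > 2 && (all(v > 2) || all(v < -2))
}) == 1

# Per-group median of value3, again aligned with the rows
grp_median <- ave(as.numeric(df1$value3), df1$g, FUN = median)

# Pick the group median where the flag holds, value2 otherwise
df1$value4 <- ifelse(use_median, grp_median, df1$value2)
```

Both helper vectors have one element per row, so the final ifelse() is an ordinary element-wise choice.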
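You can use the data.table package as follows:
library(data.table)
# as.numeric() keeps value4 the same type in every group (median() can return a double)
setDT(df1)[, value4 := if (.N > 2 && (all(value1 > 2) || all(value1 < -2))) as.numeric(median(value3)) else as.numeric(value2), by = g]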
This is an ideal situation for case_when().*
You would like value4 to be calculated based on the following condition: if the group size is > 2 and the absolute value of every value1 in the group is > 2, take the median of value3. Otherwise, use value2.
library(dplyr)
df1 %>%
  group_by(g) %>%
  # as.numeric() gives both branches the same type
  mutate(value4 = case_when((n() > 2) & all(abs(value1) > 2) ~ as.numeric(median(value3)),
                            TRUE ~ as.numeric(value2)))
*One would think we could use if_else() here because there is only one condition, but for some reason it failed when using all() in the condition. I think it was returning multiple values? Unclear, but maybe someone else could explain.
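On the footnote above: a likely explanation, as far as I can tell, is a length mismatch rather than all() returning multiple values. all() collapses the group to a single TRUE/FALSE, while value2 still has one element per row, and if_else() expects true and false to be length 1 or the same length as the condition. The sketch below illustrates the same kind of mismatch with base R's ifelse(), using values copied from group 2 of the example; the variable names cond, short and full are only for illustration.

```r
# Values copied from group 2 of the example data (illustration only)
value1 <- c(3, 4, 5, 6)
value2 <- c(10L, 11L, 0L, 1L)
value3 <- c(5L, 5L, 4L, 6L)

# all() collapses the whole group to a single logical value
cond <- length(value1) > 2 && all(abs(value1) > 2)
length(cond)    # 1
length(value2)  # 4

# Base ifelse() shapes its result like the test, so a length-1 test
# gives a length-1 result instead of one value per row
short <- ifelse(cond, median(value3), value2)

# Repeating the condition once per row restores one result per row
full <- ifelse(rep(cond, length(value2)), median(value3), value2)
```

Inside a grouped mutate(), the analogous step would presumably be rep(cond, n()), although the plain if/else from the first answer sidesteps the issue because it only needs a single TRUE or FALSE.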