Make an average based on condition from second df in R
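I have the following problem. I have two data frames. The second one holds the conditions for computing a new column in the first data frame. See the example below. First df: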
df1 <- data.frame(country = c("01", "01", "02", "03", "03", "03", "04", "05"),
                  date = c("2020-01-01", "2020-01-02", "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-01", "2020-01-02"),
                  value = c(4, 3, 2, -3, 1.5, 12, 10, 15),
                  blabla = c(23, 41, 32, 8, 50, 27, 8, 7)
)
Second df:
df2 <- data.frame(country = c("01", "02", "03", "04", "05"),
                  match_country1 = c("02", "03", "01", "01", "01"),
                  match_country2 = c("03", "04", "02", "02", "03"),
                  match_country3 = c("05", "05", "04", "03", "04")
)
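Now I need to compute a new_value, which is the average of the values of the three match countries defined in df2. The date in df1 must be respected. For example, new_value for country "01" and date "2020-01-01" is the average of the values for countries "02", "03" and "05", all on date "2020-01-01".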
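To make the rule concrete, here is a minimal sketch that applies it to a single (country, date) pair. The helper name `new_value_for` is hypothetical and not part of the question, and returning `NA` when no match country has data on that date is an assumption taken from the desired output shown in the question.

```r
# Data from the question.
df1 <- data.frame(
  country = c("01", "01", "02", "03", "03", "03", "04", "05"),
  date    = c("2020-01-01", "2020-01-02", "2020-01-02", "2020-01-02",
              "2020-01-03", "2020-01-04", "2020-01-01", "2020-01-02"),
  value   = c(4, 3, 2, -3, 1.5, 12, 10, 15),
  blabla  = c(23, 41, 32, 8, 50, 27, 8, 7)
)
df2 <- data.frame(
  country        = c("01", "02", "03", "04", "05"),
  match_country1 = c("02", "03", "01", "01", "01"),
  match_country2 = c("03", "04", "02", "02", "03"),
  match_country3 = c("05", "05", "04", "03", "04")
)

# Hypothetical helper: the rule for one (country, date) pair.
new_value_for <- function(ctry, day) {
  # The three match countries listed for `ctry` in df2.
  partners <- unlist(df2[df2$country == ctry, -1], use.names = FALSE)
  # Values of those countries on the same date.
  vals <- df1$value[df1$date == day & df1$country %in% partners]
  # Assumption: no data for any match country means NA.
  if (length(vals) == 0) NA_real_ else mean(vals)
}

new_value_for("01", "2020-01-02")  # averages 2, -3 and 15
new_value_for("01", "2020-01-01")  # no match country has data on this date
```

Calling the helper once per row of df1 (for example with `mapply`) would fill the whole column, although the answers below do this more efficiently with joins.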
The desired output is as follows:
new_df <- data.frame(country = c("01", "01", "02", "03", "03", "03", "04", "05"),
                     date = c("2020-01-01", "2020-01-02", "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-01", "2020-01-02"),
                     value = c(4, 3, 2, -3, 1.5, 12, 10, 15),
                     blabla = c(23, 41, 32, 8, 50, 27, 8, 7),
                     new_value = c(NA,          #because no data for 02, 03, 05 from 2020-01-01
                                   (2-3+15)/3,
                                   (-3+15)/2,   #because no data for 04 from 2020-01-02
                                   (3+2)/2,     #because no data for 04 from 2020-01-02
                                   NA,          #because no data for 01, 02, 04 from 2020-01-03
                                   NA,          #because no data for 01, 02, 04 from 2020-01-04
                                   4,           #because no data for 02, 03 from 2020-01-01
                                   (3-3)/2      #because no data for 04 from 2020-01-02
                     )
)
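How can I do this, please?
This can be done with an SQL triple join. For each row of df1 (alias a), a left join fetches the matching country row in df2 (alias c). A second left join then fetches, from a second instance of df1 (alias b), every row that has the same date and whose country is one of the match countries in df2. Finally, take the average of b.value over the matched rows.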
library(sqldf)
sqldf("select a.*, avg(b.value) new_value
  from df1 a
  left join df2 c on a.country = c.country
  left join df1 b on a.date = b.date and
    b.country in (c.match_country1, c.match_country2, c.match_country3)
  group by a.rowid")
giving this data frame:
  country       date value blabla new_value
1      01 2020-01-01   4.0     23        NA
2      01 2020-01-02   3.0     41  4.666667
3      02 2020-01-02   2.0     32  6.000000
4      03 2020-01-02  -3.0      8  2.500000
5      03 2020-01-03   1.5     50        NA
6      03 2020-01-04  12.0     27        NA
7      04 2020-01-01  10.0      8  4.000000
8      05 2020-01-02  15.0      7  0.000000
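For readers who want to see what the three joins do without SQL, the same steps can be sketched with base R `merge`. This is only an illustration under stated assumptions, not the answer's method: the `row_id` column stands in for SQLite's `rowid`, and the intermediate names `ac`, `cand` and `abc` are hypothetical.

```r
# Data from the question.
df1 <- data.frame(
  country = c("01", "01", "02", "03", "03", "03", "04", "05"),
  date    = c("2020-01-01", "2020-01-02", "2020-01-02", "2020-01-02",
              "2020-01-03", "2020-01-04", "2020-01-01", "2020-01-02"),
  value   = c(4, 3, 2, -3, 1.5, 12, 10, 15),
  blabla  = c(23, 41, 32, 8, 50, 27, 8, 7)
)
df2 <- data.frame(
  country        = c("01", "02", "03", "04", "05"),
  match_country1 = c("02", "03", "01", "01", "01"),
  match_country2 = c("03", "04", "02", "02", "03"),
  match_country3 = c("05", "05", "04", "03", "04")
)

# Step 1: "a left join c" attaches the three match countries to each df1 row.
df1$row_id <- seq_len(nrow(df1))  # stands in for SQLite's rowid
ac <- merge(df1, df2, by = "country", all.x = TRUE)

# Step 2: stack the match columns so each candidate partner is its own row.
match_cols <- grep("^match_country", names(ac), value = TRUE)
cand <- data.frame(
  row_id  = rep(ac$row_id, times = length(match_cols)),
  date    = rep(ac$date,   times = length(match_cols)),
  partner = unlist(ac[match_cols], use.names = FALSE)
)

# Step 3: "left join b" keeps every candidate and looks up the partner's
# value on the same date; unmatched candidates get NA.
b <- df1[c("country", "date", "value")]
names(b) <- c("partner", "date", "b_value")
abc <- merge(cand, b, by = c("partner", "date"), all.x = TRUE)

# Step 4: "group by rowid" and average, ignoring unmatched partners.
avg <- tapply(abc$b_value, abc$row_id, function(v) {
  v <- v[!is.na(v)]
  if (length(v) == 0) NA_real_ else mean(v)
})
df1$new_value <- as.numeric(avg[as.character(df1$row_id)])
df1$row_id <- NULL
df1
```

Like SQL's `avg`, step 4 skips missing values, so a row whose match countries have no data on that date ends up with `NA`.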
Here are two variations. The first generates the in (...) string as matches and substitutes it into the query; the second first converts df2 to long form (long).
# Variation 1: build the column list for in (...) and substitute it via $matches
matches <- toString(names(df2)[-1])
fn$sqldf("select a.*, avg(b.value) new_value
  from df1 a
  left join df2 c on a.country = c.country
  left join df1 b on a.date = b.date and b.country in ($matches)
  group by a.rowid")
# Variation 2: reshape df2 to long form, one row per (country, match_country) pair
varying <- list(match_country = names(df2)[-1])
long <- reshape(df2, direction = "long", varying = varying, v.names = names(varying))
sqldf("select a.*, avg(b.value) new_value
  from df1 a
  left join long c on a.country = c.country
  left join df1 b on a.date = b.date and b.country = c.match_country
  group by a.rowid")
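It can help to look at the two intermediate objects these variations rely on. The sketch below needs only base R and the df2 from the question. It spells the `reshape` argument out as `direction`, which is the full name of the argument.

```r
# df2 from the question.
df2 <- data.frame(
  country        = c("01", "02", "03", "04", "05"),
  match_country1 = c("02", "03", "01", "01", "01"),
  match_country2 = c("03", "04", "02", "02", "03"),
  match_country3 = c("05", "05", "04", "03", "04")
)

# Variation 1: a comma-separated list of the match column names,
# which fn$sqldf pastes into the query in place of $matches.
matches <- toString(names(df2)[-1])
matches
# "match_country1, match_country2, match_country3"

# Variation 2: one row per (country, match country) pair.
varying <- list(match_country = names(df2)[-1])
long <- reshape(df2, direction = "long", varying = varying,
                v.names = names(varying))
nrow(long)  # 5 countries x 3 match columns = 15
long[long$country == "01", c("country", "match_country")]
```

With df2 in long form, the `in (...)` test becomes a plain equality join on `match_country`, so the query no longer depends on how many match columns df2 has.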
This tidyverse approach may help:
df1
#>   country       date value blabla
#> 1      01 2020-01-01   4.0     23
#> 2      01 2020-01-02   3.0     41
#> 3      02 2020-01-02   2.0     32
#> 4      03 2020-01-02  -3.0      8
#> 5      03 2020-01-03   1.5     50
#> 6      03 2020-01-04  12.0     27
#> 7      04 2020-01-01  10.0      8
#> 8      05 2020-01-02  15.0      7
df2
#>   country match_country1 match_country2 match_country3
#> 1      01             02             03             05
#> 2      02             03             04             05
#> 3      03             01             02             04
#> 4      04             01             02             03
#> 5      05             01             03             04
suppressMessages(library(tidyverse))
df1 %>%
  left_join(df2, by = 'country') %>%
  nest(data = !date) %>%
  mutate(data = map(data, ~.x %>%
                      mutate(across(contains('match'), ~value[match(., country)])) %>%
                      rowwise() %>%
                      mutate(avg = mean(c_across(contains('match')), na.rm = TRUE)) %>%
                      select(!contains('match'))
                    )
         ) %>%
  unnest(data)
#> # A tibble: 8 x 5
#>   date       country value blabla    avg
#>   <chr>      <chr>   <dbl>  <dbl>  <dbl>
#> 1 2020-01-01 01        4       23 NaN
#> 2 2020-01-01 04       10        8   4
#> 3 2020-01-02 01        3       41   4.67
#> 4 2020-01-02 02        2       32   6
#> 5 2020-01-02 03       -3        8   2.5
#> 6 2020-01-02 05       15        7   0
#> 7 2020-01-03 03        1.5     50 NaN
#> 8 2020-01-04 03       12       27 NaN
Created on 2021-05-02 by the reprex package (v2.0.0)
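Note that this output has NaN where the desired output has NA. That happens because `mean(..., na.rm = TRUE)` on a vector with nothing left after dropping the NAs returns NaN. If NA is required, a small post-processing step along these lines should do it. The `avg` vector here is a hand-copied stand-in for the avg column above (with 14/3 for the printed 4.67), not output from the pipeline itself.

```r
# Every partner missing: after na.rm drops them, the mean of nothing is NaN.
mean(c(NA_real_, NA_real_, NA_real_), na.rm = TRUE)

# Stand-in for the avg column printed above (assumption: copied by hand).
avg <- c(NaN, 4, 14/3, 6, 2.5, 0, NaN, NaN)

# is.nan() is TRUE only for NaN, so genuine values are left untouched.
avg[is.nan(avg)] <- NA_real_
avg
```

Inside the pipeline the same idea could be applied to the avg column in a final `mutate` step, and the original row order of df1 restored afterwards if it matters.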
Although there is already an accepted answer, here is a base R solution, since both of the posted answers require external packages.
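# Group the row indices by date; FUN returns one new_value per row in the group.
df1$new_value <- with(df1, ave(seq_along(date), date, FUN = function(i){
  # Rows for this date joined to their match countries (merge sorts by country).
  mrg <- merge(df1[i, ], df2)
  j <- grep("^match", names(mrg))
  ctry <- unique(df1[i, "country"])
  apply(mrg[j], 1, function(row){
    # Positions of the match countries among the countries present on this date.
    k <- match(row, ctry)
    if(any(!is.na(k)))
      mean(mrg[k, "value"], na.rm = TRUE)
    else NA_real_
  })
}))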
identical(df1$new_value, new_df$new_value)
#[1] TRUE