
Make an average based on condition from second df in R

I have the following problem. I have two data frames. The second one holds the conditions for computing a new column in the first. See the example below. First df:

df1 <- data.frame(country = c("01", "01", "02", "03", "03", "03" , "04", "05"),
                  date = c("2020-01-01", "2020-01-02", "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-01", "2020-01-02"),
                  value = c(4, 3, 2, -3, 1.5, 12, 10, 15),
                  blabla = c(23, 41, 32, 8, 50, 27, 8, 7)
)

Second df:

df2 <- data.frame(       country = c("01",  "02", "03", "04", "05" ),
                  match_country1 = c("02",  "03", "01", "01", "01"), 
                  match_country2 = c("03",  "04", "02", "02", "03"), 
                  match_country3 = c("05",  "05", "04", "03", "04")
)

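Now I need to compute a new_value, which is the average of the values of the three match countries defined in df2. I need to respect the dates in df1. For example, the new_value for country "01" on date "2020-01-01" is the average of the values of countries "02", "03" and "05" on date "2020-01-01".

The desired output is as follows: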

new_df <- data.frame(country = c("01", "01", "02", "03", "03", "03" , "04", "05"),
                  date = c("2020-01-01", "2020-01-02", "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-01", "2020-01-02"),
                  value = c(4, 3, 2, -3, 1.5, 12, 10, 15),
                  blabla = c(23, 41, 32, 8, 50, 27, 8, 7),
                  new_value = c(NA, #because no data for 02, 03, 05 from 2020-01-01
                                (2-3+15)/3,
                                (-3+15)/2, #because no data for 04 from 2020-01-02
                                (3+2)/2, #because no data for 04 from 2020-01-02
                                NA,  #because no data for 01, 02, 04 from 2020-01-03
                                NA,  #because no data for 01, 02, 04 from 2020-01-04
                                4, #because no data for 02, 03 from 2020-01-01
                                (3-3)/2 #because no data for 04 from 2020-01-02
  )
)

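How can I do this?

This can be done with an SQL triple join. For each row of df1 (alias a), a left join fetches the row of df2 (alias c) with the same country; a second left join then fetches all rows of a second instance of df1 (alias b) that have the same date and whose country is one of the match countries listed in df2. The average of b.value over the matched rows gives new_value.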

library(sqldf)
sqldf("select a.*, avg(b.value) new_value
  from df1 a
  left join df2 c on a.country = c.country
  left join df1 b on a.date = b.date and 
    b.country in (c.match_country1, c.match_country2, c.match_country3)
  group by a.rowid")

giving this data frame:

  country       date value blabla new_value
1      01 2020-01-01   4.0     23        NA
2      01 2020-01-02   3.0     41  4.666667
3      02 2020-01-02   2.0     32  6.000000
4      03 2020-01-02  -3.0      8  2.500000
5      03 2020-01-03   1.5     50        NA
6      03 2020-01-04  12.0     27        NA
7      04 2020-01-01  10.0      8  4.000000
8      05 2020-01-02  15.0      7  0.000000
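
For readers without sqldf, the same join logic can be mirrored with merge() in base R. This is only a sketch of the idea, not the answerer's code: the helper names pairs, a, b, ac, acb and res, and the id and match_value columns, are introduced here for illustration.

```r
df1 <- data.frame(country = c("01", "01", "02", "03", "03", "03", "04", "05"),
                  date = c("2020-01-01", "2020-01-02", "2020-01-02", "2020-01-02",
                           "2020-01-03", "2020-01-04", "2020-01-01", "2020-01-02"),
                  value = c(4, 3, 2, -3, 1.5, 12, 10, 15),
                  blabla = c(23, 41, 32, 8, 50, 27, 8, 7))
df2 <- data.frame(country = c("01", "02", "03", "04", "05"),
                  match_country1 = c("02", "03", "01", "01", "01"),
                  match_country2 = c("03", "04", "02", "02", "03"),
                  match_country3 = c("05", "05", "04", "03", "04"))

# One row per (country, match_country) pair; unlist() stacks the match
# columns one after another, so country is repeated once per match column.
pairs <- data.frame(country = rep(df2$country, times = ncol(df2) - 1),
                    match_country = unlist(df2[-1], use.names = FALSE))

# First join: attach the candidate match countries to every row of df1.
# The id column (hypothetical name) remembers the original row order.
a <- cbind(id = seq_len(nrow(df1)), df1)
ac <- merge(a, pairs, by = "country")

# Second join: look up each match country's value on the same date.
# all.x = TRUE keeps pairs without data, like a SQL left join.
b <- data.frame(match_country = df1$country, date = df1$date,
                match_value = df1$value)
acb <- merge(ac, b, by = c("match_country", "date"), all.x = TRUE)

# Average per original row; rows without any match get NA instead of NaN.
avg <- tapply(acb$match_value, acb$id, function(v) {
  if (all(is.na(v))) NA_real_ else mean(v, na.rm = TRUE)
})

res <- df1
res$new_value <- as.vector(avg[as.character(a$id)])
res
```

The two merge() calls correspond to the two left joins in the query, and tapply() plays the role of group by a.rowid.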

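Variations

Here are two variations. The first builds the column list for the in (...) clause as the string matches and substitutes it into the query via the fn$ prefix; the second first converts df2 to long format.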

matches <- toString(names(df2)[-1])
fn$sqldf("select a.*, avg(b.value) new_value
  from df1 a
  left join df2 c on a.country = c.country
  left join df1 b on a.date = b.date and b.country in ($matches)
  group by a.rowid")


varying <- list(match_country = names(df2)[-1])
long <- reshape(df2, dir = "long", varying = varying, v.names = names(varying))
sqldf("select a.*, avg(b.value) new_value
  from df1 a
  left join long c on a.country = c.country
  left join df1 b on a.date = b.date and b.country = c.match_country
  group by a.rowid")
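
To make the two variations easier to follow, it may help to look at the intermediate objects they build. The sketch below uses only base R and the df2 from the question; it spells out direction in full where the code above relies on partial matching of dir.

```r
df2 <- data.frame(country = c("01", "02", "03", "04", "05"),
                  match_country1 = c("02", "03", "01", "01", "01"),
                  match_country2 = c("03", "04", "02", "02", "03"),
                  match_country3 = c("05", "05", "04", "03", "04"))

# Variation 1: the comma-separated column list substituted for $matches.
matches <- toString(names(df2)[-1])
matches
#> [1] "match_country1, match_country2, match_country3"

# Variation 2: df2 in long format, one row per (country, match_country) pair.
varying <- list(match_country = names(df2)[-1])
long <- reshape(df2, direction = "long", varying = varying,
                v.names = names(varying))
nrow(long)   # 5 countries x 3 match columns
#> [1] 15

# The match countries listed for country "01".
as.character(long$match_country[long$country == "01"])
#> [1] "02" "03" "05"
```

With df2 in long format, the in (...) condition becomes a plain equality, b.country = c.match_country, so the query no longer depends on how many match columns there are.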

This tidyverse approach may help:

df1
#>   country       date value blabla
#> 1      01 2020-01-01   4.0     23
#> 2      01 2020-01-02   3.0     41
#> 3      02 2020-01-02   2.0     32
#> 4      03 2020-01-02  -3.0      8
#> 5      03 2020-01-03   1.5     50
#> 6      03 2020-01-04  12.0     27
#> 7      04 2020-01-01  10.0      8
#> 8      05 2020-01-02  15.0      7

df2
#>   country match_country1 match_country2 match_country3
#> 1      01             02             03             05
#> 2      02             03             04             05
#> 3      03             01             02             04
#> 4      04             01             02             03
#> 5      05             01             03             04

suppressMessages(library(tidyverse))

df1 %>% 
  left_join(df2, by = 'country') %>% 
  nest(data = !date) %>%
  mutate(data = map(data, ~.x %>%
                      mutate(across(contains('match'), ~value[match(., country)])) %>%
                      rowwise() %>%
                      mutate(avg = mean(c_across(contains('match')), na.rm = TRUE)) %>%
                      select(!contains('match'))
                    )
         ) %>%
  unnest(data)
#> # A tibble: 8 x 5
#>   date       country value blabla    avg
#>   <chr>      <chr>   <dbl>  <dbl>  <dbl>
#> 1 2020-01-01 01        4       23 NaN   
#> 2 2020-01-01 04       10        8   4   
#> 3 2020-01-02 01        3       41   4.67
#> 4 2020-01-02 02        2       32   6   
#> 5 2020-01-02 03       -3        8   2.5 
#> 6 2020-01-02 05       15        7   0   
#> 7 2020-01-03 03        1.5     50 NaN   
#> 8 2020-01-04 03       12       27 NaN
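
Note that the output above contains NaN where the desired output in the question has NA. That is because mean() with na.rm = TRUE over values that are all missing returns NaN. A minimal base R sketch of the difference and of one way to convert the result (the vector avg below is a stand-in built for illustration, not the column from the pipeline):

```r
# mean() over nothing but NA, with na.rm = TRUE, averages an empty vector.
all_missing <- mean(c(NA_real_, NA_real_), na.rm = TRUE)
is.nan(all_missing)
#> [1] TRUE

# Stand-in for the avg column: the first entry had no matching data.
avg <- c(all_missing, 4, 14 / 3)

# Replace NaN by NA so the result matches the desired output.
avg[is.nan(avg)] <- NA_real_
is.nan(avg)
#> [1] FALSE FALSE FALSE
is.na(avg)
#> [1]  TRUE FALSE FALSE
```

is.na() is TRUE for both NA and NaN, so the distinction only matters for exact comparisons such as identical().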

Created on 2021-05-02 by the reprex package (v2.0.0)

Although there is already an accepted answer, here is a base R one, since both answers posted so far require external packages.

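# For each date, average the values of the match countries present on that date.
# Relies on df1 having at most one row per country and date, sorted by country
# within each date, because merge() sorts its result by country.
df1$new_value <- with(df1, ave(seq_len(nrow(df1)), date, FUN = function(i){
  mrg <- merge(df1[i, ], df2)            # attach the match columns to this date's rows
  j <- grep("^match", names(mrg))        # positions of the match_country columns
  ctry <- unique(df1[i, "country"])      # countries with data on this date
  apply(mrg[j], 1, function(row){
    k <- match(row, ctry)                # rows of mrg holding the match countries
    if(any(!is.na(k)))
      mean(mrg[k, "value"], na.rm = TRUE)
    else NA_real_
  })
}))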

identical(df1$new_value, new_df$new_value)
#[1] TRUE

