How to include missing data points in R

Question

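I have a large data frame (900,000 rows) on mergers and acquisitions (M&A).

The df has four columns: date (the year the M&A was completed), target_nation (the country of the company that was merged/acquired), acquiror_nation (the country of the acquiring company), and big_corp_TF (whether the acquiror is a big corporation; TRUE means it is). Here is a sample of my data: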

> df <- structure(list(date = c(2000L, 2000L, 2001L, 2001L, 2001L, 2002L, 
    2002L, 2002L, 2003L, 2003L, 2004L, 2004L, 2004L, 2006L, 2006L
    ), target_nation = c("Uganda", "Uganda", "Uganda", "Uganda", 
    "Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Uganda", 
    "Uganda", "Uganda", "Uganda", "Uganda"), acquiror_nation = c("France", 
    "Germany", "France", "France", "Germany", "France", "France", 
    "Germany", "Germany", "Germany", "France", "France", "Germany", 
    "France", "France"), big_corp_TF = c(TRUE, FALSE, TRUE, FALSE, FALSE, 
    TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)), row.names = c(NA, 
    -15L))

> df
    
        date target_nation acquiror_nation big_corp_TF
     1: 2000        Uganda          France     TRUE
     2: 2000        Uganda         Germany    FALSE
     3: 2001        Uganda          France     TRUE
     4: 2001        Uganda          France    FALSE
     5: 2001        Uganda         Germany    FALSE
     6: 2002        Uganda          France     TRUE
     7: 2002        Uganda          France     TRUE
     8: 2002        Uganda         Germany     TRUE
     9: 2003        Uganda         Germany     TRUE
    10: 2003        Uganda         Germany    FALSE
    11: 2004        Uganda          France     TRUE
    12: 2004        Uganda          France    FALSE
    13: 2004        Uganda         Germany     TRUE
    14: 2006        Uganda          France     TRUE
    15: 2006        Uganda          France     TRUE

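Note: there is no row for France in 2003, and there are no rows at all for 2005.

From this data I want to create a new variable giving the share of M&As carried out by big corporations from a specific acquiror nation, computed as a 2-year average. (In my actual exercise I will compute 5-year averages, but let's keep it simple here.) So there would be one new variable for big French corporations and one for big German corporations.

I was advised to use the following code: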

library(runner)
library(tidyverse)
df <- df %>% as.data.frame()
param <- 'France'
df %>% 
  group_by(date, target_nation) %>%
  mutate(n1 = n()) %>%
  group_by(date, target_nation, acquiror_nation) %>%
  summarise(n1 = mean(n1),
            n2 = sum(big_corp_TF), .groups = 'drop') %>%
  filter(acquiror_nation == param) %>%
  mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))

This outputs the following tibble:

   date target_nation acquiror_nation    n1    n2 share
  <int> <chr>         <chr>           <dbl> <int> <dbl>
1  2000 Uganda        France              2     1   0.5
2  2001 Uganda        France              3     1   0.4
3  2002 Uganda        France              3     2   0.5
4  2004 Uganda        France              3     1   0.5
5  2006 Uganda        France              2     2   0.6

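Note: there are no results for France for 2003 and 2005. I would like results for those years (since we are computing 2-year averages, it should be possible to get them). Also, the 2006 share is actually incorrect: it should be 1, because the average should use the 2005 value (i.e. 0) rather than the 2004 value.

I would like to obtain the following tibble: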

       date target_nation acquiror_nation    n1    n2 share
      <int> <chr>         <chr>           <dbl> <int> <dbl>
    1  2000 Uganda        France              2     1   0.5
    2  2001 Uganda        France              3     1   0.4
    3  2002 Uganda        France              3     2   0.5
    4  2003 Uganda        France              2     0   0.4
    5  2004 Uganda        France              3     1   0.2
    6  2005 Uganda        France              0     0   0.33
    7  2006 Uganda        France              2     2   1.0

Note: the 2006 result is different as well (because the 2-year average now uses 2005 rather than 2004).
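To make the expected numbers concrete, here is a minimal base R sketch that reproduces the share column of the desired tibble from its n1 and n2 columns. The helper roll2 is a hypothetical name introduced only for this illustration; it stands in for sum_run(x, k = 2) on a series with no missing years.

```r
# Yearly counts for France copied from the expected tibble above,
# with 2003 and 2005 present as explicit rows.
years <- 2000:2006
n1 <- c(2, 3, 3, 2, 3, 0, 2)  # all deals targeting Uganda in each year
n2 <- c(1, 1, 2, 0, 1, 0, 2)  # deals by big French acquirors in each year

# Hypothetical helper: current value plus the previous one
# (the first element has no predecessor, so it stands alone).
roll2 <- function(x) x + c(0, head(x, -1))

share <- roll2(n2) / roll2(n1)
round(setNames(share, years), 2)
```

Because every year has a row, the window for 2006 covers 2005 and 2006, which gives 2 / 2 = 1.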

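I know this is an issue with the raw data: some data points are simply missing. However, adding them to the original dataset seems very inconvenient; it is probably better to add them midway, for example after computing n1 and n2. What is the most convenient way to do that?

Any suggestions are much appreciated.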

Answer 1

Use tidyr::complete() together with the nesting() helper and the fill argument. The full code:

param <- 'France'

df %>% 
  mutate(d = 1) %>%
  complete(date = seq(min(date), max(date), 1), nesting(target_nation, acquiror_nation),
           fill = list(d =0, big_corp_TF = FALSE)) %>%
  group_by(date, target_nation) %>%
  mutate(n1 = sum(d)) %>%
  group_by(date, target_nation, acquiror_nation) %>%
  summarise(n1 = mean(n1),
            n2 = sum(big_corp_TF), .groups = 'drop') %>%
  filter(acquiror_nation == param) %>%
  mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))

# A tibble: 7 x 6
   date target_nation acquiror_nation    n1    n2 share
  <dbl> <chr>         <chr>           <dbl> <int> <dbl>
1  2000 Uganda        France              2     1 0.5  
2  2001 Uganda        France              3     1 0.4  
3  2002 Uganda        France              3     2 0.5  
4  2003 Uganda        France              2     0 0.4  
5  2004 Uganda        France              3     1 0.2  
6  2005 Uganda        France              0     0 0.333
7  2006 Uganda        France              2     2 1
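To see what complete() contributes on its own, here is a small sketch on a toy data frame. The object names toy and filled and the column n are made up for this illustration; only complete(), nesting() and the fill argument come from the answer above.

```r
library(tidyr)

# Toy yearly counts with gaps: no row for 2003 or 2005.
toy <- data.frame(
  date = c(2002, 2004, 2006),
  target_nation = "Uganda",
  acquiror_nation = "France",
  n = c(2, 1, 2)
)

# The date sequence adds every year between the first and last one;
# nesting() restricts the expansion to nation pairs that occur in the data;
# fill gives the value to use for n in the newly created rows.
filled <- complete(
  toy,
  date = seq(min(date), max(date), by = 1),
  nesting(target_nation, acquiror_nation),
  fill = list(n = 0)
)
filled
```

The result has one row per year from 2002 to 2006, with n set to 0 for the two years that were missing.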

Answer 2

df2 = df %>% 
  group_by(date, target_nation) %>%
  mutate(n1 = n()) %>%
  group_by(date, target_nation, acquiror_nation) %>%
  summarise(n1 = mean(n1),
            n2 = sum(big_corp_TF), .groups = 'drop') %>%
  filter(acquiror_nation == param)

dates = seq(min(df2$date), max(df2$date), by = 1)
dates = setdiff(dates, df2$date)
df3 = df2[rep(nrow(df2), each = length(dates)), ]
df3$n1 = 0; df3$n2 = 0; df3$date = dates

df2 = arrange(rbind(df2,df3), date)
df2 = df2 %>% mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))
df2
# A tibble: 7 x 6
   date target_nation acquiror_nation    n1    n2 share
  <dbl> <fct>         <fct>           <dbl> <dbl> <dbl>
1  2000 Uganda        France              2     1 0.5  
2  2001 Uganda        France              3     1 0.4  
3  2002 Uganda        France              3     2 0.5  
4  2003 Uganda        France              0     0 0.667
5  2004 Uganda        France              3     1 0.333
6  2005 Uganda        France              0     0 0.333
7  2006 Uganda        France              2     2 1

Explanation

First, create df2 from your df, but without computing share. Then create the sequence of dates from the minimum to the maximum:

dates = seq(min(df2$date), max(df2$date), by = 1)

Keep only the dates that are missing from df2:

dates = setdiff(dates, df2$date)

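Create one row per missing date and set n1 and n2 to 0 (note that this also sets n1 to 0 for 2003, although the sample data contain two German deals that year, which is why the 2003 and 2004 shares above differ from the output expected in the question):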

df3 = df2[rep(nrow(df2), each = length(dates)), ]
df3$n1 = 0; df3$n2 = 0; df3$date = dates

Combine the rows and sort by date:

df2 = arrange(rbind(df2,df3), date)

Finally, compute share:

df2 = df2 %>% mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))

Apologies that this does not follow tidyverse syntax.

Answer 1
2 Accepted 2021-05-03 10:39:38

Answer 2
0 2021-05-02 23:14:45