简体   繁体   中英

How to include missing data points in r

This problem is a spin-off from my last post ( How to calculate moving average for two years in r ).

I have a big data frame (900k rows) about mergers and acquisitions (M&As).

The df has four columns: date (when the M&A was completed), target_nation (a company of which country was merged/acquired), acquiror_nation (corporation of which country was the acquiror), and big_corp_TF (whether the acquiror was a big corporation or not, where TRUE means that corporation is big). Here is a sample of my data:

> df <- structure(list(date = c(2000L, 2000L, 2001L, 2001L, 2001L, 2002L, 
    2002L, 2002L, 2003L, 2003L, 2004L, 2004L, 2004L, 2006L, 2006L
    ), target_nation = c("Uganda", "Uganda", "Uganda", "Uganda", 
    "Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Uganda", 
    "Uganda", "Uganda", "Uganda", "Uganda"), acquiror_nation = c("France", 
    "Germany", "France", "France", "Germany", "France", "France", 
    "Germany", "Germany", "Germany", "France", "France", "Germany", 
    "France", "France"), big_corp_TF = c(TRUE, FALSE, TRUE, FALSE, FALSE, 
    TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)), row.names = c(NA, 
    -15L))

> df
    
        date target_nation acquiror_nation big_corp_TF
     1: 2000        Uganda          France     TRUE
     2: 2000        Uganda         Germany    FALSE
     3: 2001        Uganda          France     TRUE
     4: 2001        Uganda          France    FALSE
     5: 2001        Uganda         Germany    FALSE
     6: 2002        Uganda          France     TRUE
     7: 2002        Uganda          France     TRUE
     8: 2002        Uganda         Germany     TRUE
     9: 2003        Uganda         Germany     TRUE
    10: 2003        Uganda         Germany    FALSE
    11: 2004        Uganda          France     TRUE
    12: 2004        Uganda          France    FALSE
    13: 2004        Uganda         Germany     TRUE
    14: 2006        Uganda          France     TRUE
    15: 2006        Uganda          France     TRUE

NB: There are no rows for France in 2003; and there is no year 2005.

From these data, I want to create a new variable that denotes the share of M&As done by big corporations of specific acquiror nations, counting the average for 2 years. (For my actual exercise, I will count the averages for 5 years, but let's keep things simpler here). So there would be a new variable for France's big corporations, and a new variable for Germany's big corporations.

I was suggested to use the following code:

library(runner)
library(tidyverse)
df <- df %>% as.data.frame()
param <- 'France'
df %>% 
  group_by(date, target_nation) %>%
  mutate(n1 = n()) %>%
  group_by(date, target_nation, acquiror_nation) %>%
  summarise(n1 = mean(n1),
            n2 = sum(big_corp_TF), .groups = 'drop') %>%
  filter(acquiror_nation == param) %>%
  mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))

Which outputs this tibble:

   date target_nation acquiror_nation    n1    n2 share
  <int> <chr>         <chr>           <dbl> <int> <dbl>
1  2000 Uganda        France              2     1   0.5
2  2001 Uganda        France              3     1   0.4
3  2002 Uganda        France              3     2   0.5
4  2004 Uganda        France              3     1   0.5
5  2006 Uganda        France              2     2   0.6

NB: there is no result for France for 2003 and 2005; I would like there to be results for 2003 and 2005 (because we are calculating 2-year averages and thus we should be able to have results for 2003 and 2005). Also, the share for 2006 is incorrect in reality, because it should be 1 (it should take the values of 2005 (which are 0s) rather than the values of 2004 for the calculation of average).

I would like to be able to receive the following tibble:

       date target_nation acquiror_nation    n1    n2 share
      <int> <chr>         <chr>           <dbl> <int> <dbl>
    1  2000 Uganda        France              2     1   0.5
    2  2001 Uganda        France              3     1   0.4
    3  2002 Uganda        France              3     2   0.5
    4  2003 Uganda        France              2     0   0.4
    5  2004 Uganda        France              3     1   0.2
    6  2005 Uganda        France              0     0   0.33
    7  2006 Uganda        France              2     2   1.0

NB: notice that the result for 2006 is also different (because we now take 2005 instead of 2004 for a two-year average).

I understand that this is a problem with the original data: it simply lacks certain data points. However, including them to the original data set seems to be highly inconvenient; it is probably better to include them mid-way, eg after counting the n1 and n2. But what is the most convenient way to do this?

Any suggestions are much appreciated.

use tidyr::complete along with its arguments nesting and fill . Full code that may be used.

param <- 'France'

df %>% 
  mutate(d = 1) %>%
  complete(date = seq(min(date), max(date), 1), nesting(target_nation, acquiror_nation),
           fill = list(d =0, big_corp_TF = FALSE)) %>%
  group_by(date, target_nation) %>%
  mutate(n1 = sum(d)) %>%
  group_by(date, target_nation, acquiror_nation) %>%
  summarise(n1 = mean(n1),
            n2 = sum(big_corp_TF), .groups = 'drop') %>%
  filter(acquiror_nation == param) %>%
  mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))

# A tibble: 7 x 6
   date target_nation acquiror_nation    n1    n2 share
  <dbl> <chr>         <chr>           <dbl> <int> <dbl>
1  2000 Uganda        France              2     1 0.5  
2  2001 Uganda        France              3     1 0.4  
3  2002 Uganda        France              3     2 0.5  
4  2003 Uganda        France              2     0 0.4  
5  2004 Uganda        France              3     1 0.2  
6  2005 Uganda        France              0     0 0.333
7  2006 Uganda        France              2     2 1
df2 = df %>% 
  group_by(date, target_nation) %>%
  mutate(n1 = n()) %>%
  group_by(date, target_nation, acquiror_nation) %>%
  summarise(n1 = mean(n1),
            n2 = sum(big_corp_TF), .groups = 'drop') %>%
  filter(acquiror_nation == param)

dates = seq(min(df2$date), max(df2$date), by = 1)
dates = setdiff(dates, df2$date)
df3 = df2[rep(nrow(df2), each = length(dates)), ]
df3$n1 = 0; df3$n2 = 0; df3$date = dates

df2 = arrange(rbind(df2,df3), date)
df2 = df2 %>% mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))
df2
# A tibble: 7 x 6
   date target_nation acquiror_nation    n1    n2 share
  <dbl> <fct>         <fct>           <dbl> <dbl> <dbl>
1  2000 Uganda        France              2     1 0.5  
2  2001 Uganda        France              3     1 0.4  
3  2002 Uganda        France              3     2 0.5  
4  2003 Uganda        France              0     0 0.667
5  2004 Uganda        France              3     1 0.333
6  2005 Uganda        France              0     0 0.333
7  2006 Uganda        France              2     2 1    

Explanation

First, create df2 based on your df but without calculating share . Create a sequence of dates from the minimum to the maximum:

dates = seq(min(df2$date), max(df2$date), by = 1)

Leave just the ones that are missing in df2 :

dates = setdiff(dates, df2$date)

Create a row for each missing date and set n1 and n2 to 0:

df3 = df2[rep(nrow(df2), each = length(dates)), ]
df3$n1 = 0; df3$n2 = 0; df3$date = dates

Combine the rows and sort by date:

df2 = arrange(rbind(df2,df3), date)

Finally, calculate share :

df2 = df2 %>% mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))

I apologise that this doesn't adhere to the tidyverse syntax

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM