Group by and replace a value in group if two other values in group are NA

Question

I have the following df:

df <- data.frame(comp_name = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
                 year = c("2016", "2016", "2016", "2017","2017", "2017", "2016","2016", "2016", "2017", "2017", "2017"),
                 indicator = c("total_revenue", "overseas_revenue", "domestic_revenue", "total_revenue", "overseas_revenue", "domestic_revenue","total_revenue", "overseas_revenue", "domestic_revenue","total_revenue", "overseas_revenue", "domestic_revenue"),
                 value = c(100, NA, NA, 100, 20, 80, 90, NA, 60, 90, NA, NA))

The df looks like this:

comp_name	year	indicator	value
A	2016	total_revenue	100
A	2016	overseas_revenue	NA
A	2016	domestic_revenue	NA
A	2017	total_revenue	100
A	2017	overseas_revenue	20
A	2017	domestic_revenue	80
B	2016	total_revenue	90
B	2016	overseas_revenue	NA
B	2016	domestic_revenue	60
B	2017	total_revenue	90
B	2017	overseas_revenue	NA
B	2017	domestic_revenue	NA

I want to group by comp_name and year and apply the following rule to each group: if value for overseas_revenue AND domestic_revenue are NA, set value for domestic_revenue equal to the value for total_revenue, else do nothing.

The resulting df should look like this:

comp_name	year	indicator	value
A	2016	total_revenue	100
A	2016	overseas_revenue	NA
A	2016	domestic_revenue	100
A	2017	total_revenue	100
A	2017	overseas_revenue	20
A	2017	domestic_revenue	80
B	2016	total_revenue	90
B	2016	overseas_revenue	NA
B	2016	domestic_revenue	60
B	2017	total_revenue	90
B	2017	overseas_revenue	NA
B	2017	domestic_revenue	90

My actual dataset has 500k + rows with 12 different indicators and I haven't been able to find an approach that works. Any help would be greatly appreciated-thanks!

Answer 1

You can do it with two pivots:

library(dplyr)
library(tidyr)
  df <- data.frame(comp_name = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
                   year = c("2016", "2016", "2016", "2017","2017", "2017", "2016","2016", "2016", "2017", "2017", "2017"),
                   indicator = c("total_revenue", "overseas_revenue", "domestic_revenue", "total_revenue", "overseas_revenue", "domestic_revenue","total_revenue", "overseas_revenue", "domestic_revenue","total_revenue", "overseas_revenue", "domestic_revenue"),
                   value = c(100, NA, NA, 100, 20, 80, 90, NA, 60, 90, NA, NA))
  df %>% 
  pivot_wider(names_from="indicator", 
              values_from = "value") %>% 
  mutate(domestic_revenue = case_when(
    is.na(overseas_revenue) & is.na(domestic_revenue) ~ total_revenue, 
    TRUE ~ domestic_revenue)) %>% 
  pivot_longer(-c(comp_name, year), 
               names_to = "indicator", 
               values_to = "value")
#> # A tibble: 12 × 4
#>    comp_name year  indicator        value
#>    <chr>     <chr> <chr>            <dbl>
#>  1 A         2016  total_revenue      100
#>  2 A         2016  overseas_revenue    NA
#>  3 A         2016  domestic_revenue   100
#>  4 A         2017  total_revenue      100
#>  5 A         2017  overseas_revenue    20
#>  6 A         2017  domestic_revenue    80
#>  7 B         2016  total_revenue       90
#>  8 B         2016  overseas_revenue    NA
#>  9 B         2016  domestic_revenue    60
#> 10 B         2017  total_revenue       90
#> 11 B         2017  overseas_revenue    NA
#> 12 B         2017  domestic_revenue    90

^{Created on 2022-04-28 by the reprex package (v2.0.1)}

Answer 2

require(tidyverse)
df %>%
  spread(indicator, value) %>%
  mutate(domestic_revenue = case_when(
    is.na(domestic_revenue) & is.na(overseas_revenue) ~ total_revenue, 
    TRUE ~ domestic_revenue
  )) %>% 
  gather(c(-comp_name, -year), key = indicator, value = value) %>%  
  arrange(comp_name, year)

# A tibble: 12 x 4
   comp_name year  indicator        value
   <chr>     <chr> <chr>            <dbl>
 1 A         2016  domestic_revenue   100
 2 A         2016  overseas_revenue    NA
 3 A         2016  total_revenue      100
 4 A         2017  domestic_revenue    80
 5 A         2017  overseas_revenue    20
 6 A         2017  total_revenue      100
 7 B         2016  domestic_revenue    60
 8 B         2016  overseas_revenue    NA
 9 B         2016  total_revenue       90
10 B         2017  domestic_revenue    90
11 B         2017  overseas_revenue    NA
12 B         2017  total_revenue       90

Answer 3

I think best to create a very simple function that handles the adjustment you want, and apply that function by group. It will be a lot faster than pivoting.

f <- function(i,v) {
  if(all(is.na(v[grepl("^(o|d)",i)]))) v[i=="domestic_revenue"]=v[i=="total_revenue"]
  return(v)
}

Using data.table (will be fast)

setDT(df)[,value:=f(indicator,value), by=.(comp_name, year)]

Using dplyr (will be slower, but still faster than pivoting)

df %>% 
  group_by(comp_name,year) %>%
  mutate(value=f(indicator,value))

Output:

    comp_name   year        indicator value
       <char> <char>           <char> <num>
 1:         A   2016    total_revenue   100
 2:         A   2016 overseas_revenue    NA
 3:         A   2016 domestic_revenue   100
 4:         A   2017    total_revenue   100
 5:         A   2017 overseas_revenue    20
 6:         A   2017 domestic_revenue    80
 7:         B   2016    total_revenue    90
 8:         B   2016 overseas_revenue    NA
 9:         B   2016 domestic_revenue    60
10:         B   2017    total_revenue    90
11:         B   2017 overseas_revenue    NA
12:         B   2017 domestic_revenue    90

Group by and replace a value in group if two other values in group are NA

Question

3 answers

solution1
0 ACCPTED 2022-04-28 12:02:16

solution2
0 2022-04-28 12:08:35

solution3
0 2022-04-28 12:20:59

Group by and replace a value in group if two other values in group are NA

Question

3 answers

solution1 0 ACCPTED 2022-04-28 12:02:16

solution2 0 2022-04-28 12:08:35

solution3 0 2022-04-28 12:20:59

solution1
0 ACCPTED 2022-04-28 12:02:16

solution2
0 2022-04-28 12:08:35

solution3
0 2022-04-28 12:20:59