简体   繁体   中英

Group by and replace a value in group if two other values in group are NA

I have the following df:

df <- data.frame(comp_name = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
                 year = c("2016", "2016", "2016", "2017","2017", "2017", "2016","2016", "2016", "2017", "2017", "2017"),
                 indicator = c("total_revenue", "overseas_revenue", "domestic_revenue", "total_revenue", "overseas_revenue", "domestic_revenue","total_revenue", "overseas_revenue", "domestic_revenue","total_revenue", "overseas_revenue", "domestic_revenue"),
                 value = c(100, NA, NA, 100, 20, 80, 90, NA, 60, 90, NA, NA))

The df looks like this:

comp_name year indicator value
A 2016 total_revenue 100
A 2016 overseas_revenue NA
A 2016 domestic_revenue NA
A 2017 total_revenue 100
A 2017 overseas_revenue 20
A 2017 domestic_revenue 80
B 2016 total_revenue 90
B 2016 overseas_revenue NA
B 2016 domestic_revenue 60
B 2017 total_revenue 90
B 2017 overseas_revenue NA
B 2017 domestic_revenue NA

I want to group by comp_name and year and apply the following rule to each group: if value for overseas_revenue AND domestic_revenue are NA, set value for domestic_revenue equal to the value for total_revenue, else do nothing.

The resulting df should look like this:

comp_name year indicator value
A 2016 total_revenue 100
A 2016 overseas_revenue NA
A 2016 domestic_revenue 100
A 2017 total_revenue 100
A 2017 overseas_revenue 20
A 2017 domestic_revenue 80
B 2016 total_revenue 90
B 2016 overseas_revenue NA
B 2016 domestic_revenue 60
B 2017 total_revenue 90
B 2017 overseas_revenue NA
B 2017 domestic_revenue 90

My actual dataset has 500k + rows with 12 different indicators and I haven't been able to find an approach that works. Any help would be greatly appreciated-thanks!

You can do it with two pivots:

library(dplyr)
library(tidyr)
  df <- data.frame(comp_name = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
                   year = c("2016", "2016", "2016", "2017","2017", "2017", "2016","2016", "2016", "2017", "2017", "2017"),
                   indicator = c("total_revenue", "overseas_revenue", "domestic_revenue", "total_revenue", "overseas_revenue", "domestic_revenue","total_revenue", "overseas_revenue", "domestic_revenue","total_revenue", "overseas_revenue", "domestic_revenue"),
                   value = c(100, NA, NA, 100, 20, 80, 90, NA, 60, 90, NA, NA))
  df %>% 
  pivot_wider(names_from="indicator", 
              values_from = "value") %>% 
  mutate(domestic_revenue = case_when(
    is.na(overseas_revenue) & is.na(domestic_revenue) ~ total_revenue, 
    TRUE ~ domestic_revenue)) %>% 
  pivot_longer(-c(comp_name, year), 
               names_to = "indicator", 
               values_to = "value")
#> # A tibble: 12 × 4
#>    comp_name year  indicator        value
#>    <chr>     <chr> <chr>            <dbl>
#>  1 A         2016  total_revenue      100
#>  2 A         2016  overseas_revenue    NA
#>  3 A         2016  domestic_revenue   100
#>  4 A         2017  total_revenue      100
#>  5 A         2017  overseas_revenue    20
#>  6 A         2017  domestic_revenue    80
#>  7 B         2016  total_revenue       90
#>  8 B         2016  overseas_revenue    NA
#>  9 B         2016  domestic_revenue    60
#> 10 B         2017  total_revenue       90
#> 11 B         2017  overseas_revenue    NA
#> 12 B         2017  domestic_revenue    90

Created on 2022-04-28 by the reprex package (v2.0.1)

require(tidyverse)
df %>%
  spread(indicator, value) %>%
  mutate(domestic_revenue = case_when(
    is.na(domestic_revenue) & is.na(overseas_revenue) ~ total_revenue, 
    TRUE ~ domestic_revenue
  )) %>% 
  gather(c(-comp_name, -year), key = indicator, value = value) %>%  
  arrange(comp_name, year)

# A tibble: 12 x 4
   comp_name year  indicator        value
   <chr>     <chr> <chr>            <dbl>
 1 A         2016  domestic_revenue   100
 2 A         2016  overseas_revenue    NA
 3 A         2016  total_revenue      100
 4 A         2017  domestic_revenue    80
 5 A         2017  overseas_revenue    20
 6 A         2017  total_revenue      100
 7 B         2016  domestic_revenue    60
 8 B         2016  overseas_revenue    NA
 9 B         2016  total_revenue       90
10 B         2017  domestic_revenue    90
11 B         2017  overseas_revenue    NA
12 B         2017  total_revenue       90

I think best to create a very simple function that handles the adjustment you want, and apply that function by group. It will be a lot faster than pivoting.

f <- function(i,v) {
  if(all(is.na(v[grepl("^(o|d)",i)]))) v[i=="domestic_revenue"]=v[i=="total_revenue"]
  return(v)
}

Using data.table (will be fast)

setDT(df)[,value:=f(indicator,value), by=.(comp_name, year)]

Using dplyr (will be slower, but still faster than pivoting)

df %>% 
  group_by(comp_name,year) %>%
  mutate(value=f(indicator,value))

Output:

    comp_name   year        indicator value
       <char> <char>           <char> <num>
 1:         A   2016    total_revenue   100
 2:         A   2016 overseas_revenue    NA
 3:         A   2016 domestic_revenue   100
 4:         A   2017    total_revenue   100
 5:         A   2017 overseas_revenue    20
 6:         A   2017 domestic_revenue    80
 7:         B   2016    total_revenue    90
 8:         B   2016 overseas_revenue    NA
 9:         B   2016 domestic_revenue    60
10:         B   2017    total_revenue    90
11:         B   2017 overseas_revenue    NA
12:         B   2017 domestic_revenue    90

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM