简体   繁体   中英

Adding missing values after completing a tibble

I've got a tibble of data in long format where some observations are missing --- think sales of different articles on different dates, where an article might not have been sold on a given day at all.

I want to complete() this data so that all combinations of article and date are present in the tibble (this is for creating area plots, since ggplot 's geom_area() apparently averages neighboring values for the missing data otherwise).

The original tibble contains both an ID number and a textual label for each article, and I would like to also add the same textual labels to the complete() d tibble (this is so that I'll then be able to glue() together a more informative label for a subsequent plot); I'm unsure how to do this in a tidy manner, however.

The mapping of ID numbers to labels is available in another tibble.

Here is an artificial example:

library(tidyverse)

articleData  = tibble(article = 1:3, description = c("Foo", "Bar", "Baz"), price = c(1, 2, 2))

article     <- c(1:3, 1, 3, 1:3)
description <- map_chr(article, ~ articleData[[., "description"]])
date        <- as.Date(c(rep("2023-01-01", times = 3), rep("2023-01-02", times = 2), rep("2023-01-03", times = 3)))
sales       <- c(21, 58, 35, 32, 47, 25, 47, 42)
price       <- map_int(article, ~ articleData[[., "price"]])

data <- tibble(
    Date        = date,
    Article     = as_factor(article),
    ArticleName = description,
    Sales       = sales,
    Price       = price,
    Turnover    = sales * price
)
data

data %>%
    group_by(Date) %>%
    summarize(
        Date          = unique(Date),
        TurnoverTotal = sum(Turnover),
        SalesTotal    = sum(Sales)
    ) %>%
    full_join(data) %>%
    mutate(
        TurnoverShare = Turnover / TurnoverTotal,
        SalesShare    = Sales / SalesTotal
    ) ->
    data

data

# This does not work.
data %>% ggplot(mapping = aes(x = Date, y = TurnoverShare, fill = Article)) +
    geom_area()

#  Explicitly set missing shares to zero.
data %>% complete(
    Date,
    Article,
    fill = list(
        SalesShare    = 0,
        TurnoverShare = 0
    )
) -> data2

data2

# This works.
data2 %>% ggplot(mapping = aes(x = Date, y = TurnoverShare, fill = Article)) +
    geom_area()

In this example, article 2 (the "Bar") was not sold no 2023-01-02, so ArticleName is NA in the relevant row in data2 . Is there a way to directly set ArticleName based on Article within the call to complete() , using the articleData tibble? Alternatively, is there a good (tidy) way of adding this afterward?

Thank you.

You can achieve this using complete() provided that the names and types match in both data frames ( complete() is powered in part by a join afterall).

library(dplyr)
library(tidyr)

# Name and types must match in x and y data frames
articleData <- tibble(article = 1:3, description = c("Foo", "Bar", "Baz"), price = c(1, 2, 2)) %>%
  rename(Article = article, ArticleName = description, Price = price) %>%
  mutate(Article = factor(Article))

data %>%
  complete(
    Date,
    articleData,
    fill = list(
      SalesShare    = 0,
      TurnoverShare = 0
    )
  )
  
# A tibble: 9 × 10
  Date       Article ArticleName Price TurnoverTotal SalesTotal Sales Turnover TurnoverShare SalesShare
  <date>     <fct>   <chr>       <dbl>         <dbl>      <dbl> <dbl>    <dbl>         <dbl>      <dbl>
1 2023-01-01 1       Foo             1           207        114    21       21         0.101      0.184
2 2023-01-01 2       Bar             2           207        114    58      116         0.560      0.509
3 2023-01-01 3       Baz             2           207        114    35       70         0.338      0.307
4 2023-01-02 1       Foo             1           126         79    32       32         0.254      0.405
5 2023-01-02 2       Bar             2            NA         NA    NA       NA         0          0    
6 2023-01-02 3       Baz             2           126         79    47       94         0.746      0.595
7 2023-01-03 1       Foo             1           203        114    25       25         0.123      0.219
8 2023-01-03 2       Bar             2           203        114    47       94         0.463      0.412
9 2023-01-03 3       Baz             2           203        114    42       84         0.414      0.368

Naturally, one possible way of doing this occurred to me five minutes after posting this question; here I can do

data %>% complete(
    Date,
    Article,
    fill = list(
        SalesShare    = 0,
        TurnoverShare = 0
    )) %>%
    mutate(
        ArticleName = map_chr(as.integer(Article), ~ articleData[[., "description"]]),
        Price       = map_int(as.integer(Article), ~ articleData[[., "price"]])
    ) -> data2

when creating the complete() 'd tibble --- as.integer() being the key requirement. I would still welcome better, tidier and/or more elegant solutions, however.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM