ggplot: how to "correct" an unrepresentative spike in the plot

Question

I have percent scores recorded with a datetime stamp (date plus hours:minutes:seconds). I want to graphically "correct" or highlight a data point that isn't representative.

Background

I have data about how people rate their happiness level on a daily basis, on a continuous scale running 0 -> 1, where 0 means "extremely unhappy" and 1 means "extremely happy". I ask many people and want to get a sense of "happiness in the group" over time.

Data

library(tidyverse)
library(lubridate)

set.seed(1234)

original_df <- 
  seq(as.POSIXct('2020-09-01', tz = "UTC"), as.POSIXct('2020-09-15', tz = "UTC"), by="1 mins") %>%
  sample(15000, replace = T) %>%
  as_tibble %>%
  rename(date_time = value) %>%
  mutate(date = date(date_time)) %>%
  add_column(score = runif(15000))

original_df

## # A tibble: 15,000 x 3
##  date_time           date       score
##    <dttm>              <date>     <dbl>
##  1 2020-09-06 04:11:00 2020-09-06 0.683
##  2 2020-09-06 13:35:00 2020-09-06 0.931
##  3 2020-09-05 23:21:00 2020-09-05 0.121
##  4 2020-09-06 14:45:00 2020-09-06 0.144
##  5 2020-09-07 09:15:00 2020-09-07 0.412
##  6 2020-09-01 10:22:00 2020-09-01 0.564
##  7 2020-09-11 14:00:00 2020-09-11 0.960
##  8 2020-09-08 13:24:00 2020-09-08 0.845
##  9 2020-09-01 15:33:00 2020-09-01 0.225
## 10 2020-09-09 19:27:00 2020-09-09 0.815
## # ... with 14,990 more rows

However, it turns out that one of the days happens to have substantially fewer data points. Thus, the actual data set looks like the following:

actual_df <- 
  original_df %>%
  filter(date %in% as_date("2020-09-10")) %>%
  group_by(date) %>%
  slice_sample(n = 15) %>%
  ungroup %>%
  bind_rows(original_df %>% filter(!date %in% as_date("2020-09-10")))

actual_df %>% count(date)

## # A tibble: 14 x 2
##    date           n
##    <date>     <int>
##  1 2020-09-01  1073
##  2 2020-09-02  1079
##  3 2020-09-03  1118
##  4 2020-09-04  1036
##  5 2020-09-05  1025
##  6 2020-09-06  1089
##  7 2020-09-07  1040
##  8 2020-09-08  1186
##  9 2020-09-09  1098
## 10 2020-09-10    15 ## <- this day has less data 
## 11 2020-09-11  1095
## 12 2020-09-12  1051
## 13 2020-09-13  1037
## 14 2020-09-14  1034

Plotting this data over time

What I've been doing relies on working with means

I treat every day as a factor, and get the daily mean. Statistically, this solution might be far from ideal, as @BrianLang commented below. However, right now this is the method I chose.

library(emmeans)

model_fit <- 
  actual_df %>%
  mutate(across(date, factor)) %>%
  lm(score ~ date, data = .)

emmeans_fit_data <- emmeans(model_fit, ~ date, CIs = TRUE)

emmeans_fit_data %>%
  as_tibble %>%
  ggplot(data = ., aes(x = date, y = emmean)) +
  geom_line(color = "#1a476f", group = 1, lwd = 1) +
  geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), alpha = 0.5, color = "#90353b", width = 0.2) +
  geom_text(aes(label = paste0(round(100 * emmean, 1), "%")), color = "#90353b", vjust = -4, hjust = 0.5, size = 3.5) +
  geom_point(color = "#1a476f") +
  scale_y_continuous(labels = function(x) paste0(100*x, "%")) +
  ylab("Level of Happiness") +
  xlab("Date") +
  ggtitle("Mood Over Time") +
  theme(plot.title = element_text(hjust = 0.5, size = 14),
        axis.text.x=element_text(angle = -60, hjust = 0),
        axis.title.x = element_blank(),
        legend.title = element_blank(),
        plot.caption = element_text(hjust = 0, size = 8),
        legend.position = "none")

But then I get a spike on 2020-09-10, which is only due to the low number of data points that day. One graphical solution would be to dash the problematic segment and "complete" the line as it would have looked with enough data points, perhaps by averaging the day before and the day after. I don't want to drop the real data, but I do want to show graphically that this value is unrepresentative and that the real value was probably much closer to the neighboring days. Dashed lines seem like a reasonable way to do that.
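The dashed-line idea can be sketched roughly as follows. This is only an illustration under stated assumptions: `daily` is a made-up stand-in for the emmeans summary with an added per-day count `n`, `min_n` is a hypothetical cutoff for "too few data points", and the code assumes a single flagged day that is neither the first nor the last day (otherwise there is no neighbor on one side to average with).

```r
library(dplyr)
library(ggplot2)

# Toy daily summary standing in for the emmeans output; all values are invented.
daily <- tibble(
  date   = as.Date("2020-09-08") + 0:4,
  emmean = c(0.50, 0.51, 0.71, 0.49, 0.50),
  n      = c(1186, 1098, 15, 1095, 1051)
)

min_n <- 100  # hypothetical cutoff for "not enough data"

daily <- daily %>%
  mutate(
    low_n = n < min_n,
    # Average of the previous and the next day (NA on the first and last row).
    neighbour_mean = (lag(emmean) + lead(emmean)) / 2,
    # Keep the observed mean on reliable days, use the neighbor average otherwise.
    corrected = if_else(low_n, neighbour_mean, emmean)
  )

# The flagged day plus its two neighbors: the observed spike is drawn through
# these rows as a dashed line. Assumes exactly one flagged day.
spike <- daily %>%
  filter(low_n | lag(low_n, default = FALSE) | lead(low_n, default = FALSE))

p <- ggplot(daily, aes(x = date)) +
  # Solid line: observed means, with the flagged day replaced by the neighbor average.
  geom_line(aes(y = corrected), color = "#1a476f") +
  # Dashed line: the observed (unrepresentative) spike, kept visible.
  geom_line(data = spike, aes(y = emmean), color = "#1a476f", linetype = "dashed") +
  # Hollow point for the flagged day, filled points for the rest.
  geom_point(aes(y = emmean, shape = low_n), color = "#1a476f") +
  scale_shape_manual(values = c(`FALSE` = 16, `TRUE` = 1), guide = "none")

p
```

The real value stays on the plot (hollow point, dashed segment), while the solid line shows where the trend would sit if the flagged day behaved like its neighbors. Swapping the two linetypes is a matter of taste.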

Alternatively, I was hoping there might be a different way to model/plot such "by-time" data using ggplot's smoothing, one that gives a smoother trend line and a confidence ribbon that accounts for the problematic day. I understand that this might be beyond the scope of this question, so I'm adding it only as a side note, in case someone wants to suggest a solution based on different modeling rather than a purely graphical correction. I would be thankful for either.
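For reference, the smoothing idea might look something like the sketch below, which fits a smoother to the raw scores against the datetime instead of to daily means. It is a sketch under assumptions, not a recommendation: `toy_df` is a smaller simulated stand-in for `actual_df`, and the GAM settings are simply the ones ggplot2 itself falls back to for data sets with 1,000 or more rows (this needs the mgcv package, which ships with standard R installations).

```r
library(ggplot2)

set.seed(1234)

# Toy stand-in for actual_df: uniform scores at random minutes over two weeks.
minutes <- seq(as.POSIXct("2020-09-01", tz = "UTC"),
               as.POSIXct("2020-09-15", tz = "UTC"), by = "1 min")
toy_df <- data.frame(date_time = sample(minutes, 3000, replace = TRUE))
toy_df$score <- runif(nrow(toy_df))

# Thin out one day to 15 observations, mimicking the sparse day in the question.
sparse_day <- as.Date(toy_df$date_time) == as.Date("2020-09-10")
keep <- !sparse_day | seq_along(sparse_day) %in% head(which(sparse_day), 15)
toy_df <- toy_df[keep, ]

p <- ggplot(toy_df, aes(x = date_time, y = score)) +
  # Penalized regression spline over time with a 95% confidence ribbon.
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"), level = 0.95) +
  scale_y_continuous(labels = function(x) paste0(100 * x, "%")) +
  labs(x = NULL, y = "Level of Happiness")

p
```

Because the smoother borrows strength from neighboring days, a day with only 15 observations no longer produces its own spike. The ribbon is a confidence band for the smoothed mean, so how much it widens around the sparse day depends on the smoother's flexibility.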

Answer 1

Without wanting to get into time-series models, you could imagine transforming your time variable with restricted cubic splines.

I needed to change a bit of your code so I could avoid installing the newest versions of some packages ;-).

Note that I changed some variable names: date is also a function name (lubridate::date), so it shouldn't double as a variable name.

library(chron)

## Assumes the day column of original_df has been renamed from date to datez.
## Added a numeric version of your date variable (days since 1970-01-01).
actual_df <- original_df %>%
 filter(datez %in% lubridate::date("2020-09-10")) %>%
 sample_n(size = 15) %>%
 bind_rows(original_df %>% filter(!datez %in% lubridate::date("2020-09-10"))) %>%
 mutate(num_date = as.numeric(datez))

## How many knots across the dates do you want?
number_of_knots = 15

## This makes sure that visreg is passed the actual knot locations! rms::rcs does not store them in the model fit.
knots <- paste0("c(", paste0(attr(rms::rcs(actual_df$num_date, number_of_knots), "parms"), collapse = ", "), ")") 

## We can construct the formula early.
formula <- as.formula(paste("score ~ rms::rcs(num_date,", knots,")"))

## Fit the model as a Gaussian glm and pass it to visreg for its prediction function. This gives you predicted means and a 95% CI for each mean. Then I convert the numeric dates back to real dates.
glm_rcs <- glm(data = actual_df, formula = formula, family = "gaussian") %>%
 visreg::visreg(plot = FALSE) %>%
 .$fit %>%
 mutate(date_date = chron::as.chron(num_date) %>% as.POSIXct())

## plot it!
ggplot(data = glm_rcs, aes(date_date,
                           y = visregFit)) + 
 geom_ribbon(aes(ymin = visregLwr, ymax = visregUpr), alpha = .5) +
 geom_line()

EDIT: You collect the data by day, but you could add jitter to the dates so that the points are spread out over each day.

actual_df <- original_df %>%
 filter(datez %in% lubridate::date("2020-09-10")) %>%
 sample_n(size = 15) %>%
 bind_rows(original_df %>% filter(!datez %in% lubridate::date("2020-09-10"))) %>%
 mutate(num_date = as.numeric(datez)) %>%
 ## Here we add random noise (uniform -.5 to .5) to each numeric date.
 mutate(jittered_date = num_date + runif(n(), -.5, .5))

## You can lower this number to increase smoothing.
number_of_knots = 15

knots <- paste0("c(", paste0(attr(rms::rcs(actual_df$jittered_date, number_of_knots), "parms"), collapse = ", "), ")")

formula <- as.formula(paste("score ~ rms::rcs(jittered_date,", knots,")"))

glm_rcs <- glm(data = actual_df, formula = formula, family = "gaussian") %>% visreg::visreg(plot = F) %>% .$fit %>%
 mutate(date_date = chron::as.chron(jittered_date) %>% as.POSIXct())

ggplot(data = glm_rcs, aes(date_date,
                           y = visregFit)) +
 geom_ribbon(aes(ymin = visregLwr, ymax = visregUpr), alpha = .5) +
 geom_line()

Edit 2:

Jittering the points isn't as necessary if you have a datetime vector rather than a simple day. Your original code for creating the fake data uses lubridate::date(), which takes your POSIXct datetime vector and strips it down to a simple date! You can avoid this with something like this:

original_df <- tibble(datez = seq(as.POSIXct('2020-09-01', tz = "UTC"), as.POSIXct('2020-09-15', tz = "UTC"), by="1 mins") %>%
 sample(15000, replace = T)) %>%
 mutate(datez_day = lubridate::date(datez)) %>%
 add_column(score = runif(15000))

actual_df <- original_df %>%
 filter(datez_day %in% lubridate::date("2020-09-10")) %>%
 sample_n(size = 15) %>%
 bind_rows(original_df %>% filter(!datez_day %in% lubridate::date("2020-09-10"))) %>%
 mutate(num_date = as.numeric(datez))

Now you have datez_day, which is the day value; datez, which is a datetime; and num_date, which is a numeric representation of the datetime.

From there you can model directly on num_date without adding any jitter.

number_of_knots = 20

knots <- paste0("c(", paste0(attr(rms::rcs(actual_df$num_date, number_of_knots), "parms"), collapse = ", "), ")")

formula <- as.formula(paste("score ~ rms::rcs(num_date,", knots,")"))

glm_rcs <- glm(data = actual_df, formula = formula, family = "gaussian") %>% 
        visreg::visreg(plot = F) %>% 
        .$fit %>% 
        as_tibble() %>%
   ## Translate the num_date back into a datetime object so it is correct in the figures!
        mutate(date_date = as.POSIXct(round(num_date), origin = "1970-01-01", tz = "UTC"))

ggplot(data = glm_rcs, aes(date_date,
                           y = visregFit)) +
 geom_ribbon(aes(ymin = visregLwr, ymax = visregUpr), alpha = .5) +
 geom_line()

1 answer

solution1
1 ACCEPTED 2020-09-29 11:09:17
