ggplot: how to "correct" a non-representative peak in a plot

Question

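I have fraction (percentage) score data along a date-time axis (date plus hours:minutes:seconds). I would like to graphically "correct" or highlight a data point that is not representative.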

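Background

I have data on how people rate their level of happiness each day, on a continuous scale from 0 to 1, where 0 means "extremely unhappy" and 1 means "extremely happy". I asked many people, hoping to capture the "in-group happiness" over time.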

Data

library(tidyverse)
library(lubridate)

set.seed(1234)

original_df <- 
  seq(as.POSIXct('2020-09-01', tz = "UTC"), as.POSIXct('2020-09-15', tz = "UTC"), by="1 mins") %>%
  sample(15000, replace = T) %>%
  as_tibble %>%
  rename(date_time = value) %>%
  mutate(date = date(date_time)) %>%
  add_column(score = runif(15000))

original_df

## # A tibble: 15,000 x 3
##    date_time           date       score
##    <dttm>              <date>     <dbl>
##  1 2020-09-06 04:11:00 2020-09-06 0.683
##  2 2020-09-06 13:35:00 2020-09-06 0.931
##  3 2020-09-05 23:21:00 2020-09-05 0.121
##  4 2020-09-06 14:45:00 2020-09-06 0.144
##  5 2020-09-07 09:15:00 2020-09-07 0.412
##  6 2020-09-01 10:22:00 2020-09-01 0.564
##  7 2020-09-11 14:00:00 2020-09-11 0.960
##  8 2020-09-08 13:24:00 2020-09-08 0.845
##  9 2020-09-01 15:33:00 2020-09-01 0.225
## 10 2020-09-09 19:27:00 2020-09-09 0.815
## # ... with 14,990 more rows

However, it turns out that one of the days has far fewer data points. The actual data set therefore looks like this:

actual_df <- 
  original_df %>%
  filter(date %in% as_date("2020-09-10")) %>%
  group_by(date) %>%
  slice_sample(n = 15) %>%
  ungroup %>%
  bind_rows(original_df %>% filter(!date %in% as_date("2020-09-10")))

actual_df %>% count(date)

## # A tibble: 14 x 2
##    date           n
##    <date>     <int>
##  1 2020-09-01  1073
##  2 2020-09-02  1079
##  3 2020-09-03  1118
##  4 2020-09-04  1036
##  5 2020-09-05  1025
##  6 2020-09-06  1089
##  7 2020-09-07  1040
##  8 2020-09-08  1186
##  9 2020-09-09  1098
## 10 2020-09-10    15 ## <- this day has less data 
## 11 2020-09-11  1095
## 12 2020-09-12  1051
## 13 2020-09-13  1037
## 14 2020-09-14  1034

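Plotting this data over time

What I have been doing so far relies on means

I treat each day as a factor and get the daily means. Statistically, this solution is probably far from ideal, as @BrianLang commented below. For now, however, this is my method of choice.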

library(emmeans)

model_fit <- 
  actual_df %>%
  mutate(across(date, factor)) %>%
  lm(score ~ date, data = .)

emmeans_fit_data <- emmeans(model_fit, ~ date, CIs = TRUE)

emmeans_fit_data %>%
  as_tibble %>%
  ggplot(data = ., aes(x = date, y = emmean)) +
  geom_line(color = "#1a476f", group = 1, lwd = 1) +
  geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), alpha = 0.5, color = "#90353b", width = 0.2) +
  geom_text(aes(label = paste0(round(100*emmean, 1), "%")), color = "#90353b", vjust = -4, hjust = 0.5, size = 3.5) +
  geom_point(color = "#1a476f") +
  scale_y_continuous(labels = function(x) paste0(100*x, "%")) +
  ylab("Level of Happiness") +
  xlab("Date") +
  ggtitle("Mood Over Time") +
  theme(plot.title = element_text(hjust = 0.5, size = 14),
        axis.text.x=element_text(angle = -60, hjust = 0),
        axis.title.x = element_blank(),
        legend.title = element_blank(),
        plot.caption = element_text(hjust = 0, size = 8),
        legend.position = "none")

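But then I get this peak on 2020-09-10, which is only due to the small number of data points on that day. One graphical solution would be to dash out the problematic stretch of the line and "complete" it the way it would look with enough data points, perhaps based on the average of the day before and the day after. I don't want to throw away the real data, but I do want to stress graphically that this value is not representative and that the true value should be closer to the neighboring days. A dashed line seems like a reasonable graphical solution to me.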
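To make the dashed-line idea concrete, here is a minimal sketch. It does not use the emmeans output; it assumes a hypothetical daily summary table (`daily`, with made-up means and counts) and a hypothetical cutoff `min_n` for what counts as "enough" data. Segments that touch a low-count day are drawn dashed, and that day's point is drawn hollow, so the real value stays on the plot but is visibly marked as unreliable.

```r
library(ggplot2)

set.seed(1)
# Hypothetical daily summary: one row per day with its mean score and sample size
daily <- data.frame(
  day  = as.Date("2020-09-01") + 0:13,
  mean = runif(14, 0.48, 0.52),
  n    = 1000L
)
daily$n[10]    <- 15L    # the sparsely sampled day (2020-09-10)
daily$mean[10] <- 0.65   # its unrepresentative mean (made-up value)

min_n <- 100             # hypothetical cutoff for "enough" data
daily$low_n <- daily$n < min_n

# One row per line segment between consecutive days
segs <- data.frame(
  x    = head(daily$day, -1),
  xend = tail(daily$day, -1),
  y    = head(daily$mean, -1),
  yend = tail(daily$mean, -1)
)
# A segment is "uncertain" if either of its endpoints is a low-n day
segs$uncertain <- head(daily$low_n, -1) | tail(daily$low_n, -1)

p <- ggplot() +
  geom_segment(data = segs,
               aes(x = x, xend = xend, y = y, yend = yend, linetype = uncertain),
               color = "#1a476f") +
  geom_point(data = daily, aes(x = day, y = mean, shape = low_n),
             color = "#1a476f", size = 2) +
  # solid/filled for reliable days, dashed/hollow for the low-n day
  scale_linetype_manual(values = c("FALSE" = "solid", "TRUE" = "dashed")) +
  scale_shape_manual(values = c("FALSE" = 16, "TRUE" = 1)) +
  labs(x = NULL, y = "Level of Happiness") +
  theme(legend.position = "none")
p
```

The same flagging column could be joined onto the emmeans table by date; the cutoff is a judgment call and not something the data dictates.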

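Alternatively, I am hoping there is a different way to model and plot this kind of "over time" data using ggplot's smoothing, one that gives a smoother trend line and a confidence ribbon that accounts for the problematic day. I know this may be beyond the scope of the question, so I am adding it only as a side note, in case someone wants to propose a solution based on different modeling rather than just a graphical fix. I would be grateful for either.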
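For reference, a minimal sketch of what the smoothing alternative might look like. It rebuilds a simulated stand-in for the data (the data frame `df` is hypothetical, not the `actual_df` above) and fits a natural cubic spline through `geom_smooth(method = "lm")`. The choice of `df = 6` for the spline is an arbitrary assumption; fewer degrees of freedom give a smoother curve.

```r
library(ggplot2)

set.seed(1234)
# Simulated stand-in for the question's data: random timestamps, uniform scores
times <- seq(as.POSIXct("2020-09-01", tz = "UTC"),
             as.POSIXct("2020-09-15", tz = "UTC"), by = "1 min")
df <- data.frame(date_time = sample(times, 15000, replace = TRUE),
                 score     = runif(15000))

# Thin one day down to 15 observations, mirroring the sparse day
sparse_day <- as.Date(df$date_time) == as.Date("2020-09-10")
keep <- !sparse_day
keep[sample(which(sparse_day), 15)] <- TRUE
df <- df[keep, ]

# Smooth over the raw timestamps instead of averaging per day;
# the ribbon is the 95% confidence band of the fitted spline
p <- ggplot(df, aes(x = date_time, y = score)) +
  geom_smooth(method = "lm", formula = y ~ splines::ns(x, df = 6),
              color = "#1a476f", fill = "#90353b", alpha = 0.3) +
  scale_y_continuous(labels = function(x) paste0(100 * x, "%")) +
  labs(x = NULL, y = "Level of Happiness")
p
```

Because the curve is fitted to all observations at once, the 15 points on the sparse day carry little weight and no longer produce a spike of their own.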

Answer 1

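Without wanting to get into time-series models, you could imagine transforming the time variable using restricted cubic splines.

I needed to change some of your code so that I could avoid installing the latest versions of some packages ;-).

Note that I changed some variable names, because date is a function name and should not also be used as a variable name.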

library(chron)

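## Assumption: original_df is the tibble from the question; rename its date column to datez.
original_df <- original_df %>% rename(datez = date)

## added a numeric version of your date variable.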
actual_df <- original_df %>%
 filter(datez %in% lubridate::date("2020-09-10")) %>%
 sample_n(size = 15) %>%
 group_by(datez) %>%
 ungroup %>%
 bind_rows(original_df %>% filter(!datez %in% lubridate::date("2020-09-10"))) %>%
 mutate(num_date = as.numeric(datez))

## How many knots across the dates do you want?
number_of_knots = 15

## This is to make sure that visreg is passed the actual knot locations! rms::rcs does not store them in the model fits.
knots <- paste0("c(", paste0(attr(rms::rcs(actual_df$num_date, number_of_knots), "parms"), collapse = ", "), ")") 

## We can construct the formula early.
formula <- as.formula(paste("score ~ rms::rcs(num_date,", knots,")"))

## Fit the model as a Gaussian glm and pass it to visreg for its prediction function. This gives predicted means and a 95% CI for each mean. Then convert the numeric dates back to real dates.
glm_rcs <- glm(data = actual_df, formula = formula, family = "gaussian") %>% visreg::visreg(plot = F) %>% .$fit %>%
 mutate(date_date = chron::as.chron(num_date) %>% as.POSIXct())

## plot it!
ggplot(data = glm_rcs, aes(date_date,
                           y = visregFit)) + 
 geom_ribbon(aes(ymin = visregLwr, ymax = visregUpr), alpha = .5) +
 geom_line()
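The knot-handling step above can be looked at in isolation. This is a small sketch with a hypothetical numeric vector `x` standing in for num_date and 5 knots instead of 15: rcs() chooses the knot locations and exposes them in the "parms" attribute, and pasting them into the formula text means the fitted model carries the literal knot values.

```r
# Hypothetical numeric predictor standing in for num_date
x <- seq(18506, 18519, length.out = 500)

# rcs() keeps the knot locations it chose in the "parms" attribute
knot_locations <- attr(rms::rcs(x, 5), "parms")

# Spell the knots out literally inside the formula text
knot_string <- paste0("c(", paste0(knot_locations, collapse = ", "), ")")
f <- as.formula(paste("score ~ rms::rcs(num_date,", knot_string, ")"))
f
```

Printing `f` shows the knots as plain numbers, which is what lets a prediction function rebuild the same spline basis on new data.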

Edit: Your data are collected by day, but you can add jitter to the dates so that they are spread out within each day.

actual_df <- original_df %>%
 filter(datez %in% lubridate::date("2020-09-10")) %>%
 sample_n(size = 15) %>%
 group_by(datez) %>%
 ungroup %>%
 bind_rows(original_df %>% filter(!datez %in% lubridate::date("2020-09-10"))) %>%
 mutate(num_date = as.numeric(datez))  %>%
## Here we add random noise (uniform -.5 to .5) to each numeric date.
 mutate(jittered_date = num_date + runif(n(), -.5, .5))

## You can lower this number to increase smoothing.
number_of_knots = 15

knots <- paste0("c(", paste0(attr(rms::rcs(actual_df$jittered_date, number_of_knots), "parms"), collapse = ", "), ")")

formula <- as.formula(paste("score ~ rms::rcs(jittered_date,", knots,")"))

glm_rcs <- glm(data = actual_df, formula = formula, family = "gaussian") %>% visreg::visreg(plot = F) %>% .$fit %>%
 mutate(date_date = chron::as.chron(jittered_date) %>% as.POSIXct())

ggplot(data = glm_rcs, aes(date_date,
                           y = visregFit)) +
 geom_ribbon(aes(ymin = visregLwr, ymax = visregUpr), alpha = .5) +
 geom_line()

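Edit 2:

If you have a date-time vector rather than a simple day, the jitter is not necessary. In your original code that creates the fake data, you use lubridate::date(), which converts your POSIX date-time vector into a plain date! You can avoid this as follows: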

original_df <- tibble(datez = seq(as.POSIXct('2020-09-01', tz = "UTC"), as.POSIXct('2020-09-15', tz = "UTC"), by="1 mins") %>%
 sample(15000, replace = T)) %>%
 mutate(datez_day = lubridate::date(datez)) %>%
 add_column(score = runif(15000))

actual_df <- original_df %>%
 filter(datez_day %in% lubridate::date("2020-09-10")) %>%
 sample_n(size = 15) %>%
 bind_rows(original_df %>% filter(!datez_day %in% lubridate::date("2020-09-10"))) %>%
 mutate(num_date = as.numeric(datez))

Now you have datez_day, which is the day value; datez, which is a date-time; and num_date, which is the numeric representation of the date-time.
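The difference between the two numeric representations can be checked on a single timestamp. This sketch uses base R's as.Date() as a stand-in for lubridate::date() so that it runs without extra packages; the timestamps are arbitrary examples.

```r
# One timestamp, stored two ways
stamp <- as.POSIXct("2020-09-10 13:24:00", tz = "UTC")
day   <- as.Date(stamp)   # drops the time of day

as.numeric(stamp)  # seconds since 1970-01-01 00:00:00 UTC
as.numeric(day)    # whole days since 1970-01-01

# A later timestamp on the same day collapses to the same numeric date,
# but keeps a distinct value as a numeric date-time
later <- as.POSIXct("2020-09-10 23:59:00", tz = "UTC")
same_day_number   <- as.numeric(as.Date(later)) == as.numeric(day)
same_stamp_number <- as.numeric(later) == as.numeric(stamp)
same_day_number    # TRUE
same_stamp_number  # FALSE
```

This is why a spline on the numeric date sees only 14 distinct x values, while a spline on the numeric date-time sees the observations spread across each day.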

From there you can model directly on num_date without adding any jitter.

number_of_knots = 20

knots <- paste0("c(", paste0(attr(rms::rcs(actual_df$num_date, number_of_knots), "parms"), collapse = ", "), ")")

formula <- as.formula(paste("score ~ rms::rcs(num_date,", knots,")"))

glm_rcs <- glm(data = actual_df, formula = formula, family = "gaussian") %>% 
        visreg::visreg(plot = F) %>% 
        .$fit %>% 
        as_tibble() %>%
   ## Translate the num_date back into a datetime object so it is correct in the figures!
        mutate(date_date = as.POSIXct(round(num_date), origin = "1970-01-01", tz = "UTC"))

ggplot(data = glm_rcs, aes(date_date,
                           y = visregFit)) +
 geom_ribbon(aes(ymin = visregLwr, ymax = visregUpr), alpha = .5) +
 geom_line()
