簡體   English   中英

為什么我會根據在 R 中應用 group_by() 和 distinct() 的時間得到不同的頻率?

[英]Why do I get different frequencies depending of the time I apply group_by() and distinct() in R?

我對 R 和 tidyverse 很陌生,我無法理解以下內容:

為什么根據我的group_by()distinct()我的數據何時得到不同的頻率?

不同的用戶頻率取決於何時應用 distict 和 group_by

output_df_1 <- input_df %>%
  mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
  select(created_at, author_id) %>%
  arrange(created_at) %>%
  distinct(author_id, .keep_all = T) %>%
  group_by(created_at) %>%
  count(created_at)

output_df_2 <- input_df %>%
  mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
  select(created_at, author_id) %>%
  arrange(created_at) %>%
  group_by(created_at) %>%
  distinct(author_id, .keep_all = T) %>%
  count(created_at)

full_join(output_df_1 , output_df_2 , by = "created_at") %>%
  rename(output_df_1 = n.x,
         output_df_2 = n.y) %>%
  melt(id = "created_at") %>%
  ggplot()+
  geom_line(aes(x=created_at, y=value, colour=variable),
            linetype = "solid",
            size = 0.75) +
  scale_colour_manual(values=c("#005293","#E37222"))

語境

input_df 是一個 dataframe,包含對帶有時間戳和 author_id 的推文的觀察。 我想生成一個 plot,變量 1 是每小時的推文(這沒有問題),變量 2 是每小時的獨立用戶。 我不確定上面 plot 中的兩行中的哪一行正確地可視化了每小時的不同用戶。

  1. 這是因為在第一個代碼中,您在group_bycount之前使用了distinct

  2. 此外,它是group_by的使用。 也自動count組: countgroup_by(cyl) %>% summarise(freq=n())相同。

這是一個例子:

mtcars %>% 
  distinct(am, .keep_all=TRUE) %>%
  count(cyl)

mtcars %>% 
  distinct(am, .keep_all=TRUE) %>% 
  count(cyl)

給出:

> mtcars %>% 
+   distinct(am, .keep_all=TRUE) %>%
+   count(cyl)
  cyl n
1   6 2
> mtcars %>% 
+   distinct(am, .keep_all=TRUE) %>% 
+   count(cyl)
  cyl n
1   6 2

如果您更改distinct的順序:

mtcars %>% 
  distinct(am, .keep_all=TRUE) %>% 
  count(cyl)

mtcars %>% 
  count(cyl) %>% 
  distinct(am, .keep_all=TRUE)

你得到:

 mtcars %>% 
+   distinct(am, .keep_all=TRUE) %>% 
+   count(cyl)
  cyl n
1   6 2
> 
> mtcars %>% 
+   count(cyl) %>% 
+   distinct(am, .keep_all=TRUE)
Error: `distinct()` must use existing variables.
x `am` not found in `.data`.

在您的示例中,此代碼應為df1df2提供相同的結果:

output_df_1 <- input_df %>%
  mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
  select(created_at, author_id) %>%
  arrange(created_at) %>%
  distinct(author_id, .keep_all = T) %>%
  count(created_at)



output_df_2 <- input_df %>%
  mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
  select(created_at, author_id) %>%
  arrange(created_at) %>%
  distinct(author_id, .keep_all = T) %>%
  count(created_at)

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM