Why do I get different frequencies depending on when I apply group_by() and distinct() in R?

Question

I am quite new to R and the tidyverse, and I can't wrap my head around the following:

Why do I get different frequencies depending on when I apply group_by() and distinct() to my data?

output_df_1 <- input_df %>%
  mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
  select(created_at, author_id) %>%
  arrange(created_at) %>%
  distinct(author_id, .keep_all = T) %>%
  group_by(created_at) %>%
  count(created_at)

output_df_2 <- input_df %>%
  mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
  select(created_at, author_id) %>%
  arrange(created_at) %>%
  group_by(created_at) %>%
  distinct(author_id, .keep_all = T) %>%
  count(created_at)

full_join(output_df_1 , output_df_2 , by = "created_at") %>%
  rename(output_df_1 = n.x,
         output_df_2 = n.y) %>%
  melt(id = "created_at") %>%
  ggplot()+
  geom_line(aes(x=created_at, y=value, colour=variable),
            linetype = "solid",
            size = 0.75) +
  scale_colour_manual(values=c("#005293","#E37222"))

Context

input_df is a dataframe containing observations of tweets with timestamps and author_ids. I would like to produce a plot with variable1 being tweets per hour (this poses no problem) and variable2 being distinct users per hour. I am not sure which of the two lines in the above plot correctly visualizes the distinct users per hour.

Answer 1

It is because in the first code, you use distinct before group_by and count, so duplicate author_id values are dropped across the whole data set rather than within each hour.
Moreover, the explicit group_by before count is redundant. count already groups by the variables it is given: count(cyl) is the same as group_by(cyl) %>% summarise(n = n()).
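
To make that equivalence concrete, here is a minimal sketch on the built-in mtcars data. The column name freq and the object names via_count and via_summarise are only illustrative.

```r
library(dplyr)

# count() groups by the given column, tallies the rows, and returns ungrouped data
via_count <- mtcars %>%
  count(cyl, name = "freq")

# The explicit spelling: group first, then summarise with n()
via_summarise <- mtcars %>%
  group_by(cyl) %>%
  summarise(freq = n())

# Compare the two results as plain data frames
# (summarise() returns a tibble, count() on a data.frame returns a data.frame)
all.equal(as.data.frame(via_count), as.data.frame(via_summarise))
```

Because count() already does the grouping, adding group_by() on the same column right before it does not change the counts.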

Here is an example:

mtcars %>% 
  distinct(am, .keep_all=TRUE) %>%
  count(cyl)

mtcars %>% 
  distinct(am, .keep_all=TRUE) %>% 
  count(cyl)

gives:

> mtcars %>% 
+   distinct(am, .keep_all=TRUE) %>%
+   count(cyl)
  cyl n
1   6 2
> mtcars %>% 
+   distinct(am, .keep_all=TRUE) %>% 
+   count(cyl)
  cyl n
1   6 2

If you change the order of distinct:

mtcars %>% 
  distinct(am, .keep_all=TRUE) %>% 
  count(cyl)

mtcars %>% 
  count(cyl) %>% 
  distinct(am, .keep_all=TRUE)

you get:

> mtcars %>% 
+   distinct(am, .keep_all=TRUE) %>% 
+   count(cyl)
  cyl n
1   6 2
> 
> mtcars %>% 
+   count(cyl) %>% 
+   distinct(am, .keep_all=TRUE)
Error: `distinct()` must use existing variables.
x `am` not found in `.data`.

In your example, this code should give the same result for output_df_1 and output_df_2:

output_df_1 <- input_df %>%
  mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
  select(created_at, author_id) %>%
  arrange(created_at) %>%
  distinct(author_id, .keep_all = T) %>%
  count(created_at)

output_df_2 <- input_df %>%
  mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
  select(created_at, author_id) %>%
  arrange(created_at) %>%
  distinct(author_id, .keep_all = T) %>%
  count(created_at)
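
To see why the position of distinct relative to group_by matters, here is a small sketch on made-up data. The tibble toy and its values are hypothetical and not taken from input_df; created_at stands in for the floored hour.

```r
library(dplyr)

# Hypothetical toy data: author "a" tweets in both hours
toy <- tibble(
  created_at = c(1, 1, 1, 2, 2),
  author_id  = c("a", "a", "b", "a", "c")
)

# distinct() before grouping: each author is kept once overall,
# at the first row where they appear, so this counts first-time authors per hour
first_seen <- toy %>%
  distinct(author_id, .keep_all = TRUE) %>%
  count(created_at)

# distinct() after group_by(): duplicates are removed within each hour,
# so this counts distinct authors per hour
per_hour <- toy %>%
  group_by(created_at) %>%
  distinct(author_id, .keep_all = TRUE) %>%
  count(created_at)

# A more direct way to get the per-hour figure
per_hour_direct <- toy %>%
  group_by(created_at) %>%
  summarise(n = n_distinct(author_id))

first_seen$n   # hour 1: 2, hour 2: 1
per_hour$n     # hour 1: 2, hour 2: 2
```

On this toy data, author "a" is counted in hour 2 only when distinct runs on grouped data. If the goal is distinct users per hour, the grouped form (or n_distinct) is likely the one to plot.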


Answer 1: accepted, 2021-08-23 11:25:05