
Why do I get different frequencies depending on when I apply group_by() and distinct() in R?

I am quite new to R and the tidyverse, and I can't wrap my head around the following:

Why do I get different frequencies depending on when I apply group_by() and distinct() to my data?

[Plot: distinct-user frequencies depending on when distinct() and group_by() are applied]

output_df_1 <- input_df %>%
  mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
  select(created_at, author_id) %>%
  arrange(created_at) %>%
  distinct(author_id, .keep_all = T) %>%
  group_by(created_at) %>%
  count(created_at)

output_df_2 <- input_df %>%
  mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
  select(created_at, author_id) %>%
  arrange(created_at) %>%
  group_by(created_at) %>%
  distinct(author_id, .keep_all = T) %>%
  count(created_at)

full_join(output_df_1 , output_df_2 , by = "created_at") %>%
  rename(output_df_1 = n.x,
         output_df_2 = n.y) %>%
  melt(id = "created_at") %>%
  ggplot()+
  geom_line(aes(x=created_at, y=value, colour=variable),
            linetype = "solid",
            size = 0.75) +
  scale_colour_manual(values=c("#005293","#E37222"))

Context

input_df is a data frame containing observations of tweets with timestamps and author_ids. I would like to produce a plot with variable1 being tweets per hour (this poses no problem) and variable2 being distinct users per hour. I am not sure which of the two lines in the above plot correctly visualizes the distinct users per hour.

  1. It is because in the first code, you use distinct before group_by and count, so duplicate author_ids are dropped across the whole data set rather than within each hour.

  2. Moreover, it comes down to the use of group_by. count also groups automatically: count(cyl) is roughly the same as group_by(cyl) %>% summarise(n = n()).

Here is an example:

mtcars %>% 
  distinct(am, .keep_all=TRUE) %>%
  count(cyl)

mtcars %>% 
  distinct(am, .keep_all=TRUE) %>% 
  count(cyl)

gives:

> mtcars %>% 
+   distinct(am, .keep_all=TRUE) %>%
+   count(cyl)
  cyl n
1   6 2
> mtcars %>% 
+   distinct(am, .keep_all=TRUE) %>% 
+   count(cyl)
  cyl n
1   6 2

If you change the order of distinct:

mtcars %>% 
  distinct(am, .keep_all=TRUE) %>% 
  count(cyl)

mtcars %>% 
  count(cyl) %>% 
  distinct(am, .keep_all=TRUE)

you get:

> mtcars %>% 
+   distinct(am, .keep_all=TRUE) %>% 
+   count(cyl)
  cyl n
1   6 2
> 
> mtcars %>% 
+   count(cyl) %>% 
+   distinct(am, .keep_all=TRUE)
Error: `distinct()` must use existing variables.
x `am` not found in `.data`.
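To see how the position of distinct() relative to group_by() changes the counts, here is a minimal sketch on made-up data. The tweets data frame and its values are hypothetical and only stand in for input_df; the timestamps are already floored to the hour to keep the example short.

```r
library(dplyr)

# Hypothetical toy data: three tweets by two authors across two hours.
tweets <- data.frame(
  created_at = as.POSIXct(
    c("2021-01-01 10:00:00", "2021-01-01 10:00:00", "2021-01-01 11:00:00"),
    tz = "UTC"
  ),
  author_id = c("a", "b", "a")
)

# distinct() on ungrouped data: each author is kept once for the whole
# data set, so author "a" only counts in the first hour they appear.
first_seen <- tweets %>%
  distinct(author_id, .keep_all = TRUE) %>%
  count(created_at)

# distinct() on grouped data: duplicates are dropped within each hour,
# so author "a" counts in both hours.
per_hour <- tweets %>%
  group_by(created_at) %>%
  distinct(author_id, .keep_all = TRUE) %>%
  count(created_at)

first_seen$n  # 2      (only the 10:00 hour appears)
per_hour$n    # 2 1    (10:00 and 11:00)
```

In this sketch the first pipeline behaves like "users seen for the first time in each hour", while the second behaves like "distinct users within each hour".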

In your example, this code should give the same result for output_df_1 and output_df_2:

output_df_1 <- input_df %>%
  mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
  select(created_at, author_id) %>%
  arrange(created_at) %>%
  distinct(author_id, .keep_all = T) %>%
  count(created_at)



output_df_2 <- input_df %>%
  mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
  select(created_at, author_id) %>%
  arrange(created_at) %>%
  distinct(author_id, .keep_all = T) %>%
  count(created_at)
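If the goal is the number of distinct users within each hour, as described in the question's Context, one possible alternative is to count unique author_ids per group with n_distinct(). This is only a sketch under that assumption; the tweets data frame is hypothetical and stands in for input_df.

```r
library(dplyr)

# Hypothetical toy data: author "a" tweets twice in the 10:00 hour
# and once in the 11:00 hour; author "b" tweets once at 10:40.
tweets <- data.frame(
  created_at = as.POSIXct(
    c("2021-01-01 10:05:00", "2021-01-01 10:40:00",
      "2021-01-01 10:50:00", "2021-01-01 11:15:00"),
    tz = "UTC"
  ),
  author_id = c("a", "b", "a", "a")
)

users_per_hour <- tweets %>%
  # Round each timestamp down to the start of its hour
  mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
  group_by(created_at) %>%
  # Count unique authors within each hour
  summarise(n = n_distinct(author_id))

users_per_hour$n  # 2 1
```

Here an author who is active in several hours is counted once in every one of those hours, which differs from applying distinct(author_id) to the ungrouped data.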
