Why do I get different frequencies depending on when I apply group_by() and distinct() in R?
I am quite new to R and the tidyverse, and I can't wrap my head around the following: why do I get different frequencies depending on when I apply group_by() and distinct() to my data?
output_df_1 <- input_df %>%
mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
select(created_at, author_id) %>%
arrange(created_at) %>%
distinct(author_id, .keep_all = T) %>%
group_by(created_at) %>%
count(created_at)
output_df_2 <- input_df %>%
mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
select(created_at, author_id) %>%
arrange(created_at) %>%
group_by(created_at) %>%
distinct(author_id, .keep_all = T) %>%
count(created_at)
full_join(output_df_1 , output_df_2 , by = "created_at") %>%
rename(output_df_1 = n.x,
output_df_2 = n.y) %>%
melt(id = "created_at") %>%
ggplot()+
geom_line(aes(x=created_at, y=value, colour=variable),
linetype = "solid",
size = 0.75) +
scale_colour_manual(values=c("#005293","#E37222"))
Context
input_df is a data frame containing observations of tweets with timestamps and author_ids. I would like to produce a plot with variable 1 being tweets per hour (this poses no problem) and variable 2 being distinct users per hour.
I am not sure which of the two lines in the above plot correctly visualizes the distinct users per hour.
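This is because in the first pipeline you call distinct() before group_by() and count(), while in the second you call it after group_by(). On ungrouped data, distinct(author_id) keeps each author only once across the whole data set; on grouped data, it keeps each author once within each created_at group.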
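The difference can be seen on a small made-up data set. This is only a sketch: the `tweets` tibble and its values are hypothetical, and plain integers stand in for the floored hourly timestamps.

```r
library(dplyr)

# Hypothetical toy data: three tweets in hour 1, two in hour 2.
# Author "a" tweets in both hours.
tweets <- tibble(
  created_at = c(1, 1, 1, 2, 2),
  author_id  = c("a", "a", "b", "a", "c")
)

# distinct() on ungrouped data: each author is kept once overall,
# in the first hour in which they appear.
first_seen <- tweets %>%
  distinct(author_id, .keep_all = TRUE) %>%
  count(created_at)

# distinct() on grouped data: each author is kept once per hour.
per_hour <- tweets %>%
  group_by(created_at) %>%
  distinct(author_id, .keep_all = TRUE) %>%
  count(created_at) %>%
  ungroup()

first_seen$n  # 2 1
per_hour$n    # 2 2
```

Author "a" is counted in hour 2 only by the grouped version, which is why the two lines in the plot diverge.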
Moreover, the group_by() directly before count() is redundant. count() groups automatically: count(cyl) is equivalent to group_by(cyl) %>% summarise(n = n()).
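As a quick check of that equivalence, the following sketch compares the three variants on the built-in mtcars data. The object names are illustrative only.

```r
library(dplyr)

# count() alone
with_count <- mtcars %>% count(cyl)

# explicit group_by() followed by summarise()
with_summarise <- mtcars %>%
  group_by(cyl) %>%
  summarise(n = n())

# group_by() followed by count(): the extra grouping changes nothing
with_both <- mtcars %>%
  group_by(cyl) %>%
  count(cyl) %>%
  ungroup()

# All three contain the same counts per cylinder class
identical(with_count$n, with_summarise$n)
identical(with_count$n, with_both$n)
```

Both comparisons should evaluate to TRUE, since each variant counts the rows per value of cyl.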
Here is an example:
mtcars %>%
distinct(am, .keep_all=TRUE) %>%
count(cyl)
gives:
> mtcars %>%
+ distinct(am, .keep_all=TRUE) %>%
+ count(cyl)
cyl n
1 6 2
If you change the order of distinct():
mtcars %>%
distinct(am, .keep_all=TRUE) %>%
count(cyl)
mtcars %>%
count(cyl) %>%
distinct(am, .keep_all=TRUE)
you get:
> mtcars %>%
+ distinct(am, .keep_all=TRUE) %>%
+ count(cyl)
cyl n
1 6 2
>
> mtcars %>%
+ count(cyl) %>%
+ distinct(am, .keep_all=TRUE)
Error: `distinct()` must use existing variables.
x `am` not found in `.data`.
The error occurs because count(cyl) returns only the columns cyl and n, so am is no longer available to distinct().
In your example, once the group_by() calls are removed, the two pipelines are identical and therefore give the same result for output_df_1 and output_df_2:
output_df_1 <- input_df %>%
mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
select(created_at, author_id) %>%
arrange(created_at) %>%
distinct(author_id, .keep_all = T) %>%
count(created_at)
output_df_2 <- input_df %>%
mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
select(created_at, author_id) %>%
arrange(created_at) %>%
distinct(author_id, .keep_all = T) %>%
count(created_at)
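Note that with distinct(author_id) applied to ungrouped data, each author is counted only in the first hour in which they appear. If the goal is distinct users per hour, as described in the question, one possible alternative is to count unique author_ids within each hour using n_distinct(). The sketch below is an assumption about what is wanted, not part of the original answer; the `tweets` tibble and its timestamps are made up.

```r
library(dplyr)

# Hypothetical stand-in for input_df: author "a" tweets in both hours.
tweets <- tibble(
  created_at = as.POSIXct(
    c("2021-01-01 10:05:00", "2021-01-01 10:40:00", "2021-01-01 10:50:00",
      "2021-01-01 11:10:00", "2021-01-01 11:20:00"),
    tz = "UTC"
  ),
  author_id = c("a", "a", "b", "a", "c")
)

users_per_hour <- tweets %>%
  # round each timestamp down to the start of its hour
  mutate(created_at = lubridate::floor_date(created_at, unit = "hour")) %>%
  group_by(created_at) %>%
  # number of unique authors within each hour
  summarise(n = n_distinct(author_id), .groups = "drop")

users_per_hour$n  # 2 2
```

On this toy data, author "a" is counted in both the 10:00 and the 11:00 bucket, which matches the behaviour of the second pipeline in the question.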