![](/img/trans.png)
[英]Why do I get an “Object not found” error when using group_by and summarise() in r?
[英]Why do I get different frequencies depending of the time I apply group_by() and distinct() in R?
我對 R 和 tidyverse 很陌生,我無法理解以下內容:
為什么根據我的group_by()
和distinct()
我的數據何時得到不同的頻率?
output_df_1 <- input_df %>%
mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
select(created_at, author_id) %>%
arrange(created_at) %>%
distinct(author_id, .keep_all = T) %>%
group_by(created_at) %>%
count(created_at)
output_df_2 <- input_df %>%
mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
select(created_at, author_id) %>%
arrange(created_at) %>%
group_by(created_at) %>%
distinct(author_id, .keep_all = T) %>%
count(created_at)
full_join(output_df_1 , output_df_2 , by = "created_at") %>%
rename(output_df_1 = n.x,
output_df_2 = n.y) %>%
melt(id = "created_at") %>%
ggplot()+
geom_line(aes(x=created_at, y=value, colour=variable),
linetype = "solid",
size = 0.75) +
scale_colour_manual(values=c("#005293","#E37222"))
語境
input_df 是一個 dataframe,包含對帶有時間戳和 author_id 的推文的觀察。 我想生成一個 plot,變量 1 是每小時的推文(這沒有問題),變量 2 是每小時的獨立用戶。 我不確定上面 plot 中的兩行中的哪一行正確地可視化了每小時的不同用戶。
這是因為在第一個代碼中,您在group_by
和count
之前使用了distinct
。
此外,它是group_by
的使用。 也自動count
組: count
與group_by(cyl) %>% summarise(freq=n())
相同。
這是一個例子:
mtcars %>%
distinct(am, .keep_all=TRUE) %>%
count(cyl)
mtcars %>%
distinct(am, .keep_all=TRUE) %>%
count(cyl)
給出:
> mtcars %>%
+ distinct(am, .keep_all=TRUE) %>%
+ count(cyl)
cyl n
1 6 2
> mtcars %>%
+ distinct(am, .keep_all=TRUE) %>%
+ count(cyl)
cyl n
1 6 2
如果您更改distinct
的順序:
mtcars %>%
distinct(am, .keep_all=TRUE) %>%
count(cyl)
mtcars %>%
count(cyl) %>%
distinct(am, .keep_all=TRUE)
你得到:
mtcars %>%
+ distinct(am, .keep_all=TRUE) %>%
+ count(cyl)
cyl n
1 6 2
>
> mtcars %>%
+ count(cyl) %>%
+ distinct(am, .keep_all=TRUE)
Error: `distinct()` must use existing variables.
x `am` not found in `.data`.
在您的示例中,此代碼應為df1
和df2
提供相同的結果:
output_df_1 <- input_df %>%
mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
select(created_at, author_id) %>%
arrange(created_at) %>%
distinct(author_id, .keep_all = T) %>%
count(created_at)
output_df_2 <- input_df %>%
mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
select(created_at, author_id) %>%
arrange(created_at) %>%
distinct(author_id, .keep_all = T) %>%
count(created_at)
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.