Count per year with only start and end year data
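I want to create a line chart in ggplot2 from a data set of 350 breweries. For each year, I want to count how many breweries were active. I only have the start and end year of each brewery's activity. A tidyverse answer is preferred.
begin_datum_jaar is the year the brewery started. eind_datum_jaar is the year the brewery ended.
Example data frame: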
library(tidyverse)
# A tibble: 4 x 3
brouwerijnaam begin_datum_jaar eind_datum_jaar
<chr> <int> <int>
1 Brand 1340 2019
2 Heineken 1592 2019
3 Grolsche 1615 2019
4 Bavaria 1719 2010
dput:
df <- structure(list(brouwerijnaam = c("Brand", "Heineken", "Grolsche",
"Bavaria"), begin_datum_jaar = c(1340L, 1592L, 1615L, 1719L),
eind_datum_jaar = c(2019L, 2019L, 2019L, 2010L)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -4L))
Desired output, where etc. is a placeholder:
# A tibble: 13 x 2
year n
<chr> <dbl>
1 1340 1
2 1341 1
3 1342 1
4 1343 1
5 etc. 1
6 1592 2
7 1593 2
8 etc. 2
9 1625 3
10 1626 3
11 1627 3
12 1628 3
13 etc. 3
You can try:
library(tidyverse)
df %>%
rowwise %>%
do(data.frame(brouwerij = .$brouwerijnaam,
Year = seq(.$begin_datum_jaar, .$eind_datum_jaar, by = 1))) %>%
count(Year, name = "Active breweries") %>%
ggplot(aes(x = Year, y = `Active breweries`)) +
geom_line() +
theme_minimal()
Or use expand for the first part:
df %>%
group_by(brouwerijnaam) %>%
expand(Year = begin_datum_jaar:eind_datum_jaar) %>%
ungroup() %>%
count(Year, name = "Active breweries")
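Note, however, that the rowwise, do, or expand step is resource-intensive and may take a long time on larger data. If that happens, I would rather use data.table to expand the data frame and then continue as follows: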
library(data.table)
library(tidyverse)
df <- setDT(df)[, .(Year = seq(begin_datum_jaar, eind_datum_jaar, by = 1)), by = brouwerijnaam]
df %>%
count(Year, name = "Active breweries") %>%
ggplot(aes(x = Year, y = `Active breweries`)) +
geom_line() +
theme_minimal()
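The code above produces the plot directly. If you want to save the counts to a data frame first (and do the ggplot2 work afterwards), this is the main part (I use data.table for the expansion because, in my experience, it is much faster):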
library(data.table)
library(tidyverse)
df <- setDT(df)[
, .(Year = seq(begin_datum_jaar, eind_datum_jaar, by = 1)),
by = brouwerijnaam] %>%
count(Year, name = "Active breweries")
Output:
# A tibble: 680 x 2
Year `Active breweries`
<dbl> <int>
1 1340 1
2 1341 1
3 1342 1
4 1343 1
5 1344 1
6 1345 1
7 1346 1
8 1347 1
9 1348 1
10 1349 1
# ... with 670 more rows
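If expanding every brewery into one row per year becomes too expensive, a different technique (not the one used in the answer above) is to avoid the expansion altogether and take a running sum of openings and closings. The following is only a minimal base R sketch under the assumption that a brewery counts as active in both its first and its last year; the names years, opened, closed, and active are illustrative.

```r
# Example data from the question
df <- data.frame(
  brouwerijnaam    = c("Brand", "Heineken", "Grolsche", "Bavaria"),
  begin_datum_jaar = c(1340L, 1592L, 1615L, 1719L),
  eind_datum_jaar  = c(2019L, 2019L, 2019L, 2010L)
)

# Every year between the earliest start and the latest end
years <- seq(min(df$begin_datum_jaar), max(df$eind_datum_jaar))

# Number of breweries opening in each year (index 1 = first year)
opened <- tabulate(df$begin_datum_jaar - years[1] + 1L, nbins = length(years))

# Number of breweries that stop counting in each year, i.e. the year AFTER
# their last active year; indices beyond the range are ignored by tabulate()
closed <- tabulate(df$eind_datum_jaar - years[1] + 2L, nbins = length(years))

# Running total of openings minus closings = active breweries per year
active <- data.frame(year = years, n = cumsum(opened - closed))
```

This touches each brewery twice instead of once per active year, so the work grows with the number of breweries plus the number of years rather than with their product. The resulting active data frame can be passed to ggplot(aes(x = year, y = n)) + geom_line() as in the answers above.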
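We can use map2 to build the sequence from the start year to the end year for each pair of corresponding elements, unnest the resulting list column, and use count to get the frequency of each year: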
library(tidyverse)
df %>%
transmute(year = map2(begin_datum_jaar, eind_datum_jaar, `:`)) %>%
unnest(year) %>%
count(year)
# A tibble: 680 x 2
# year n
# <int> <int>
# 1 1340 1
# 2 1341 1
# 3 1342 1
# 4 1343 1
# 5 1344 1
# 6 1345 1
# 7 1346 1
# 8 1347 1
# 9 1348 1
#10 1349 1
# … with 670 more rows
Or, in base R, use Map:
table(unlist(do.call(Map, c(f = `:`, df[-1]))))
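The table() call returns a table object whose names are the years stored as character strings. As a small sketch of how that result might be turned into a data frame with integer years for plotting, the following uses the example data from the question; the names counts and active are illustrative, and Map is called directly on the two columns rather than through do.call.

```r
# Example data from the question
df <- data.frame(
  brouwerijnaam    = c("Brand", "Heineken", "Grolsche", "Bavaria"),
  begin_datum_jaar = c(1340L, 1592L, 1615L, 1719L),
  eind_datum_jaar  = c(2019L, 2019L, 2019L, 2010L)
)

# One integer sequence per brewery, flattened and tabulated by year
counts <- table(unlist(Map(`:`, df$begin_datum_jaar, df$eind_datum_jaar)))

# The table names are character years; convert them back to integers
active <- data.frame(
  year = as.integer(names(counts)),
  n    = as.integer(counts)
)
```

Here active has one row per year that occurs in at least one brewery's range, so years in which no brewery was active would be missing rather than reported as 0.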
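# Alternative: look up the number of active breweries for each year in a chosen range
df1 <- data.frame(year = 1000:2020) # enter the range of years of interest
df1 %>%
rowwise() %>%
# inclusive comparisons, so a brewery also counts in its first and last year
mutate(cnt = nrow(df %>%
filter(begin_datum_jaar <= year & eind_datum_jaar >= year)
)
)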
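The same per-year lookup idea can also be sketched in base R without rowwise. This is only an illustrative variant, assuming inclusive comparisons so that a brewery counts in its first and last year (which matches the desired output in the question, where 1340 has a count of 1); the names years and cnt are illustrative.

```r
# Example data from the question
df <- data.frame(
  brouwerijnaam    = c("Brand", "Heineken", "Grolsche", "Bavaria"),
  begin_datum_jaar = c(1340L, 1592L, 1615L, 1719L),
  eind_datum_jaar  = c(2019L, 2019L, 2019L, 2010L)
)

# Range of years to report on, including years with no active brewery
years <- 1000:2020

# For each year, count the breweries whose active range contains it
cnt <- vapply(
  years,
  function(y) sum(df$begin_datum_jaar <= y & df$eind_datum_jaar >= y),
  integer(1)
)

df1 <- data.frame(year = years, cnt = cnt)
```

Unlike the expansion-based answers, this keeps years with zero active breweries in the result, which may or may not be what the plot should show.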