
Counting observations in each year from a date range in dplyr

Let's say I have a data.frame consisting of industry type and starting and ending dates (e.g., for an employee).

mydf <- data.frame(industry = c("Government", "Education", "Military", "Private Sector", "Government", "Private Sector"),
                   start_date = c("2014-01-01", "2016-02-01", "2012-11-01", "2013-03-01", "2012-12-01", "2011-12-01"),
                   end_date = c("2020-12-01", "2016-10-01", "2014-01-01", "2016-10-01", "2015-10-01", "2014-09-01"))

> mydf
        industry start_date   end_date
1     Government 2014-01-01 2020-12-01
2      Education 2016-02-01 2016-10-01
3       Military 2012-11-01 2014-01-01
4 Private Sector 2013-03-01 2016-10-01
5     Government 2012-12-01 2015-10-01
6 Private Sector 2011-12-01 2014-09-01

I'd like to create a stacked ggplot bar chart in which each unique year in the start_date column is on the x axis (e.g., 2011-2016) and the y axis represents the total number of observations (the row count) in a given industry for that year.

I'm not sure what the right way is to manipulate the data.frame to allow for this. Presumably I'd need to reshape the data to have columns for industry, year, and count. But I'm not sure how to produce the year column from a date range. Any ideas?

Convert the date columns to Date, create the 'date' sequence from 'start_date' to 'end_date' for each row with map2 (from purrr), unnest the list output, count by industry and year, and plot with geom_col.

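library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)  # provides year()
library(ggplot2)
mydf %>%
   # parse the character columns as Date
   mutate(across(c(start_date, end_date), as.Date)) %>%
   # one list element per row: every day from start_date to end_date
   transmute(industry, date = map2(start_date, end_date, seq, by = 'day')) %>%
   # expand to one row per day
   unnest(c(date)) %>%
   # with by = 'day', n is the number of days per industry and year
   count(industry, year = factor(year(date))) %>%
   ggplot(aes(x = year, y = n, fill = industry)) +
        geom_col() +
        theme_bw()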

If the plot should be separate for each 'industry':

mydf %>%
   mutate(across(c(start_date, end_date), as.Date)) %>% 
   transmute(industry, date = map2(start_date, end_date, seq, by = 'day')) %>% 
   unnest(c(date)) %>% 
   count(industry, year = factor(year(date))) %>%
   ggplot(aes(x = year, y = n, fill = industry)) + 
        geom_col() + 
        facet_wrap(~ industry) +
        theme_bw()



As @IanCampbell suggested, the by argument of seq can be 'year':

mydf %>%
   mutate(across(c(start_date, end_date), as.Date)) %>% 
   transmute(industry, date = map2(start_date, end_date, seq, by = 'year')) %>% 
   unnest(c(date)) %>% 
   count(industry, year = factor(year(date))) %>%
   ggplot(aes(x = year, y = n, fill = industry)) + 
        geom_col() + 
        facet_wrap(~ industry) +
        theme_bw()
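One caveat may be worth checking before relying on by = 'year'. As far as I can tell, seq() on dates steps in whole years from the start date, so the final calendar year is skipped whenever the end date falls earlier in the year than the start date. The base R sketch below uses row 3 of mydf (Military) to show the difference; the variable names are only illustrative.

```r
# Row 3 of mydf: Military, 2012-11-01 to 2014-01-01
start <- as.Date("2012-11-01")
end   <- as.Date("2014-01-01")

# Stepping by whole years from the start date stops at 2013-11-01,
# so 2014 never appears
by_year <- seq(start, end, by = "year")
years_from_dates <- as.integer(format(by_year, "%Y"))
print(years_from_dates)    # 2012 2013

# Sequencing over the calendar years themselves includes 2014
years_from_numbers <- seq(as.integer(format(start, "%Y")),
                          as.integer(format(end, "%Y")))
print(years_from_numbers)  # 2012 2013 2014
```

If every calendar year touched by the range should count, building the sequence from the year numbers looks like the safer choice.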

Is this what you're looking for? I would recommend using purrr::pmap to create a new data frame with one row for each year based on each row of the original data.

We can use purrr::pmap_dfr to automatically return a single data frame bound by row.

We can use the ~with(list(...), ) trick to be able to reference columns by name.
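To make the trick concrete, here is a minimal sketch on a hypothetical two-row data frame (spans and its columns are invented for illustration and are not part of the question's data). pmap passes each row's columns as named arguments, list(...) captures them, and with() lets the expression refer to them by column name.

```r
library(purrr)

# Hypothetical input, for illustration only
spans <- data.frame(id = c("a", "b"),
                    start_year = c(2012, 2014),
                    end_year = c(2014, 2015))

# For each row, build a small data frame with one row per year;
# pmap_dfr binds the per-row results into a single data frame
long <- pmap_dfr(spans, ~ with(list(...),
                               data.frame(id, year = seq(start_year, end_year))))

print(long)
```

Here "a" expands to three rows (2012 to 2014) and "b" to two (2014 and 2015), five rows in total.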

Then we can use dplyr::count to count by combinations of columns. After that, plotting is straightforward.

library(dplyr)
library(purrr)
library(lubridate)
library(ggplot2)
mydf %>%
  mutate(across(c(start_date, end_date), as.Date),
         start_year = year(start_date),
         end_year = year(end_date)) %>%
  # one data frame per input row, with one row per calendar year covered
  pmap_dfr(~with(list(...), data.frame(industry,
                                       year = seq(start_year, end_year)))) %>%
  count(year, industry) %>%
  ggplot(aes(x = year, y = n, fill = industry)) +
  geom_bar(stat = "identity")
