
R: conditional aggregate based on factor level and year

I have a dataset in R that I am trying to aggregate by column level and year. It looks like this:

    City  State   Year   Status      Year_repealed   PolicyNo
    Pitt   PA     2001   InForce                        6
    Phil.  PA     2001   Repealed        2004           9
    Pitt   PA     2002   InForce                        7
    Pitt   PA     2005   InForce                        2

What I would like to do is, for each Year, aggregate PolicyNo by state, taking into account the year in which a policy was repealed. The result I would expect is:

    Year    State PolicyNo
    2001     PA     15  
    2002     PA     22
    2003     PA     22
    2004     PA     12 
    2005     PA     14

I am not sure how to go about splitting and aggregating the data conditional on the repeal date, and was wondering whether there is an easy way to achieve this in R.

It may help to break this up into two distinct problems.

  1. Get a table that shows the change in PolicyNo in every city-state-year.
  2. Summarize that table to show the PolicyNo in each state-year.
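As a rough illustration of why the split helps: once the yearly changes are known, the levels follow from a running total. The sketch below uses the yearly changes that the sample data produces (enactments positive, the 2004 repeal negative); the variable names are made up for this example.

```r
# Yearly net change in PolicyNo for PA, taken from the sample data:
# 2001: 6 + 9, 2002: 7, 2003: nothing, 2004: repeal of 9, 2005: 2
annual_change <- c(15, 7, 0, -9, 2)
names(annual_change) <- 2001:2005

# A running total turns changes into levels
policy_level <- cumsum(annual_change)
print(policy_level)

# Going the other way, differencing the levels recovers the changes
recovered <- c(policy_level[1], diff(policy_level))
```

The rest of the answer is about building `annual_change` from the raw table.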

To accomplish (1), we add the missing years with NA PolicyNo and add repeals as negative PolicyNo observations.

library(dplyr)

df = structure(list(City = c("Pitt", "Phil.", "Pitt", "Pitt"), State = c("PA", "PA", "PA", "PA"), Year = c(2001L, 2001L, 2002L, 2005L), Status = c("InForce", "Repealed", "InForce", "InForce"), Year_repealed = c(NA, 2004L, NA, NA), PolicyNo = c(6L, 9L, 7L, 2L)), .Names = c("City", "State", "Year", "Status", "Year_repealed", "PolicyNo"), class = "data.frame", row.names = c(NA, -4L))

repeals = df %>%
  filter(!is.na(Year_repealed)) %>%
  mutate(Year = Year_repealed, PolicyNo = -1 * PolicyNo)
repeals
#    City State Year   Status Year_repealed PolicyNo
# 1 Phil.    PA 2004 Repealed          2004       -9

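# stringsAsFactors = FALSE keeps City and State as character, matching df
all_years = expand.grid(City = unique(df$City), State = unique(df$State),
                        Year = 2001:2005, stringsAsFactors = FALSE)

df = bind_rows(df, repeals, all_years)
df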
#     City State Year   Status Year_repealed PolicyNo
# 1   Pitt    PA 2001  InForce            NA        6
# 2  Phil.    PA 2001 Repealed          2004        9
# 3   Pitt    PA 2002  InForce            NA        7
# 4   Pitt    PA 2005  InForce            NA        2
# 5  Phil.    PA 2004 Repealed          2004       -9
# 6   Pitt    PA 2001     <NA>            NA       NA
# 7  Phil.    PA 2001     <NA>            NA       NA
# 8   Pitt    PA 2002     <NA>            NA       NA
# 9  Phil.    PA 2002     <NA>            NA       NA
# 10  Pitt    PA 2003     <NA>            NA       NA
# 11 Phil.    PA 2003     <NA>            NA       NA
# 12  Pitt    PA 2004     <NA>            NA       NA
# 13 Phil.    PA 2004     <NA>            NA       NA
# 14  Pitt    PA 2005     <NA>            NA       NA
# 15 Phil.    PA 2005     <NA>            NA       NA

Now the table shows every city-state-year and incorporates repeals. This is a table we can summarize.

df = df %>%
  group_by(Year, State) %>%
  summarize(annual_change = sum(PolicyNo, na.rm = TRUE))
df
# Source: local data frame [5 x 3]
# Groups: Year [?]
# 
#    Year State annual_change
#   <int> <chr>         <dbl>
# 1  2001    PA            15
# 2  2002    PA             7
# 3  2003    PA             0
# 4  2004    PA            -9
# 5  2005    PA             2

That gets us the change in PolicyNo in each state-year. A cumulative sum over the changes gives us levels.

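# Accumulate within each state, in year order, so the totals
# stay correct when the data contains more than one state
df = df %>%
  ungroup() %>%
  arrange(State, Year) %>%
  group_by(State) %>%
  mutate(PolicyNo = cumsum(annual_change)) %>%
  ungroup()
df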
# # A tibble: 5 × 4
#    Year State annual_change PolicyNo
#   <int> <chr>         <dbl>    <dbl>
# 1  2001    PA            15       15
# 2  2002    PA             7       22
# 3  2003    PA             0       22
# 4  2004    PA            -9       13
# 5  2005    PA             2       15
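One way to sanity-check these totals is to compute them directly: for each year, sum the PolicyNo of every policy that has been enacted by then and has not yet been repealed. The base R sketch below does that for the sample data. The helper `in_force_total` is hypothetical, and it assumes a policy stops counting in its repeal year, which is the convention the steps above imply.

```r
# Sample data from the question, rebuilt so the block is self-contained
policies <- data.frame(
  City = c("Pitt", "Phil.", "Pitt", "Pitt"),
  State = "PA",
  Year = c(2001L, 2001L, 2002L, 2005L),
  Year_repealed = c(NA, 2004L, NA, NA),
  PolicyNo = c(6L, 9L, 7L, 2L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: total PolicyNo in force in a given year.
# A policy counts if it was enacted on or before `year` and is either
# never repealed or repealed strictly after `year`.
in_force_total <- function(d, year) {
  active <- d$Year <= year &
    (is.na(d$Year_repealed) | d$Year_repealed > year)
  sum(d$PolicyNo[active])
}

years <- 2001:2005
direct <- data.frame(
  Year = years,
  State = "PA",
  PolicyNo = vapply(years, function(y) in_force_total(policies, y), integer(1))
)
print(direct)
```

For a single state this should agree with the cumulative-sum column above. With several states, the same check would need to be run per state.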

With the data.table package you could do it as follows:

library(data.table)

# dat is the original data frame from the question
melt(setDT(dat),
     measure.vars = c(3,5),
     value.name = 'Year',
     value.factor = FALSE)[!is.na(Year)
                           ][variable == 'Year_repealed', PolicyNo := -1*PolicyNo
                             ][CJ(Year = min(Year):max(Year), State = State, unique = TRUE), on = .(Year, State)
                               ][is.na(PolicyNo), PolicyNo := 0
                                 ][, .(PolicyNo = sum(PolicyNo)), by = .(Year, State)
                                   ][, .(Year, State, PolicyNo = cumsum(PolicyNo))]

The result of the above code:

   Year State PolicyNo
1: 2001    PA       15
2: 2002    PA       22
3: 2003    PA       22
4: 2004    PA       13
5: 2005    PA       15

As you can see, several steps are needed to reach the desired end result:

  • First you convert to a data.table ( setDT(dat) ), reshape it into long format, and remove the rows with no Year.
  • Then you make the PolicyNo values negative for the rows that come from 'Year_repealed'.
  • With a cross-join ( CJ ) you make sure that all the years for each state are present, and you convert the NA values in the PolicyNo column to zero.
  • Finally, you summarise by year and state and take a cumulative sum of the result.
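The same four steps can also be followed without any packages. The sketch below is a base R rendering of the steps listed above, not the data.table code itself. The intermediate names (`enacted`, `repealed`, `events`, `grid`, `annual`) are made up for this example, and the cumulative sum is taken within each state.

```r
dat <- data.frame(
  City = c("Pitt", "Phil.", "Pitt", "Pitt"),
  State = "PA",
  Year = c(2001L, 2001L, 2002L, 2005L),
  Year_repealed = c(NA, 2004L, NA, NA),
  PolicyNo = c(6L, 9L, 7L, 2L),
  stringsAsFactors = FALSE
)

# Steps 1 and 2: one row per event, with repeals entered as negative changes
enacted <- data.frame(State = dat$State, Year = dat$Year,
                      change = dat$PolicyNo, stringsAsFactors = FALSE)
rep_rows <- dat[!is.na(dat$Year_repealed), ]
repealed <- data.frame(State = rep_rows$State, Year = rep_rows$Year_repealed,
                       change = -rep_rows$PolicyNo, stringsAsFactors = FALSE)
events <- rbind(enacted, repealed)

# Step 3: add a zero-change row for every state-year so no year is missing
grid <- expand.grid(Year = min(events$Year):max(events$Year),
                    State = unique(events$State),
                    change = 0L, stringsAsFactors = FALSE)
events <- rbind(events, grid[, c("State", "Year", "change")])

# Step 4: net change per state-year, then a running total within each state
annual <- aggregate(change ~ Year + State, data = events, FUN = sum)
annual <- annual[order(annual$State, annual$Year), ]
annual$PolicyNo <- ave(annual$change, annual$State, FUN = cumsum)
print(annual[, c("Year", "State", "PolicyNo")])
```

Filling the grid with zero-change rows plays the role of the cross-join plus NA replacement in the data.table version.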
