
How to summarize date data by groups in R

I would like to summarize the following sample data into a new data frame with these columns:

Population, Sample Size (N), Percent Completed (%)

Sample Size is a count of all records for each population; I can get this with table or tapply. Percent Completed is the percentage of records that have an End Date (records without an End Date are assumed not to have completed). This is where I am lost!

Sample Data

 sample <- structure(list(Population = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 
    2L, 2L, 3L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 
    1L, 2L, 2L, 3L, 3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 3L), .Label = c("Glommen", 
    "Kaseberga", "Steninge"), class = "factor"), Start_Date = structure(c(16032, 
    16032, 16032, 16032, 16032, 16036, 16036, 16036, 16037, 16038, 
    16038, 16039, 16039, 16039, 16039, 16039, 16039, 16041, 16041, 
    16041, 16041, 16041, 16041, 16044, 16044, 16045, 16045, 16045, 
    16045, 16048, 16048, 16048, 16048, 16048, 16048), class = "Date"), 
        End_Date = structure(c(NA, 16037, NA, NA, 16036, 16043, 16040, 
        16041, 16042, 16042, 16042, 16043, 16043, 16043, 16043, 16043, 
        16043, 16045, 16045, 16045, 16045, 16045, NA, 16048, 16048, 
        16049, 16049, NA, NA, 16052, 16052, 16052, 16052, 16052, 
        16052), class = "Date")), .Names = c("Population", "Start_Date", 
    "End_Date"), row.names = c(NA, 35L), class = "data.frame")

You can do this with split/apply/combine:

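# Split the data frame into one piece per population
spl = split(sample, sample$Population)
# Build a one-row summary for each piece
new.rows = lapply(spl, function(x) data.frame(Population=x$Population[1],
                                              SampleSize=nrow(x),
                                              # proportion of rows with a non-missing End_Date;
                                              # multiply by 100 for a percentage
                                              PctComplete=sum(!is.na(x$End_Date))/nrow(x)))
# Stack the per-population rows into a single data frame
combined = do.call(rbind, new.rows)
combined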

#           Population SampleSize PctComplete
# Glommen      Glommen         13   0.6923077
# Kaseberga  Kaseberga          7   1.0000000
# Steninge    Steninge         15   0.8666667

One word of warning: sample is also the name of a base R function, so you should pick a different name for your data frame.
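Since the question mentions tapply, here is a minimal base R sketch of the same idea with a data frame name that does not clash with sample(). The data frame trials and its values are made up for illustration (they are not the asker's data), and the output column names simply mirror the ones requested in the question.

```r
# Toy data (hypothetical): an NA End_Date means the record did not complete.
trials <- data.frame(
  Population = factor(c("A", "A", "A", "B", "B")),
  End_Date   = as.Date(c("2013-12-01", NA, "2013-12-03",
                         "2013-12-04", "2013-12-05"))
)

# Number of records per population.
n <- tapply(trials$End_Date, trials$Population, length)

# !is.na() gives TRUE/FALSE; mean() treats these as 1/0,
# so the mean is the share of records with an End_Date.
pct <- tapply(trials$End_Date, trials$Population,
              function(d) mean(!is.na(d)) * 100)

# Combine both results into one summary data frame.
summary_df <- data.frame(
  Population        = names(n),
  Sample_Size       = as.vector(n),
  Percent_Completed = as.vector(pct)
)
summary_df
```

Taking the mean of a logical vector is the key step: it avoids counting completed and total records separately.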

It's easy with the plyr package:

library(plyr)
ddply(sample, .(Population), summarize, 
      Sample_Size = length(End_Date),
      Percent_Completed = mean(!is.na(End_Date)) * 100)

#   Population Sample_Size Percent_Completed
# 1    Glommen          13          69.23077
# 2  Kaseberga           7         100.00000
# 3   Steninge          15          86.66667
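The plyr package has since been retired by its maintainers in favour of dplyr, so a roughly equivalent dplyr sketch may be useful. It assumes dplyr is installed; the data frame trials and its values are hypothetical and only serve to make the example self-contained.

```r
library(dplyr)

# Toy data (hypothetical): an NA End_Date means the record did not complete.
trials <- data.frame(
  Population = factor(c("A", "A", "A", "B", "B")),
  End_Date   = as.Date(c("2013-12-01", NA, "2013-12-03",
                         "2013-12-04", "2013-12-05"))
)

result <- trials %>%
  group_by(Population) %>%
  summarise(
    Sample_Size       = n(),                           # rows in each group
    Percent_Completed = mean(!is.na(End_Date)) * 100   # share with an End_Date
  )
result
```

The summary expressions are the same as in the ddply call; only the grouping syntax differs.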
