Calculate difference between dates by group in R

I'm using logistic exposure to calculate hatching success for bird nests. My data set is quite extensive: I have ~2,000 nests, each with a unique ID ("ClutchID"). I need to calculate the number of days a given nest was exposed ("Exposure"), or more simply, the difference between the first and last day. I used the following code:

HS_Hatch$Exposure=NA    
for(i in 2:nrow(HS_Hatch)){HS_Hatch$Exposure[i]=HS_Hatch$DateVisit[i]- HS_Hatch$DateVisit[i-1]}

where HS_Hatch is my dataset and DateVisit is the actual date. The only problem is that R is calculating an exposure value for the first date (which doesn't make sense).

What I really need is to calculate the difference between the first and last date for a given clutch. I've also looked into the following:

Exposure=ddply(HS_Hatch, "ClutchID", summarize, 
                     orderfrequency = as.numeric(diff.Date(DateVisit)))


df %>%
  mutate(Exposure =  as.Date(HS_Hatch$DateVisit, "%Y-%m-%d")) %>%
  group_by(ClutchID) %>%
  arrange(Exposure) %>%
  mutate(lag=lag(DateVisit), difference=DateVisit-lag)

I'm still learning R, so any help would be greatly appreciated.

Edit: Below is a sample of the data I'm using:

HS_Hatch <- structure(list(
  ClutchID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
               4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L),
  DateVisit = c("3/15/2012", "3/18/2012", "3/20/2012", "4/1/2012",
                "4/3/2012", "3/18/2012", "3/20/2012", "3/22/2012", "4/3/2012",
                "4/4/2012", "3/22/2012", "4/3/2012", "4/4/2012", "3/18/2012",
                "3/20/2012", "3/22/2012", "4/2/2012", "4/3/2012", "4/4/2012",
                "3/20/2012", "3/22/2012", "3/25/2012", "3/27/2012", "4/4/2012",
                "4/5/2012"),
  Year = c(2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L,
           2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L,
           2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L),
  Survive = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
              1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)),
  class = c("tbl_df", "tbl", "data.frame"),
  row.names = c(NA, -25L),
  .Names = c("ClutchID", "DateVisit", "Year", "Survive"),
  spec = structure(list(
    cols = structure(list(
      ClutchID = structure(list(), class = c("collector_integer", "collector")),
      DateVisit = structure(list(), class = c("collector_character", "collector")),
      Year = structure(list(), class = c("collector_integer", "collector")),
      Survive = structure(list(), class = c("collector_integer", "collector"))),
      .Names = c("ClutchID", "DateVisit", "Year", "Survive")),
    default = structure(list(), class = c("collector_guess", "collector"))),
    .Names = c("cols", "default"), class = "col_spec"))

Collecting some of the comments...

Load dplyr

We need only the dplyr package for this problem. If we load other packages, e.g. plyr, it can cause conflicts when both packages have functions with the same name. Let's load only dplyr.

library(dplyr)

In the future, you may wish to load tidyverse instead; it includes dplyr and other related packages, for graphics, etc.

Converting dates

Let's convert the DateVisit variable from character strings to something R can interpret as a date. Once we do this, R can calculate a difference in days by subtracting one date from another.

HS_Hatch <- HS_Hatch %>%
 mutate(date_visit = as.Date(DateVisit, "%m/%d/%Y"))

The date format %m/%d/%Y is different from your original code. The format needs to match how dates look in your data. DateVisit has dates as month/day/year, so we use %m/%d/%Y.

Also, you don't need to specify the dataset for DateVisit inside mutate, as in HS_Hatch$DateVisit, because it's already looking in HS_Hatch. The code HS_Hatch %>% ... says 'use HS_Hatch for the following steps'.
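As a quick check of the conversion, here is a minimal sketch using two date strings taken from the sample data. The object names visits, parsed and days_between are only illustrative and are not part of the original code.

```r
# Two DateVisit values from the sample data, stored as month/day/year strings
visits <- c("3/15/2012", "4/3/2012")

# With the matching format, the strings parse into Date values
parsed <- as.Date(visits, format = "%m/%d/%Y")

# Subtracting two Date values gives a difference measured in days
days_between <- as.numeric(parsed[2] - parsed[1])
days_between

# With the format from the question, the same strings fail to parse (NA)
as.Date(visits, format = "%Y-%m-%d")
```

The value of days_between should agree with the exposure reported for clutch 1 in the summary table.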

Calculating exposures

To calculate exposure, we need to find the first date, the last date, and then the difference between the two, for each set of rows sharing a ClutchID. We use summarize, which collapses the data to one row per ClutchID.

exposure <- HS_Hatch %>% 
    group_by(ClutchID) %>%
    summarize(first_visit = min(date_visit), 
              last_visit = max(date_visit), 
              exposure = last_visit - first_visit)

first_visit = min(date_visit) will find the minimum date_visit for each ClutchID separately, since we are using group_by(ClutchID).

exposure = last_visit - first_visit takes the newly calculated first_visit and last_visit and finds the difference in days.

This creates the following result:

  ClutchID first_visit last_visit exposure
     <int>      <date>     <date>    <dbl>
1        1  2012-03-15 2012-04-03       19
2        2  2012-03-18 2012-04-04       17
3        3  2012-03-22 2012-04-04       13
4        4  2012-03-18 2012-04-04       17
5        5  2012-03-20 2012-04-05       16

If you want to keep all the original rows, you can use mutate in place of summarize.
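A minimal sketch of that mutate variant, assuming dplyr is installed. The nests data frame is a small hand-copied subset of the sample data (clutches 1 and 3 only), and the names nests and exposure_all_rows are illustrative.

```r
library(dplyr)

# A small subset of the sample data (some visits from clutches 1 and 3)
nests <- data.frame(
  ClutchID  = c(1L, 1L, 1L, 3L, 3L),
  DateVisit = c("3/15/2012", "3/18/2012", "4/3/2012", "3/22/2012", "4/4/2012")
)

exposure_all_rows <- nests %>%
  mutate(date_visit = as.Date(DateVisit, "%m/%d/%Y")) %>%
  group_by(ClutchID) %>%
  # mutate() keeps every visit row and repeats the per-clutch value on each one
  mutate(exposure = as.numeric(max(date_visit) - min(date_visit))) %>%
  ungroup()

exposure_all_rows
```

Every row of a clutch carries the same exposure value, so this form is convenient when the exposure has to sit next to the per-visit columns.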

Here is a similar solution if you want the differences in days between consecutive dates in a vector (here called date), without NA values in the new column, and if you need to group by several conditions/groups.

Make sure that your date vector has been converted to the proper format, as previously explained.

dat2 <- dat %>%
  select(group1, group2, date) %>%
  arrange(group1, group2, date) %>%
  group_by(group1, group2) %>%
  mutate(diff_date = c(0, diff(date)))
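To see what c(0, diff(date)) produces for a single group, here is a small base R sketch using the three visit dates of clutch 3 from the question's sample data. The names dates and diff_date are illustrative.

```r
# Visit dates for one group, already sorted in ascending order
dates <- as.Date(c("3/22/2012", "4/3/2012", "4/4/2012"), format = "%m/%d/%Y")

# diff() gives the gap in days between consecutive visits;
# the leading 0 fills the first row, which has no previous visit
diff_date <- c(0, diff(dates))
diff_date

# Summing the gaps recovers the first-to-last difference for the group
sum(diff_date)
```

Note that this column holds the gap since the previous visit on each row, not the total exposure; the total is the sum of the gaps within a group.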
