
Fill in missing dates with NAs by group in R - with NA at end of date range as well

I would like to generate empty rows (NAs) for missing dates in a large dataset. For context, each individual (ID) in the dataset has several years of data.

Here is a simplified version of the data for two individuals:

table <- "ID    Date    dist.km
 1 1     2007-10-15     15147
 2 1     2007-10-16     15156
 3 1     2007-10-17     15173
 4 1     2007-10-18     15185
 5 1     2007-10-19     15194
 6 1     2007-10-25     15202
 7 1     2007-10-26     15216
 8 1     2007-10-27     15240
 9 1     2007-10-28     15270
10 1     2007-10-29     15290
11 2     2008-10-15     15147
12 2     2008-10-16     15156
13 2     2008-10-17     15173
14 2     2008-10-18     15185
15 2     2008-10-19     15194
16 2     2008-10-20     15202
17 2     2008-10-21     15216
18 2     2008-10-29     15240
19 2     2008-10-30     15270
20 2     2008-10-31     15290"

#Create a dataframe with the above table
df <- read.table(text=table, header = TRUE)
df

I first tried using complete():

library(tidyverse)

newdat <- complete(df, ID, Date)
newdat

The output is the original dataset plus NA rows for every date that appears only under the other ID, i.e. dates outside each ID's own date range. Gaps inside a range are not filled: for example, 2007-10-20 to 2007-10-24 is still missing for ID 1. So essentially it fills in NA values for dates outside my date range of distance data, but not within it.
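A minimal sketch of why this happens, using a hypothetical three-row data frame called toy (not part of the original data): complete(df, ID, Date) crosses the values of ID and Date that already occur in the data, so a day that is missing for every ID is never created.

```r
library(tidyr)

# Hypothetical toy data: ID 1 has a gap on 2007-10-16; ID 2 is from another year
toy <- data.frame(
  ID = c(1, 1, 2),
  Date = as.Date(c("2007-10-15", "2007-10-17", "2008-10-15")),
  dist.km = c(15147, 15173, 15147)
)

# Cross the observed IDs with the observed dates
filled <- complete(toy, ID, Date)

nrow(filled)                            # 2 IDs x 3 observed dates = 6 rows
as.Date("2007-10-16") %in% filled$Date  # FALSE: the gap day is not generated
```

Each ID picks up NA rows for the other ID's dates, which matches the output described above.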

I then tried this format:

library(dplyr)
library(tidyr)

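# seq.Date() needs a Date column; read.table() above does not import it as one
newdat2 <- df %>%
  mutate(Date = as.Date(Date)) %>%
  group_by(ID) %>%
  complete(Date = seq.Date(min(Date), max(Date), by = "day"))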
newdat2

This worked for the gaps, but it does not produce any dates outside each ID's date range. So the two approaches give opposite results. I would like at least one NA row at the end of each ID's date range to mark the end of that set. If this can't be done with complete(), maybe my question is: how can I add a blank NA row for a specific date in every year of my dataset? All datasets start on 10-15 and end on 02-15. So, how can I add one NA row for 02-16 for each ID in each year?

Any help would be appreciated.

I think you're close with your second attempt. If you want to manually enforce the limits of the expansion, you can do that inside the complete() call. It wasn't clear which limits you were after, but perhaps the code below can get you there. Note that I used two date ranges because it seemed like you wanted to cover two time ranges; adjust if I misunderstood. The ranges can also be built programmatically if you have those dates stored somewhere. Also, I converted your Date column to an actual date format using as.Date() during import.

library(tidyverse)

table <- "ID    Date    dist.km\n 1 1     2007-10-15     15147\n 2 1     2007-10-16     15156\n 3 1     2007-10-17     15173\n 4 1     2007-10-18     15185\n 5 1     2007-10-19     15194\n 6 1     2007-10-25     15202\n 7 1     2007-10-26     15216\n 8 1     2007-10-27     15240\n 9 1     2007-10-28     15270\n10 1     2007-10-29     15290\n11 2     2008-10-15     15147\n12 2     2008-10-16     15156\n13 2     2008-10-17     15173\n14 2     2008-10-18     15185\n15 2     2008-10-19     15194\n16 2     2008-10-20     15202\n17 2     2008-10-21     15216\n18 2     2008-10-29     15240\n19 2     2008-10-30     15270\n20 2     2008-10-31     15290"

#Create a dataframe with the above table
df <- read.table(text=table, header = TRUE) %>% 
  mutate(Date = as.Date(Date))

# expand by feeding the limits of the date ranges to cover
newdat2 <- df %>%
  group_by(ID) %>%
  complete(Date = c(
    seq.Date(
      from = as.Date("2007-10-15"),
      to = as.Date("2008-02-15"),
      by = "day"
    ),
    seq.Date(
      from = as.Date("2008-10-15"),
      to = as.Date("2009-02-15"),
      by = "day"
    )
  ))

newdat2

#> # A tibble: 496 x 3
#> # Groups:   ID [2]
#>       ID Date       dist.km
#>    <int> <date>       <int>
#>  1     1 2007-10-15   15147
#>  2     1 2007-10-16   15156
#>  3     1 2007-10-17   15173
#>  4     1 2007-10-18   15185
#>  5     1 2007-10-19   15194
#>  6     1 2007-10-20      NA
#>  7     1 2007-10-21      NA
#>  8     1 2007-10-22      NA
#>  9     1 2007-10-23      NA
#> 10     1 2007-10-24      NA
#> # ... with 486 more rows

Created on 2021-03-15 by the reprex package (v1.0.0)
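If each ID should only be expanded over its own season, the limits can be derived per group instead of being hard-coded. The following is a sketch under two assumptions that are not confirmed by the question: each ID's first observation falls in the autumn the season starts, and the marker row belongs on 02-16 of the following year. The names toy and newdat3 are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with one season per ID
toy <- data.frame(
  ID = c(1, 1, 2, 2),
  Date = as.Date(c("2007-10-15", "2007-10-17", "2008-10-15", "2008-10-16")),
  dist.km = c(15147, 15173, 15147, 15156)
)

newdat3 <- toy %>%
  group_by(ID) %>%
  complete(Date = seq.Date(
    from = min(Date),
    # Assumption: the season ends in the year after the first observation,
    # and 02-16 is the extra day that marks the end of the set
    to = as.Date(paste0(as.integer(format(min(Date), "%Y")) + 1, "-02-16")),
    by = "day"
  )) %>%
  ungroup()

# Last row of each ID: the 02-16 marker with NA in dist.km
newdat3 %>% group_by(ID) %>% slice_tail(n = 1)
```

If a single NA row directly after the last observed day is enough, using to = max(Date) + 1 in the same call would be the simpler choice.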


 