
Filling in missing dates with most recent data in data frame in R

I have a data frame with the columns country, date, identifier, cumulative_id, and cumulative_country. The data are grouped by country, date, and identifier. However, some country and identifier combinations have missing dates: on those days the country did not submit data for that identifier. I would like to include these dates, using the data from the most recent submission.

The data must be grouped by country, date, and identifier. For example, given the data frame below:

country       date        identifier       cumulative_id         cumulative_country
France      2021-03-20       B.1.1.7                 3528                     12158
France      2021-03-15       B.1.1.7                 3520                     12150
France      2021-03-15       B.1.2                     50                     12142
France      2021-03-14       B.1.2                     48                     12140
Morocco     2021-03-16       B.1.1.7                  232                      5636
Morocco     2020-03-01       B.1.1.7                  220                      5624

In the example above, many dates are missing. The added dates would use the information from the most recent submission, so France and Morocco should look like this:

country          date        identifier       cumulative_id         cumulative_country
France      2021-03-20         B.1.1.7                 3528                     12158
France      2021-03-19         B.1.1.7                 3520                     12150
France      2021-03-18         B.1.1.7                 3520                     12150
France      2021-03-17         B.1.1.7                 3520                     12150
France      2021-03-16         B.1.1.7                 3520                     12150
France      2021-03-20         B.1.2                     50                     12142
France      2021-03-19         B.1.2                     50                     12142
France      2021-03-18         B.1.2                     50                     12142
France      2021-03-17         B.1.2                     50                     12142
France      2021-03-16         B.1.2                     50                     12142
France      2021-03-15         B.1.2                     50                     12142
France      2021-03-14         B.1.2                     48                     12140
France      2021-03-13         B.1.2                     48                     12140
Morocco     2021-03-20       B.1.1.7                    232                      5636
Morocco     2021-03-19       B.1.1.7                    232                      5636
Morocco     2021-03-18       B.1.1.7                    232                      5636
Morocco     2021-03-17       B.1.1.7                    232                      5636
Morocco     2021-03-16       B.1.1.7                    232                      5636
Morocco     2021-03-15       B.1.1.7                    220                      5624
...
Morocco     2021-03-01       B.1.1.7                    220                      5624

This is what I have tried with Aurèle's suggestion. The resulting data frame, however, is identical to the original, with no changes. It also takes 8 minutes to complete, since there are already over 100,000 observations in the dataset.

horizontal$date <- as.Date(horizontal$date)


df <- df %>% 
  complete(nesting(country, pango_lineage), date = full_seq(date, 1)) %>% 
  group_by(country, pango_lineage) %>% 
  mutate(across(c(cum_country_pang, cum_country), zoo::na.locf, na.rm = FALSE)) %>% 
  filter(!is.na(cum_country_pang))


Using tidyr::complete and zoo::na.locf (Last Observation Carried Forward):

library(tidyr)
library(dplyr)

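df %>%
  # add a row for every (country, identifier) pair on every day in the overall date range
  complete(nesting(country, identifier), date = full_seq(date, 1)) %>%
  group_by(country, identifier) %>%
  # carry the last observed values forward within each group
  mutate(across(c(cumulative_id, cumulative_country), ~ zoo::na.locf(.x, na.rm = FALSE))) %>%
  # drop days before a group's first submission (nothing to carry forward yet)
  filter(!is.na(cumulative_id))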

#> # A tibble: 398 x 5
#> # Groups:   country, identifier [3]
#>    country identifier date       cumulative_id cumulative_country
#>    <chr>   <chr>      <date>             <int>              <int>
#>  1 France  B.1.1.7    2021-03-15          3520              12150
#>  2 France  B.1.1.7    2021-03-16          3520              12150
#>  3 France  B.1.1.7    2021-03-17          3520              12150
#>  4 France  B.1.1.7    2021-03-18          3520              12150
#>  5 France  B.1.1.7    2021-03-19          3520              12150
#>  6 France  B.1.1.7    2021-03-20          3528              12158
#>  7 France  B.1.2      2021-03-14            48              12140
#>  8 France  B.1.2      2021-03-15            50              12142
#>  9 France  B.1.2      2021-03-16            50              12142
#> 10 France  B.1.2      2021-03-17            50              12142
#> # ... with 388 more rows
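
To make the individual steps of the pipeline above easier to follow, here is a minimal sketch on a small made-up data frame. The `toy` data and its `value` column are hypothetical and only serve as illustration. It shows that `full_seq(date, 1)` inside `complete()` spans the date range of the whole column (all groups pooled), which is why every group is extended to the overall latest date and why rows before a group's first submission have to be filtered out afterwards.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two groups with different date ranges
toy <- tibble(
  country    = c("A", "A", "B"),
  identifier = c("x", "x", "x"),
  date       = as.Date(c("2021-03-01", "2021-03-04", "2021-03-03")),
  value      = c(10L, 12L, 7L)
)

# Step 1: one row per group and per day of the overall date range;
# the newly created rows hold NA in `value`
completed <- toy %>%
  complete(nesting(country, identifier), date = full_seq(date, 1))

# Step 2: carry the last observation forward within each group
filled <- completed %>%
  group_by(country, identifier) %>%
  mutate(value = zoo::na.locf(value, na.rm = FALSE)) %>%
  ungroup()

# Step 3: drop days before the group's first submission (still NA)
result <- filter(filled, !is.na(value))
```

Group B has no submission before 2021-03-03, so its first two days remain NA after step 2 and are removed in step 3.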

Data:

df <- read.table(text =
'country       date        identifier       cumulative_id         cumulative_country
France      2021-03-20       B.1.1.7                 3528                     12158
France      2021-03-15       B.1.1.7                 3520                     12150
France      2021-03-15       B.1.2                     50                     12142
France      2021-03-14       B.1.2                     48                     12140
Morocco     2021-03-16       B.1.1.7                  232                      5636
Morocco     2020-03-01       B.1.1.7                  220                      5624
', header = TRUE)
df$date <- as.Date(df$date)

Instead of zoo::na.locf, just use tidyr::fill:

library(dplyr)
library(tidyr)

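df %>%
    # add a row for every (country, identifier) pair on every day in the overall date range
    complete(nesting(country, identifier), date = full_seq(date, 1)) %>%
    group_by(country, identifier) %>%
    # fill NAs downwards with the last observed value within each group
    fill(c(cumulative_id, cumulative_country), .direction = "down") %>%
    # drop days before a group's first submission
    filter(!is.na(cumulative_id))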
#> # A tibble: 398 x 5
#> # Groups:   country, identifier [3]
#>    country identifier date       cumulative_id cumulative_country
#>    <chr>   <chr>      <date>             <int>              <int>
#>  1 France  B.1.1.7    2021-03-15          3520              12150
#>  2 France  B.1.1.7    2021-03-16          3520              12150
#>  3 France  B.1.1.7    2021-03-17          3520              12150
#>  4 France  B.1.1.7    2021-03-18          3520              12150
#>  5 France  B.1.1.7    2021-03-19          3520              12150
#>  6 France  B.1.1.7    2021-03-20          3528              12158
#>  7 France  B.1.2      2021-03-14            48              12140
#>  8 France  B.1.2      2021-03-15            50              12142
#>  9 France  B.1.2      2021-03-16            50              12142
#> 10 France  B.1.2      2021-03-17            50              12142
#> # … with 388 more rows
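
Since the question mentions long run times on more than 100,000 rows, a rolling join in data.table is another way to express the same last-observation-carried-forward logic. This is only a sketch and not part of either answer above: it assumes the data.table package is installed, the names `dt`, `grid`, `last_day` and `filled` are made up for illustration, and it has not been benchmarked against the tidyr versions.

```r
library(data.table)

# Same sample data as in the "Data:" block above, built as a data.table
dt <- data.table(
  country            = c("France", "France", "France", "France", "Morocco", "Morocco"),
  date               = as.Date(c("2021-03-20", "2021-03-15", "2021-03-15",
                                 "2021-03-14", "2021-03-16", "2020-03-01")),
  identifier         = c("B.1.1.7", "B.1.1.7", "B.1.2", "B.1.2", "B.1.1.7", "B.1.1.7"),
  cumulative_id      = c(3528L, 3520L, 50L, 48L, 232L, 220L),
  cumulative_country = c(12158L, 12150L, 12142L, 12140L, 5636L, 5624L)
)

# Target grid: for each (country, identifier), every day from that group's
# first submission up to the latest date found anywhere in the data
last_day <- max(dt$date)
grid <- dt[, .(date = seq(min(date), last_day, by = "day")),
           by = .(country, identifier)]

# Rolling join: for each grid row take the observation on that date or,
# if there is none, the most recent earlier one in the same group
filled <- dt[grid, on = .(country, identifier, date), roll = TRUE]
```

On the sample data this produces 398 rows, the same count as the tidyr output above, because the grid starts at each group's first submission and therefore never contains days without an earlier value.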

Created on 2021-04-02 by the reprex package (v1.0.0)
