How to fill NAs from a chronological list in R?

Question

Say I have a chronological list of household incomes and a data frame of town IDs with household incomes, but the data frame contains some NAs that I want to fill in.

HouseholdIncome_list <- c(10000, 20000, 30000, 40000, 50000, 60000, 70000)
                                                 
Town_ID <- c("A", "A", "A", "A", "B", "B", "B", "B", "B")
HouseholdIncome <- c(10000, 40000, 50000, NA, 20000, 40000, NA, NA, 60000)

df <- data.frame(Town_ID, HouseholdIncome)

  Town_ID HouseholdIncome
1       A           10000
2       A           40000
3       A           50000
4       A              NA
5       B           20000
6       B           40000
7       B              NA
8       B              NA
9       B           60000

How do I fill in the NAs in the data frame so that the missing values are taken from the list? The result should look like the df below:

  Town_ID HouseholdIncome
1       A           10000
2       A           40000
3       A           50000
4       A           60000
5       B           20000
6       B           40000
7       B           50000
8       B           50000
9       B           60000

I have spent time searching for some sort of NA fill option but cannot find one that relates to a given list.

Answer 1

This is a terrible solution, but it gets your job done.

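library(tidyr)
library(dplyr)

df %>% 
  group_by(grp = cumsum(!is.na(HouseholdIncome))) %>% 
  rowwise() %>%
  # For each observed income, look up the entry that follows it in the list;
  # rows with a missing income get NA here
  mutate(Income = ifelse(length(which(HouseholdIncome_list == HouseholdIncome)) > 0,
    HouseholdIncome_list[which(HouseholdIncome_list == HouseholdIncome) + 1],
    NA_real_)) %>% 
  ungroup() %>% 
  # Carry the looked-up "next" income down into the rows with a missing income
  fill(Income) %>% 
  # Keep observed incomes; use the carried value only where the income is NA
  mutate(HouseholdIncome = ifelse(is.na(HouseholdIncome), Income, HouseholdIncome)) %>% 
  select(Town_ID, HouseholdIncome)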

returns

# A tibble: 9 x 2
  Town_ID HouseholdIncome
  <chr>             <dbl>
1 A                 10000
2 A                 40000
3 A                 50000
4 A                 60000
5 B                 20000
6 B                 40000
7 B                 50000
8 B                 50000
9 B                 60000

If your first item is NA, this won't work, because there is no earlier value to carry forward.
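The lookup-and-carry-forward idea in the pipeline above can also be sketched in base R with match(). This is only an illustration, not the answerer's code: the names pos, next_income, obs, carried and result are made up here, and like the pipeline it assumes the first value is not NA and does not look at Town_ID.

```r
HouseholdIncome_list <- c(10000, 20000, 30000, 40000, 50000, 60000, 70000)
HouseholdIncome <- c(10000, 40000, 50000, NA, 20000, 40000, NA, NA, 60000)

# Position of each observed income in the list (NA where the income is missing)
pos <- match(HouseholdIncome, HouseholdIncome_list)

# The list entry that follows each observed income (NA for missing incomes)
next_income <- HouseholdIncome_list[pos + 1]

# Carry the most recent "next" value forward; assumes the first value is observed
obs <- !is.na(HouseholdIncome)
carried <- next_income[obs][cumsum(obs)]

# Keep observed incomes, use the carried value only where the income is NA
result <- ifelse(obs, HouseholdIncome, carried)
result
# 10000 40000 50000 60000 20000 40000 50000 50000 60000
```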

Answer 2

Here is another approach based on joins that will also impute the first value of a group in case it is missing:

library(tidyverse)

rdf <- data.frame(HouseholdIncome_list = c(10000, 20000, 30000,40000,50000, 60000, 70000)) %>%
    dplyr::mutate(rn = as.double(dplyr::row_number()))
                                             
df <- data.frame(Town_ID = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
                 HouseholdIncome = c(10000, 40000, 50000, NA, 20000, 40000, NA, NA, 60000))

df %>%
    dplyr::left_join(rdf, by = c("HouseholdIncome" = "HouseholdIncome_list")) %>%
    dplyr::group_by(Town_ID) %>%
    tidyr::fill(rn, .direction = "down") %>%
    tidyr::fill(rn, .direction = "up") %>%
    dplyr::mutate(rn2 = dplyr::row_number()) %>%
    dplyr::ungroup() %>% 
    dplyr::mutate(rn = case_when(is.na(HouseholdIncome) & rn2 == 1 & rn == min(rdf$rn) ~ rn,
                                 is.na(HouseholdIncome) & rn2 == 1 ~ rn - 1,
                                 is.na(HouseholdIncome) & rn < max(rdf$rn) ~ rn + 1,
                                 TRUE ~ rn)) %>%
    dplyr::left_join(rdf, by = "rn") %>%
    select(Town_ID, HouseholdIncome = HouseholdIncome_list)

# A tibble: 9 x 2
  Town_ID HouseholdIncome
  <chr>             <dbl>
1 A                 10000
2 A                 40000
3 A                 50000
4 A                 60000
5 B                 20000
6 B                 40000
7 B                 50000
8 B                 50000
9 B                 60000
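To illustrate the missing-first-value case, here is a sketch that runs the same steps on a hypothetical town "C" whose first income is NA. Town "C" and the names df_c and res_c are assumptions added for illustration and are not part of the original data.

```r
library(dplyr)
library(tidyr)

# Lookup table: each list value together with its position rn, as in the answer
rdf <- data.frame(HouseholdIncome_list = c(10000, 20000, 30000, 40000, 50000, 60000, 70000)) %>%
    mutate(rn = as.double(row_number()))

# Hypothetical town whose first and last incomes are missing
df_c <- data.frame(Town_ID = c("C", "C", "C"),
                   HouseholdIncome = c(NA, 30000, NA))

res_c <- df_c %>%
    left_join(rdf, by = c("HouseholdIncome" = "HouseholdIncome_list")) %>%
    group_by(Town_ID) %>%
    fill(rn, .direction = "down") %>%   # trailing NA takes the previous position
    fill(rn, .direction = "up") %>%     # leading NA takes the following position
    mutate(rn2 = row_number()) %>%
    ungroup() %>%
    mutate(rn = case_when(is.na(HouseholdIncome) & rn2 == 1 & rn == min(rdf$rn) ~ rn,
                          is.na(HouseholdIncome) & rn2 == 1 ~ rn - 1,          # step back for a leading NA
                          is.na(HouseholdIncome) & rn < max(rdf$rn) ~ rn + 1,  # step forward otherwise
                          TRUE ~ rn)) %>%
    left_join(rdf, by = "rn") %>%
    select(Town_ID, HouseholdIncome = HouseholdIncome_list)

res_c$HouseholdIncome
# 20000 30000 40000
```

The leading NA is filled upward to position 3 and then moved back one step to 20000, while the trailing NA is moved forward one step to 40000.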

Answer 3

I would "cheat" a bit using tidyverse. Clearly, the household income is in intervals of 10,000, and we can therefore utilise this:

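library(tidyverse)

df %>%
  mutate(is_na = as.numeric(is.na(HouseholdIncome)) * 10000) %>%  # 10000 for NA rows, 0 otherwise
  fill(HouseholdIncome, .direction = "down") %>%                   # carry the last observed value forward
  mutate(
    HouseholdIncome = HouseholdIncome + is_na,  # step up to the next interval
    is_na = NULL                                # drop the helper column
  )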

First we check for NA: is_na is 1 * 10000 where HouseholdIncome is NA and 0 otherwise. Then we use fill to carry the last observed value forward.

In the end we add our cheater variable is_na to HouseholdIncome to get the next HouseholdIncome interval.
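The arithmetic can be followed step by step on town B's incomes alone. This is a small sketch; the vector x and the names filled and result are only for illustration, and the trick relies on the assumption that consecutive list entries differ by exactly 10000.

```r
library(tidyr)

x <- c(20000, 40000, NA, NA, 60000)   # town B's incomes from the question

# Helper: 10000 where the income is missing, 0 otherwise
is_na <- as.numeric(is.na(x)) * 10000
is_na
# 0 0 10000 10000 0

# Carry the last observed income forward into the NA positions
filled <- fill(data.frame(x = x), x, .direction = "down")$x
filled
# 20000 40000 40000 40000 60000

# Adding the helper bumps only the filled positions up by one interval
result <- filled + is_na
result
# 20000 40000 50000 50000 60000
```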

The output is the following:

  Town_ID HouseholdIncome
1       A           10000
2       A           40000
3       A           50000
4       A           60000
5       B           20000
6       B           40000
7       B           50000
8       B           50000
9       B           60000

Answer 4

A possible base R option:

transform(
    df,
    HouseholdIncome = ave(
        HouseholdIncome,
        Town_ID,
        FUN = function(x) replace(x, is.na(x), x[min(which(is.na(x))) - 1] + 1e4)
    )
)

gives

  Town_ID HouseholdIncome
1       A           10000
2       A           40000
3       A           50000
4       A           60000
5       B           20000
6       B           40000
7       B           50000
8       B           50000
9       B           60000
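What ave() does here is apply the anonymous function to each town's incomes separately. As a sketch, the same function can be named and run on town B alone; fill_group and town_b are hypothetical names used only for this illustration. Note that it assumes each group contains at least one NA, that the first value of the group is not NA, and that the list advances in steps of 10000.

```r
# Same logic as the anonymous function passed to ave(), given a name for clarity
fill_group <- function(x) {
    first_na <- min(which(is.na(x)))             # position of the first missing income
    replace(x, is.na(x), x[first_na - 1] + 1e4)  # value just before it, plus one step
}

town_b <- c(20000, 40000, NA, NA, 60000)
fill_group(town_b)
# 20000 40000 50000 50000 60000
```

Every NA in the group receives the same value, namely the income just before the first NA plus 10000.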

Answer 1: 2, 2021-07-22 22:15:05

Answer 2: 2, accepted, 2021-07-22 22:32:17

Answer 3: 2, 2021-07-22 22:34:55

Answer 4: 2, 2021-07-22 22:58:48
