How to fill NAs with a chronological list in R?

Say I have a chronological list of household incomes and a dataframe of town IDs with household incomes, but it contains some NAs that I want to fill in.

HouseholdIncome_list <- c(10000, 20000, 30000, 40000, 50000, 60000, 70000)

Town_ID <- c("A", "A", "A", "A", "B", "B", "B", "B", "B")
HouseholdIncome <- c(10000, 40000, 50000, NA, 20000, 40000, NA, NA, 60000)

df <- data.frame(Town_ID, HouseholdIncome)

  Town_ID HouseholdIncome
1       A           10000
2       A           40000
3       A           50000
4       A              NA
5       B           20000
6       B           40000
7       B              NA
8       B              NA
9       B           60000

How do I fill in the NAs in the dataframe so that the missing values are taken from the list? The result should look like the df below:

  Town_ID HouseholdIncome
1       A           10000
2       A           40000
3       A           50000
4       A           60000
5       B           20000
6       B           40000
7       B           50000
8       B           50000
9       B           60000

I have spent time searching for some sort of NA fill option but cannot find one that relates to a given list.

This is a terrible solution, but it gets the job done.

library(tidyr)
library(dplyr)

df %>% 
  group_by(grp = cumsum(!is.na(HouseholdIncome))) %>% 
  rowwise() %>%
  mutate(Income = ifelse(length(which(HouseholdIncome_list == HouseholdIncome)) > 0,
    HouseholdIncome_list[which(HouseholdIncome_list == HouseholdIncome) + 1],
    NA_real_)) %>% 
  ungroup() %>% 
  fill(Income) %>% 
  mutate(HouseholdIncome = ifelse(is.na(HouseholdIncome), Income, HouseholdIncome)) %>% 
  select(Town_ID, HouseholdIncome)

returns

# A tibble: 9 x 2
  Town_ID HouseholdIncome
  <chr>             <dbl>
1 A                 10000
2 A                 40000
3 A                 50000
4 A                 60000
5 B                 20000
6 B                 40000
7 B                 50000
8 B                 50000
9 B                 60000

If the first item is NA, this won't work.
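
The lookup step in the pipeline above (find the last observed value in HouseholdIncome_list and take the element after it) can also be written with match(). The following is only a sketch under assumptions: the helper name fill_next_in_list is made up for illustration, every observed income is assumed to appear exactly in the list, and a leading NA in a group is simply left as NA.

```r
HouseholdIncome_list <- c(10000, 20000, 30000, 40000, 50000, 60000, 70000)

df <- data.frame(
  Town_ID = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
  HouseholdIncome = c(10000, 40000, 50000, NA, 20000, 40000, NA, NA, 60000)
)

# Hypothetical helper (not from any package): replace each NA with the
# element of `lookup` that follows the last observed value before it.
fill_next_in_list <- function(x, lookup) {
  # Index of the most recent non-NA entry at or before each position (0 if none)
  last_obs_idx <- cummax(ifelse(is.na(x), 0L, seq_along(x)))
  # Carry the last observed value forward; positions with no earlier value stay NA
  last_obs <- x[replace(last_obs_idx, last_obs_idx == 0, NA)]
  # Position of that value in the lookup list, shifted by one
  next_val <- lookup[match(last_obs, lookup) + 1]
  ifelse(is.na(x), next_val, x)
}

# Apply the helper within each town
df$HouseholdIncome <- ave(
  df$HouseholdIncome,
  df$Town_ID,
  FUN = function(x) fill_next_in_list(x, HouseholdIncome_list)
)

df
```

Because ave() works per Town_ID, a value is never carried from one town into the next.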

Here is another approach based on joins that will also impute the first value of a group in case it is missing:

library(tidyverse)

rdf <- data.frame(HouseholdIncome_list = c(10000, 20000, 30000,40000,50000, 60000, 70000)) %>%
    dplyr::mutate(rn = as.double(dplyr::row_number()))
                                             
df <- data.frame(Town_ID = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
                 HouseholdIncome = c(10000, 40000, 50000, NA, 20000, 40000, NA, NA, 60000))

df %>%
    dplyr::left_join(rdf, by = c("HouseholdIncome" = "HouseholdIncome_list")) %>%
    dplyr::group_by(Town_ID) %>%
    tidyr::fill(rn, .direction = "down") %>%
    tidyr::fill(rn, .direction = "up") %>%
    dplyr::mutate(rn2 = dplyr::row_number()) %>%
    dplyr::ungroup() %>% 
    dplyr::mutate(rn = case_when(is.na(HouseholdIncome) & rn2 == 1 & rn == min(rdf$rn) ~ rn,
                                 is.na(HouseholdIncome) & rn2 == 1 ~ rn - 1,
                                 is.na(HouseholdIncome) & rn < max(rdf$rn) ~ rn + 1,
                                 TRUE ~ rn)) %>%
    dplyr::left_join(rdf, by = "rn") %>%
    select(Town_ID, HouseholdIncome = HouseholdIncome_list)

# A tibble: 9 x 2
  Town_ID HouseholdIncome
  <chr>             <dbl>
1 A                 10000
2 A                 40000
3 A                 50000
4 A                 60000
5 B                 20000
6 B                 40000
7 B                 50000
8 B                 50000
9 B                 60000

I would "cheat" a bit using tidyverse. Clearly, the household income is in intervals of 10,000, and we can therefore utilise this:

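library(dplyr)
library(tidyr)

df %>%
  # 10000 where the income is missing, 0 otherwise
  mutate(is_na = as.numeric(is.na(HouseholdIncome)) * 10000) %>%
  # carry the last observed income forward into the NA rows
  fill(HouseholdIncome, .direction = "down") %>%
  # add the increment to the filled rows, then drop the helper column
  mutate(
    HouseholdIncome = HouseholdIncome + is_na,
    is_na = NULL
  )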

First we flag each NA: is_na is 1 * 10000 where HouseholdIncome is NA and 0 otherwise. Then we use fill to carry the last observed value forward.

In the end we add our cheater variable is_na to HouseholdIncome to get the next HouseholdIncome interval.

The output is the following:

  Town_ID HouseholdIncome
1       A           10000
2       A           40000
3       A           50000
4       A           60000
5       B           20000
6       B           40000
7       B           50000
8       B           50000
9       B           60000
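
The shortcut above only matches "the next value in the list" when consecutive list entries differ by exactly 10000. Also, fill() is applied there without grouping by Town_ID, so a town starting with NA would borrow the previous town's value. A quick check of the spacing assumption could look like this sketch (uneven_list is a made-up counter-example, not data from the question):

```r
HouseholdIncome_list <- c(10000, 20000, 30000, 40000, 50000, 60000, 70000)

# TRUE only if every consecutive pair of entries differs by exactly 10000
evenly_spaced <- all(diff(HouseholdIncome_list) == 10000)
evenly_spaced

# Made-up list with uneven steps: adding a fixed 10000 would not give the next entry
uneven_list <- c(10000, 25000, 40000)
uneven_ok <- all(diff(uneven_list) == 10000)
uneven_ok
```

If the check is FALSE, a lookup against the list itself is the safer route.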

A possible base R option:

transform(
    df,
    HouseholdIncome = ave(
        HouseholdIncome,
        Town_ID,
        FUN = function(x) replace(x, is.na(x), x[min(which(is.na(x))) - 1] + 1e4)
    )
)

gives

  Town_ID HouseholdIncome
1       A           10000
2       A           40000
3       A           50000
4       A           60000
5       B           20000
6       B           40000
7       B           50000
8       B           50000
9       B           60000
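
One thing worth checking before reusing the function above on other data: it computes a single replacement from the value just before the first NA in each group and writes it into every NA of that group. The sketch below shows this on a made-up vector with two separate gaps. The name fill_first_gap is introduced here only for illustration.

```r
# The per-group function from the answer above, given a name for illustration
fill_first_gap <- function(x) {
  replace(x, is.na(x), x[min(which(is.na(x))) - 1] + 1e4)
}

# Made-up vector with two separate NA gaps
x <- c(10000, NA, 40000, NA)

# Both NAs are replaced with 10000 + 1e4, including the one that follows 40000
result <- fill_first_gap(x)
result
```

For data where a group can contain more than one gap, the replacement would need to be computed per gap instead.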
