
R - Recode NA with levels of a factor in grouped data

I have a data frame with a longitudinal structure as follows:

df = structure(list(oslaua = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 
 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L), .Label = c("E06000001", "E06000002", 
 "E06000003", "E06000004"), class = "factor"), wave = structure(c(1L, 
 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L), .Label = c("0", 
 "1", "2", "3"), class = "factor"), old.la = structure(c(1L, 1L, 
 NA, 1L, 2L, 2L, 2L, NA, 3L, 3L, 3L, 3L, 4L, 4L, NA), .Label = c("00EB", 
 "00EC", "00EE", "00EF"), class = "factor"), la = structure(c(1L, 
 1L, NA, 1L, 2L, 2L, 2L, NA, 3L, 3L, 3L, 3L, 4L, 4L, NA), .Label = c("Hartlepool UA", 
 "Middlesbrough UA", "Redcar and Cleveland UA", "Stockton-on-Tees UA"
 ), class = "factor"), dclg.code = structure(c(1L, 1L, NA, 1L, 
 4L, 4L, 4L, NA, 3L, 3L, 3L, 3L, 2L, 2L, NA), .Label = c("H0724", 
 "H0738", "V0728", "W0734"), class = "factor"), novo_entries = c(24L, 
 4L, 0L, 1L, 35L, 15L, 1L, 0L, 49L, 7L, 2L, 2L, 40L, 14L, 0L)), .Names = c("oslaua", 
 "wave", "old.la", "la", "dclg.code", "novo_entries"), row.names = c(NA, 
 15L), class = "data.frame")

My identifier variable is oslaua and my time variable is wave. old.la, la and dclg.code are factor variables that contain NAs. My goal is to replace each NA with the level of that variable associated with the same identifier (oslaua). I tried this for old.la using the following:

df = df %>% group_by(oslaua) %>% mutate(old.la.1 = ifelse(is.na(old.la), unique(old.la), old.la)) %>% as.data.frame()

This partially works, but there are some problems, as you can see:

> df
      oslaua wave old.la                      la dclg.code novo_entries old.la.1
1  E06000001    0   00EB           Hartlepool UA     H0724           24        1
2  E06000001    1   00EB           Hartlepool UA     H0724            4        1
3  E06000001    2   <NA>                    <NA>      <NA>            0        2
4  E06000001    3   00EB           Hartlepool UA     H0724            1        1
5  E06000002    0   00EC        Middlesbrough UA     W0734           35        2
6  E06000002    1   00EC        Middlesbrough UA     W0734           15        2
7  E06000002    2   00EC        Middlesbrough UA     W0734            1        2
8  E06000002    3   <NA>                    <NA>      <NA>            0        2
9  E06000003    0   00EE Redcar and Cleveland UA     V0728           49        3
10 E06000003    1   00EE Redcar and Cleveland UA     V0728            7        3
11 E06000003    2   00EE Redcar and Cleveland UA     V0728            2        3
12 E06000003    3   00EE Redcar and Cleveland UA     V0728            2        3
13 E06000004    0   00EF     Stockton-on-Tees UA     H0738           40        4
14 E06000004    1   00EF     Stockton-on-Tees UA     H0738           14        4
15 E06000004    2   <NA>                    <NA>      <NA>            0        4

Concretely, the factor levels are replaced by integer codes, and in some cases the observations are recoded wrongly (e.g. oslaua = E06000001, row 3).

I do not understand why the levels change format, or how I could keep their original (alphanumeric) format. I also do not understand why some observations are not recoded properly.

Any suggestions for addressing these issues would be much appreciated.

Thanks!

Here is another option using data.table

library(data.table)
setDT(df)[, old.la1 := levels(droplevels(old.la)), by = oslaua]

For multiple columns

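nm1 <- c("old.la", "la", "dclg.code")
# One row per oslaua holding, for each column, the first level still in use
# in that group; then join that lookup back onto df by oslaua
df1 <- setDT(df)[, lapply(.SD, function(x) levels(droplevels(x))[1]),
       by = oslaua, .SDcols = nm1][df, on = "oslaua"]
# Drop the original (NA-containing) columns, which the join prefixes with "i."
df1[, !grepl("i\\.", names(df1)), with = FALSE]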

Our initial idea was

setDT(df)[, (nm1) := lapply(.SD, function(x) 
     factor(levels(droplevels(x)))) , by = oslaua, .SDcols = nm1]

However, converting to a factor within each group gave unexpected output, with only a single level for each column in the result (using v1.10.0).
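The idea behind the data.table code, looking up the single level in use within each oslaua and writing it into the missing rows, does not depend on data.table itself. As a rough base R sketch (fill_group_level and the toy data frame are hypothetical names made up for illustration, not part of the original answer), assuming each group has at most one distinct non-missing level:

```r
# Hypothetical helper: fill the NAs in factor `x` with the only level
# observed in the same group. Groups with zero or several distinct
# levels are left untouched.
fill_group_level <- function(x, group) {
  for (idx in split(seq_along(x), group)) {
    present <- unique(as.character(x[idx][!is.na(x[idx])]))
    if (length(present) == 1L) {
      # Assigning a character value into a factor keeps its class and levels
      x[idx[is.na(x[idx])]] <- present
    }
  }
  x
}

# Toy data (made up); group "B" starts with an NA on purpose
toy <- data.frame(
  oslaua = c("A", "A", "A", "B", "B"),
  old.la = factor(c("00EB", NA, "00EB", NA, "00EC")),
  la     = factor(c("Hartlepool UA", "Hartlepool UA", NA, NA, "Middlesbrough UA"))
)

cols <- c("old.la", "la")
toy[cols] <- lapply(toy[cols], fill_group_level, group = toy$oslaua)
toy
```

Because the helper looks at the whole group rather than at the preceding row, it does not matter whether the NA comes first or last within an oslaua.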

This should work for you:

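library(dplyr)
library(zoo)

# na.rm = FALSE keeps a leading NA in place, so each group keeps its length
df %>%
  group_by(oslaua) %>%
  mutate(old.la.1 = na.locf(old.la, na.rm = FALSE))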

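It uses zoo's na.locf() (last observation carried forward) to replace the NAs, and it preserves the factor type. In your code, ifelse builds two vectors (one for the case where the test is TRUE, the other for where it is FALSE) and returns a plain vector shaped like the test, dropping the attributes of both inputs. A factor stripped of its attributes is just its integer codes (run typeof(df$old.la) to see the underlying type). In addition, unique(old.la) keeps NA as one of its values, so it has more than one element and ifelse recycles it by position instead of picking the group's level, which is how a row can end up with the wrong code.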
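The behaviour of ifelse described above can be reproduced without any packages. The snippet below is only an illustrative sketch on a made-up four-element factor (x, codes and filled are names chosen for this example), not the asker's full data:

```r
# One group's worth of data: a factor with a missing value
x <- factor(c("00EB", "00EB", NA, "00EB"), levels = c("00EB", "00EC"))

# ifelse() drops the factor attributes, so integer codes come back
codes <- ifelse(is.na(x), unique(x), x)
typeof(codes)      # "integer"
is.factor(codes)   # FALSE

# unique() keeps NA as a value, so the `yes` argument has two elements
length(unique(x))  # 2

# Assigning into the factor directly keeps its class and levels
filled <- x
filled[is.na(filled)] <- unique(x[!is.na(x)])[1]
filled
```

Replacing the missing entries by assignment, rather than through ifelse, is one way to keep the original alphanumeric labels.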

Alternatively, a more elegant solution that avoids creating new columns is to use fill() from tidyr:

library(dplyr)
library(tidyr)

data = df %>% group_by(oslaua) %>% fill(old.la, la, dclg.code)
data

Which yields:

> data
Source: local data frame [15 x 6]
Groups: oslaua [4]

      oslaua   wave old.la                      la dclg.code novo_entries
      <fctr> <fctr> <fctr>                  <fctr>    <fctr>        <int>
1  E06000001      0   00EB           Hartlepool UA     H0724           24
2  E06000001      1   00EB           Hartlepool UA     H0724            4
3  E06000001      2   00EB           Hartlepool UA     H0724            0
4  E06000001      3   00EB           Hartlepool UA     H0724            1
5  E06000002      0   00EC        Middlesbrough UA     W0734           35
6  E06000002      1   00EC        Middlesbrough UA     W0734           15
7  E06000002      2   00EC        Middlesbrough UA     W0734            1
8  E06000002      3   00EC        Middlesbrough UA     W0734            0
9  E06000003      0   00EE Redcar and Cleveland UA     V0728           49
10 E06000003      1   00EE Redcar and Cleveland UA     V0728            7
11 E06000003      2   00EE Redcar and Cleveland UA     V0728            2
12 E06000003      3   00EE Redcar and Cleveland UA     V0728            2
13 E06000004      0   00EF     Stockton-on-Tees UA     H0738           40
14 E06000004      1   00EF     Stockton-on-Tees UA     H0738           14
15 E06000004      2   00EF     Stockton-on-Tees UA     H0738            0
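
Both na.locf() and fill() (in its default downward direction) carry the last observed value forward, so an NA at the start of a group has nothing to inherit and stays missing. The following base R sketch (locf is a hypothetical helper written for illustration, not the zoo or tidyr implementation) shows the mechanism on a single made-up group:

```r
# Hypothetical helper: last observation carried forward for one vector
locf <- function(x) {
  seen <- cumsum(!is.na(x))   # number of non-missing values seen so far
  seen[seen == 0L] <- NA      # nothing observed yet: stay missing
  x[!is.na(x)][seen]          # pick the most recent non-missing value
}

x <- factor(c(NA, "00EF", NA, "00EF", NA))
filled <- locf(x)
filled
```

In the example data every NA follows an observed value within its oslaua, so carrying forward is sufficient there.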

