I have a data frame with a longitudinal structure as follows:
df = structure(list(oslaua = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L), .Label = c("E06000001", "E06000002",
"E06000003", "E06000004"), class = "factor"), wave = structure(c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L), .Label = c("0",
"1", "2", "3"), class = "factor"), old.la = structure(c(1L, 1L,
NA, 1L, 2L, 2L, 2L, NA, 3L, 3L, 3L, 3L, 4L, 4L, NA), .Label = c("00EB",
"00EC", "00EE", "00EF"), class = "factor"), la = structure(c(1L,
1L, NA, 1L, 2L, 2L, 2L, NA, 3L, 3L, 3L, 3L, 4L, 4L, NA), .Label = c("Hartlepool UA",
"Middlesbrough UA", "Redcar and Cleveland UA", "Stockton-on-Tees UA"
), class = "factor"), dclg.code = structure(c(1L, 1L, NA, 1L,
4L, 4L, 4L, NA, 3L, 3L, 3L, 3L, 2L, 2L, NA), .Label = c("H0724",
"H0738", "V0728", "W0734"), class = "factor"), novo_entries = c(24L,
4L, 0L, 1L, 35L, 15L, 1L, 0L, 49L, 7L, 2L, 2L, 40L, 14L, 0L)), .Names = c("oslaua",
"wave", "old.la", "la", "dclg.code", "novo_entries"), row.names = c(NA,
15L), class = "data.frame")
My identifier variable is oslaua and my time variable is wave. old.la, la and dclg.code are factor variables that contain NA. My goal is to replace each NA with the level of that variable associated with the identifier (oslaua). I have tried to do this for old.la using the following:
df = df %>% group_by(oslaua) %>% mutate(old.la.1 = ifelse(is.na(old.la), unique(old.la), old.la)) %>% as.data.frame()
This partially achieves what I want, but there are some issues, as you can see:
> df
oslaua wave old.la la dclg.code novo_entries old.la.1
1 E06000001 0 00EB Hartlepool UA H0724 24 1
2 E06000001 1 00EB Hartlepool UA H0724 4 1
3 E06000001 2 <NA> <NA> <NA> 0 2
4 E06000001 3 00EB Hartlepool UA H0724 1 1
5 E06000002 0 00EC Middlesbrough UA W0734 35 2
6 E06000002 1 00EC Middlesbrough UA W0734 15 2
7 E06000002 2 00EC Middlesbrough UA W0734 1 2
8 E06000002 3 <NA> <NA> <NA> 0 2
9 E06000003 0 00EE Redcar and Cleveland UA V0728 49 3
10 E06000003 1 00EE Redcar and Cleveland UA V0728 7 3
11 E06000003 2 00EE Redcar and Cleveland UA V0728 2 3
12 E06000003 3 00EE Redcar and Cleveland UA V0728 2 3
13 E06000004 0 00EF Stockton-on-Tees UA H0738 40 4
14 E06000004 1 00EF Stockton-on-Tees UA H0738 14 4
15 E06000004 2 <NA> <NA> <NA> 0 4
Concretely, the factor levels change their format, and in some cases the observations are recoded wrongly (e.g. oslaua = E06000001, row 3).
I do not understand why the levels change their format or how I could keep their original (alphanumeric) format. I also do not understand why some observations are not recoded properly.
Any suggestions to address these issues would be really appreciated.
Thanks!
Here is another option using data.table:
library(data.table)
setDT(df)[, old.la1 := levels(droplevels(old.la)), by = oslaua]
For multiple columns:
nm1 <- c("old.la", "la", "dclg.code")
df1 <- setDT(df)[, lapply(.SD, function(x) levels(droplevels(x))[1]) ,
by = oslaua, .SDcols = nm1][df, on = "oslaua"]
df1[, !grepl("i\\.", names(df1)), with = FALSE]
Our initial idea was
setDT(df)[, (nm1) := lapply(.SD, function(x)
factor(levels(droplevels(x)))) , by = oslaua, .SDcols = nm1]
But for some reason, converting to factor within each group gives odd output, with only a single level for each column in the result (using v1.10.0).
This should work for you:
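library(dplyr)
library(zoo)
df %>%
group_by(oslaua) %>%
# na.rm = FALSE keeps a leading NA in place so each group keeps its length
mutate(old.la.1 = na.locf(old.la, na.rm = FALSE))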
It uses zoo's last-observation-carried-forward function, na.locf(), to replace the NAs, and it is type safe. In your code, ifelse() constructs two vectors (one for the cases where the test resolves to TRUE, the other for the cases where it resolves to FALSE). To ensure compatibility, it seems that ifelse() reduces each of those to the most basic common type. In the case of factors, this is an integer (run typeof(df$old.la)).
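To illustrate the point about ifelse() and factors, here is a minimal sketch on a toy factor (the vector f and the replacement value "00EB" are made up for the example; they are not taken from the question's df). It shows the factor coming back as integer codes, and one possible workaround that goes through character before rebuilding the factor.

```r
# Toy factor with one missing value (hypothetical data)
f <- factor(c("00EB", NA, "00EB"))
typeof(f)  # "integer": a factor is stored as integer codes plus a levels attribute

# ifelse() drops the factor class, so the result holds the integer codes
res <- ifelse(is.na(f), NA, f)
is.integer(res)

# One possible workaround: work on the labels, then rebuild the factor
ch <- as.character(f)
fixed <- factor(ifelse(is.na(ch), "00EB", ch), levels = levels(f))
levels(fixed)
```

Working on as.character(f) keeps the alphanumeric labels visible throughout, at the cost of an explicit factor() call at the end.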
Alternatively, a more elegant solution that avoids creating new variables is to use fill() from tidyr (here the data frame from the question is assumed to be stored in data):
data = data %>% group_by(oslaua) %>% fill(old.la, la, dclg.code)
data
Which yields:
> data
Source: local data frame [15 x 6]
Groups: oslaua [4]
oslaua wave old.la la dclg.code novo_entries
<fctr> <fctr> <fctr> <fctr> <fctr> <int>
1 E06000001 0 00EB Hartlepool UA H0724 24
2 E06000001 1 00EB Hartlepool UA H0724 4
3 E06000001 2 00EB Hartlepool UA H0724 0
4 E06000001 3 00EB Hartlepool UA H0724 1
5 E06000002 0 00EC Middlesbrough UA W0734 35
6 E06000002 1 00EC Middlesbrough UA W0734 15
7 E06000002 2 00EC Middlesbrough UA W0734 1
8 E06000002 3 00EC Middlesbrough UA W0734 0
9 E06000003 0 00EE Redcar and Cleveland UA V0728 49
10 E06000003 1 00EE Redcar and Cleveland UA V0728 7
11 E06000003 2 00EE Redcar and Cleveland UA V0728 2
12 E06000003 3 00EE Redcar and Cleveland UA V0728 2
13 E06000004 0 00EF Stockton-on-Tees UA H0738 40
14 E06000004 1 00EF Stockton-on-Tees UA H0738 14
15 E06000004 2 00EF Stockton-on-Tees UA H0738 0
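By default fill() only fills downwards, so an NA in the first row of a group would be left as it is. That does not happen in the data above, but if it could in yours, a sketch along the following lines may help. It assumes a tidyr version that supports .direction = "downup" (1.0.0 or later), and the toy data frame and its column names are invented for the illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: group "a" starts with an NA
toy <- data.frame(
  id   = c("a", "a", "a", "b", "b"),
  code = c(NA, "X1", NA, "Y2", NA),
  stringsAsFactors = FALSE
)

# Fill downwards first, then upwards for any NA left at the start of a group
filled <- toy %>%
  group_by(id) %>%
  fill(code, .direction = "downup") %>%
  ungroup()

filled
```

Calling ungroup() at the end is optional; it simply avoids carrying the grouping into later steps.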