简体   繁体   中英

in R replace value with most common value

I have a dataframe with time and a transport mode. There are some values, that are classified in a walking part as "unknown" or false classified as an other transport mode, for example "car"

Here is the structure

travel <- structure(list(time = structure(1:19, .Label = c("16:20:20", 
"16:20:21", "16:20:22", "16:20:23", "16:20:24", "16:20:25", "16:20:26", 
"16:20:27", "16:20:28", "16:20:29", "16:20:30", "16:20:31", "16:20:32", 
"16:20:33", "16:20:34", "16:20:35", "16:20:36", "16:20:37", "16:20:38"
), class = "factor"), mode = structure(c(3L, 3L, 3L, 2L, 2L, 
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L), .Label = c("car", 
"bus stop", "walk"), class = "factor")), class = "data.frame", row.names = c(NA, 
-19L))

Here you can see the dataframe

    time        mode
1   16:20:20    walk
2   16:20:21    walk
3   16:20:22    walk
4   16:20:23    bus stop
5   16:20:24    bus stop
6   16:20:25    bus stop
7   16:20:26    bus stop
8   16:20:27    bus stop
9   16:20:28    walk
10  16:20:29    walk
11  16:20:30    walk
12  16:20:31    walk
13  16:20:32    walk
14  16:20:33    walk
15  16:20:34    car
16  16:20:35    car
17  16:20:36    car
18  16:20:37    car
19  16:20:38    car

I'd like to replace the values that are not correct. In the data example, the observations 4,5 and 7 have to be classified as "walk".

I think of a mutate function, that looks at the pre and post values ob the observations. if before and after the "unknown" values the value is the same mode (for examnple: walk) then the unknown values are classified as "walk". Also the observation 7 has somehow be classified as "walk".

thx for your help!

Changed your sample data a bit.

travel <- structure(list(time = structure(1:19, .Label = c("16:20:20", 
                                                           "16:20:21", "16:20:22", "16:20:23", "16:20:24", "16:20:25", "16:20:26", 
                                                           "16:20:27", "16:20:28", "16:20:29", "16:20:30", "16:20:31", "16:20:32", 
                                                           "16:20:33", "16:20:34", "16:20:35", "16:20:36", "16:20:37", "16:20:38"
), class = "factor"), mode = structure(c(3L, 3L, 3L, 2L, 2L, 
                                         3L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("car", 
                                                                                                             "unknown", "walk"), class = "factor")), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                         -19L))
travel

       time    mode
1  16:20:20    walk
2  16:20:21    walk
3  16:20:22    walk
4  16:20:23 unknown
5  16:20:24 unknown
6  16:20:25    walk
7  16:20:26     car
8  16:20:27    walk
9  16:20:28    walk
10 16:20:29    walk
11 16:20:30    walk
12 16:20:31    walk
13 16:20:32    walk
14 16:20:33 unknown
15 16:20:34     car
16 16:20:35     car
17 16:20:36     car
18 16:20:37     car
19 16:20:38     car

Now use this syntax, for a maximum of two discrepancies

library(data.table)
travel %>% group_by(d = rleid(mode)) %>%
  mutate(d2 = n()) %>% ungroup() %>%
  mutate(mode = case_when( d2 == 1 & lag(mode) == lead(mode) ~ lag(mode),
                            d2 == 2 & first(mode[d]) == first(mode[d-1]) ~ first(mode[d-1]),
                            TRUE ~ mode)) %>%
  select(-d, -d2)

# A tibble: 19 x 2
   time     mode   
   <fct>    <fct>  
 1 16:20:20 walk   
 2 16:20:21 walk   
 3 16:20:22 walk   
 4 16:20:23 walk   
 5 16:20:24 walk   
 6 16:20:25 walk   
 7 16:20:26 walk   
 8 16:20:27 walk   
 9 16:20:28 walk   
10 16:20:29 walk   
11 16:20:30 walk   
12 16:20:31 walk   
13 16:20:32 walk   
14 16:20:33 unknown
15 16:20:34 car    
16 16:20:35 car    
17 16:20:36 car    
18 16:20:37 car    
19 16:20:38 car

The first step could be to change the mode "unknown" to NA and use zoo::na.locf to fill in with the preceding non-NA values, if that is what you want.

Secondly, you could for example define a single occurrence of another mode as something you would like to replace, again using the preceding known value. You can use rle to find such cases.

df <- read.table(text="    time        mode
1   16:20:20    walk
2   16:20:21    walk
3   16:20:22    walk
4   16:20:23    unknown
5   16:20:24    unknown
6   16:20:25    walk
7   16:20:26    car
8   16:20:27    walk
9   16:20:28    walk
10  16:20:29    walk
11  16:20:30    walk
12  16:20:31    walk
13  16:20:32    walk
14  16:20:33    walk
15  16:20:34    car
16  16:20:35    car
17  16:20:36    car
18  16:20:37    car
19  16:20:38    car
", header=T)

library(zoo)
df$mode[df$mode=="unknown"] <- NA
df$mode <- na.locf(df$mode)
rdf <- rle(df$mode)
rdf$values[which(rdf$lengths==1)] <- NA
df$mode <- na.locf(inverse.rle(rdf))
df
#>        time mode
#> 1  16:20:20 walk
#> 2  16:20:21 walk
#> 3  16:20:22 walk
#> 4  16:20:23 walk
#> 5  16:20:24 walk
#> 6  16:20:25 walk
#> 7  16:20:26 walk
#> 8  16:20:27 walk
#> 9  16:20:28 walk
#> 10 16:20:29 walk
#> 11 16:20:30 walk
#> 12 16:20:31 walk
#> 13 16:20:32 walk
#> 14 16:20:33 walk
#> 15 16:20:34  car
#> 16 16:20:35  car
#> 17 16:20:36  car
#> 18 16:20:37  car
#> 19 16:20:38  car

Created on 2021-03-23 by the reprex package (v1.0.0)

Or, more succinctly, combining the two conditions upfront:

library(zoo)
rdf <- rle(df$mode)
rdf$values[with(rdf, lengths==1 | values=="unknown")] <- NA
df$mode <- na.locf(inverse.rle(rdf))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM