
Recode dataframe values to NA per column

How can I recode data frame values to NA when they do not appear in a separate vector?

More specifically, how should I approach this task when:

  • each data column to clean has its own set of "valid" values to keep, independent of the other columns
  • the column-specific values are given in a separate table (as vectors nested in a list-column of a tibble)

Example

  • My data to clean up is my_mtcars.
  • I want to clean up certain columns (cars, gear, and carb).
  • In each of those columns, I want to keep only the values specified in a separate table, table_valid_values, under valid_values. Any value not specified as "valid" should become NA.
  • For any column of my_mtcars that does not appear in table_valid_values, no cleanup is needed.
library(tibble)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union


my_mtcars <- rownames_to_column(mtcars, "cars")

as_tibble(my_mtcars)
#> # A tibble: 32 x 12
#>    cars          mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2 Mazda RX4 ~  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3 Datsun 710   22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4 Hornet 4 D~  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5 Hornet Spo~  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6 Valiant      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8 Merc 240D    24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9 Merc 230     22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10 Merc 280     19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ... with 22 more rows


table_valid_values <-
  structure(
    list(
      var_name = c("cars", "gear", "carb"),
      valid_values = list(
        c("Valiant", "AMC Javelin", "Ferrari Dino"),
        c(3, 5),
        c(1, 4, 6)
      )
    ),
    row.names = c(NA, -3L),
    class = c("tbl_df", "tbl", "data.frame")
  )


table_valid_values
#> # A tibble: 3 x 2
#>   var_name valid_values
#>   <chr>    <list>      
#> 1 cars     <chr [3]>   
#> 2 gear     <dbl [2]>   
#> 3 carb     <dbl [3]>

table_valid_values %>%
  pull(valid_values)
#> [[1]]
#> [1] "Valiant"      "AMC Javelin"  "Ferrari Dino"
#> 
#> [[2]]
#> [1] 3 5
#> 
#> [[3]]
#> [1] 1 4 6

Created on 2021-01-27 by the reprex package (v0.3.0)

Desired Output

Provided with only table_valid_values, how can I clean up my_mtcars to get the following:

##    cars           mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 NA            21       6 160     110  3.9   2.62  16.5     0     1    NA     4
##  2 NA            21       6 160     110  3.9   2.88  17.0     0     1    NA     4
##  3 NA            22.8     4 108      93  3.85  2.32  18.6     1     1    NA     1
##  4 NA            21.4     6 258     110  3.08  3.22  19.4     1     0     3     1
##  5 NA            18.7     8 360     175  3.15  3.44  17.0     0     0     3    NA
##  6 Valiant       18.1     6 225     105  2.76  3.46  20.2     1     0     3     1
##  7 NA            14.3     8 360     245  3.21  3.57  15.8     0     0     3     4
##  8 NA            24.4     4 147.     62  3.69  3.19  20       1     0    NA    NA
##  9 NA            22.8     4 141.     95  3.92  3.15  22.9     1     0    NA    NA
## 10 NA            19.2     6 168.    123  3.92  3.44  18.3     1     0    NA     4
## 11 NA            17.8     6 168.    123  3.92  3.44  18.9     1     0    NA     4
## 12 NA            16.4     8 276.    180  3.07  4.07  17.4     0     0     3    NA
## 13 NA            17.3     8 276.    180  3.07  3.73  17.6     0     0     3    NA
## 14 NA            15.2     8 276.    180  3.07  3.78  18       0     0     3    NA
## 15 NA            10.4     8 472     205  2.93  5.25  18.0     0     0     3     4
## 16 NA            10.4     8 460     215  3     5.42  17.8     0     0     3     4
## 17 NA            14.7     8 440     230  3.23  5.34  17.4     0     0     3     4
## 18 NA            32.4     4  78.7    66  4.08  2.2   19.5     1     1    NA     1
## 19 NA            30.4     4  75.7    52  4.93  1.62  18.5     1     1    NA    NA
## 20 NA            33.9     4  71.1    65  4.22  1.84  19.9     1     1    NA     1
## 21 NA            21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1
## 22 NA            15.5     8 318     150  2.76  3.52  16.9     0     0     3    NA
## 23 AMC Javelin   15.2     8 304     150  3.15  3.44  17.3     0     0     3    NA
## 24 NA            13.3     8 350     245  3.73  3.84  15.4     0     0     3     4
## 25 NA            19.2     8 400     175  3.08  3.84  17.0     0     0     3    NA
## 26 NA            27.3     4  79      66  4.08  1.94  18.9     1     1    NA     1
## 27 NA            26       4 120.     91  4.43  2.14  16.7     0     1     5    NA
## 28 NA            30.4     4  95.1   113  3.77  1.51  16.9     1     1     5    NA
## 29 NA            15.8     8 351     264  4.22  3.17  14.5     0     1     5     4
## 30 Ferrari Dino  19.7     6 145     175  3.62  2.77  15.5     0     1     5     6
## 31 NA            15       8 301     335  3.54  3.57  14.6     0     1     5    NA
## 32 NA            21.4     4 121     109  4.11  2.78  18.6     1     1    NA    NA

I also wonder: what if we wanted to replace invalid values with a string of choice (say, "invalid") rather than NA?

With dplyr, you could use mutate() with across(), looking up each column's valid values via cur_column():

library(dplyr)

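my_mtcars %>%
  mutate(across(all_of(table_valid_values$var_name), ~ {
    # Valid values for the column currently being processed
    valid <- table_valid_values$valid_values[[
      match(cur_column(), table_valid_values$var_name)
    ]]
    # Set everything outside the valid set to NA
    replace(.x, !.x %in% valid, NA)
  }))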
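
The per-column lookup can also be done through a named list rather than match(). The following is a minimal sketch of that variation, not part of the original answer: valid_lookup and cleaned are names introduced here for illustration, and the data are rebuilt so the block runs on its own. It assumes dplyr 1.0.0 or later for across() and cur_column().

```r
library(tibble)
library(dplyr)

# Rebuild the example data from the question
my_mtcars <- rownames_to_column(mtcars, "cars")

table_valid_values <- tibble(
  var_name = c("cars", "gear", "carb"),
  valid_values = list(
    c("Valiant", "AMC Javelin", "Ferrari Dino"),
    c(3, 5),
    c(1, 4, 6)
  )
)

# Hypothetical helper: a named list mapping column name -> valid values
valid_lookup <- setNames(
  table_valid_values$valid_values,
  table_valid_values$var_name
)

# For each listed column, keep valid values and set the rest to NA
cleaned <- my_mtcars %>%
  mutate(across(
    all_of(names(valid_lookup)),
    ~ replace(.x, !.x %in% valid_lookup[[cur_column()]], NA)
  ))
```

Indexing the list by name keeps the lambda short, and columns that are absent from valid_lookup are never touched because across() only visits the listed names.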

Similarly, in base R:

my_mtcars[table_valid_values$var_name] <- lapply(
  table_valid_values$var_name,
  function(x) {
    # Valid values for column x
    valid <- table_valid_values$valid_values[[
      match(x, table_valid_values$var_name)
    ]]
    # Set everything outside the valid set to NA
    replace(my_mtcars[[x]], !my_mtcars[[x]] %in% valid, NA)
  }
)

my_mtcars

#           cars  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#1          <NA> 21.0   6 160.0 110 3.90 2.620 16.46  0  1   NA    4
#2          <NA> 21.0   6 160.0 110 3.90 2.875 17.02  0  1   NA    4
#3          <NA> 22.8   4 108.0  93 3.85 2.320 18.61  1  1   NA    1
#4          <NA> 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#5          <NA> 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3   NA
#6       Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#7          <NA> 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#8          <NA> 24.4   4 146.7  62 3.69 3.190 20.00  1  0   NA   NA
#9          <NA> 22.8   4 140.8  95 3.92 3.150 22.90  1  0   NA   NA
#10         <NA> 19.2   6 167.6 123 3.92 3.440 18.30  1  0   NA    4
#11         <NA> 17.8   6 167.6 123 3.92 3.440 18.90  1  0   NA    4
#12         <NA> 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3   NA
#13         <NA> 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3   NA
#14         <NA> 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3   NA
#15         <NA> 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#16         <NA> 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#17         <NA> 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#18         <NA> 32.4   4  78.7  66 4.08 2.200 19.47  1  1   NA    1
#19         <NA> 30.4   4  75.7  52 4.93 1.615 18.52  1  1   NA   NA
#20         <NA> 33.9   4  71.1  65 4.22 1.835 19.90  1  1   NA    1
#21         <NA> 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#22         <NA> 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3   NA
#23  AMC Javelin 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3   NA
#24         <NA> 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#25         <NA> 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3   NA
#26         <NA> 27.3   4  79.0  66 4.08 1.935 18.90  1  1   NA    1
#27         <NA> 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5   NA
#28         <NA> 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5   NA
#29         <NA> 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#30 Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#31         <NA> 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5   NA
#32         <NA> 21.4   4 121.0 109 4.11 2.780 18.60  1  1   NA   NA

To use a different replacement, such as the string "invalid", put it in place of NA in the replace() call.
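
The follow-up in the question, marking invalid values with a string instead of NA, might look like the base R sketch below. The names valid_lookup and flagged are introduced here for illustration only, and the valid values are written out as a plain named list so the block runs on its own. One consequence to be aware of: assigning a string into a numeric column such as gear or carb coerces the whole column to character.

```r
# Rebuild the example data with base R only
my_mtcars <- data.frame(
  cars = rownames(mtcars), mtcars,
  stringsAsFactors = FALSE
)

# Hypothetical named list: column name -> valid values
valid_lookup <- list(
  cars = c("Valiant", "AMC Javelin", "Ferrari Dino"),
  gear = c(3, 5),
  carb = c(1, 4, 6)
)

# Work on a copy so the original data stay intact
flagged <- my_mtcars
flagged[names(valid_lookup)] <- lapply(names(valid_lookup), function(col) {
  x <- flagged[[col]]
  # Assigning a string coerces numeric columns to character
  replace(x, !x %in% valid_lookup[[col]], "invalid")
})

class(flagged$gear)
class(flagged$mpg)
```

If the numeric type must be preserved, NA remains the safer marker; a string flag is better suited to columns that are already character or that will only be displayed.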
