Convert some data frame values to NA: the values to convert depend on the column and are given in a separate list

Question

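In a data frame, I want to convert some values to NA. Which values should become NA depends on the column. This per-column specification is given in a separate list object. I want to write a function that accepts:

a data frame to be cleaned
a vector specifying the columns to clean
a list specifying which values are valid for each column

and returns a clean data frame in which the unwanted values have become NA. Although such a task can be done with a for loop, I am trying to figure out whether there is a simpler way to iterate. I usually prefer tidyverse solutions, but would be happy with any idea.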
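
For reference, the for-loop baseline mentioned above could look roughly like the sketch below. The names toy_df, toy_valid and clean_with_loop are hypothetical, and the mapping is assumed to have already been reduced to a named list of valid values per column.

```r
# Toy data and mapping (hypothetical, much smaller than the real example)
toy_df <- data.frame(
  id = 1:4,
  political = c("republican", "xyz123", "democrat", "other"),
  stringsAsFactors = FALSE
)
toy_valid <- list(political = c("republican", "democrat"))

# Baseline: loop over the columns to clean and set invalid values to NA
clean_with_loop <- function(data, cols, valid_values) {
  for (col in cols) {
    is_invalid <- !(data[[col]] %in% valid_values[[col]])
    data[[col]][is_invalid] <- NA
  }
  data
}

cleaned <- clean_with_loop(toy_df, "political", toy_valid)
cleaned$political
```

Columns not listed in cols are returned unchanged.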

Example data

In the following dataset, each column has its own set of valid values that should be kept; the rest should become NA.

library(tibble)

set.seed(2020)

## generate random strings: https://stackoverflow.com/a/42734863/6105259
sample_strings <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(letters, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(letters, n, TRUE))
}

df <-
  tibble(id = 1:40,
         color = sample(c(1:5), size = 40, replace = TRUE),
         political = sample(c(sample(c("republican", "democrat", "green_party", "libertarian"), size = 20, replace = TRUE),
                              sample_strings(20))),
         religion_status = sample(c(sample(c("secular", "traditional", "religious", "atheist", "agnostic"), size = 20, replace = TRUE), 
                                    sample_strings(20)))
         )

## # A tibble: 40 x 4
##       id color political  religion_status
##    <int> <int> <chr>      <chr>          
##  1     1     4 republican fzwue3975k     
##  2     2     4 republican mgxoe2220e     
##  3     3     1 democrat   secular        
##  4     4     1 republican secular        
##  5     5     4 aibcg6459y oqnfv1461t     
##  6     6     2 aedqi0739y ufhua9648n     
##  7     7     1 zgvox0771x agnostic       
##  8     8     5 democrat   traditional    
##  9     9     2 republican atheist        
## 10    10     2 oxgge5657l nktsl2136o     
## # ... with 30 more rows

The following list specifies which values should be kept in each column:

var_mapping_list <- list(preferences = list(age = list(originType = "NumberQuestionPage", 
    originIndex = 6L, title = "what is your age?", valueDescriptions = NULL), 
    political = list(originType = "QuestionPage", originIndex = 7L, 
        title = "what is your political affiliation?", valueDescriptions = list(
            republican = "I have voted most of my life to the republican party", 
            democrat = "I have voted most of my life to the democratic party", 
            other = "other")), religion_status = list(originType = "QuestionPage", 
        originIndex = 9L, title = "how do you define your religiousness level? ", 
        valueDescriptions = list(secular = "I don't practice any religion although I do belong to one", 
            traditional = "I'm observant and keep some of the practices", 
            religious = "I practice a religion", other = "other")), 
    color = list(originType = "QuestionPage", title = "which color do you like the best", 
        valueDescriptions = list(`1` = "red", `2` = "blue", `3` = "yellow", 
            `4` = "pink", `5` = "orange")), pet = list(originType = "QuestionPage", 
        originIndex = 0L, title = "do you have a pet? ", valueDescriptions = list(
            yes = "yes", no = "no"))))

Example for one variable

Suppose I want to clean df$political. To find out which values to keep, I first go to:

var_mapping_list$preferences$political$valueDescriptions

## $republican
## [1] "I have voted most of my life to the republican party"

## $democrat
## [1] "I have voted most of my life to the democratic party"

## $other
## [1] "other"

My rule is that every option except other is a valid value for the corresponding column in df.

So in df$political, only republican and democrat should be kept, and the rest should become NA.
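
As a minimal sketch of this rule, the valid values are the names of valueDescriptions minus "other". The object political_spec below is a hypothetical miniature of var_mapping_list$preferences$political, not the full list.

```r
# Hypothetical miniature of var_mapping_list$preferences$political
political_spec <- list(
  valueDescriptions = list(
    republican = "I have voted most of my life to the republican party",
    democrat   = "I have voted most of my life to the democratic party",
    other      = "other"
  )
)

# The option keys are the names of valueDescriptions; drop "other"
valid_political <- setdiff(names(political_spec$valueDescriptions), "other")
valid_political
```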

An example workflow for df$political alone would therefore be:

library(tidyr)
library(rlang)
library(dplyr)

vec_political_values_to_keep <-
  var_mapping_list$preferences$political$valueDescriptions %>%
  bind_rows %>%
  pivot_longer(cols = tidyselect::everything(), 
               names_to = "option_key", 
               values_to = "description") %>%
  filter(option_key != "other") %>%
  pull(option_key)
 

df %>% 
  mutate(political = recode(political, !!!rlang::set_names(vec_political_values_to_keep), .default = NA_character_)) ## https://stackoverflow.com/a/63916563/6105259


## # A tibble: 40 x 4
##       id color political  religion_status
##    <int> <int> <chr>      <chr>          
##  1     1     4 republican fzwue3975k     
##  2     2     4 republican mgxoe2220e     
##  3     3     1 democrat   secular        
##  4     4     1 republican secular        
##  5     5     4 NA         oqnfv1461t     
##  6     6     2 NA         ufhua9648n     
##  7     7     1 NA         agnostic       
##  8     8     5 democrat   traditional    
##  9     9     2 republican atheist        
## 10    10     2 NA         nktsl2136o
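
The same single-column result can probably be reached without recode, by testing membership directly. This is only a sketch on hypothetical toy data (toy_df and valid_political are illustrative names), assuming the valid values are already available as a character vector.

```r
library(dplyr)

# Hypothetical toy data standing in for df
toy_df <- tibble::tibble(
  id = 1:4,
  political = c("republican", "aibcg6459y", "democrat", "zgvox0771x")
)
valid_political <- c("republican", "democrat")

# Keep a value if it is valid, otherwise replace it with NA
cleaned_single <- toy_df %>%
  mutate(political = if_else(political %in% valid_political,
                             political, NA_character_))
cleaned_single
```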

I would like to extend the above to any variable of interest in df.

Desired Output

Given the vector

colnames_to_clean <- c("color", "political", "religion_status")

[1] "color"           "political"       "religion_status"

the following data frame should be returned:

##       id color political  religion_status
##    <int> <int> <chr>      <chr>          
##  1     1     4 republican NA             
##  2     2     4 republican NA             
##  3     3     1 democrat   secular        
##  4     4     1 republican secular        
##  5     5     4 NA         NA             
##  6     6     2 NA         NA             
##  7     7     1 NA         NA             
##  8     8     5 democrat   traditional    
##  9     9     2 republican NA             
## 10    10     2 NA         NA

I would appreciate any help with this!

Answer 1

Here is one possibility. First, put the valid values in a tibble.

library(dplyr)
library(purrr)

# One row per variable, with a list-column holding its option keys
new_list <- tibble(
  name = names(var_mapping_list$preferences),
  x = var_mapping_list$preferences
) %>%
  mutate(all_vals = map(x, ~ names(.x$valueDescriptions))) %>%
  select(-x)
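
If a tibble is not needed, the same lookup can be sketched as a plain named list, which already drops "other". The object prefs below is a hypothetical miniature of var_mapping_list$preferences with shortened descriptions.

```r
library(purrr)

# Hypothetical miniature of var_mapping_list$preferences
prefs <- list(
  political = list(valueDescriptions = list(
    republican = "votes republican",
    democrat   = "votes democrat",
    other      = "other"
  )),
  color = list(valueDescriptions = list(`1` = "red", `2` = "blue"))
)

# Named list: column name -> character vector of valid option keys
valid_values <- map(prefs, ~ setdiff(names(.x$valueDescriptions), "other"))
valid_values$political
valid_values$color  # keys are character ("1", "2"), even for a numeric column
```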

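The advantage is that the valid values are now easy to work with in the tidyverse. Second, join the valid values and check whether each current value is valid: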

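df %>%
  # long format; mixing integer and character columns coerces val to character
  gather(name, val, -id) %>%
  left_join(new_list, by = "name") %>%
  group_by(name) %>%
  # keep a value only if it is one of the column's option keys other than "other"
  mutate(val = map2_chr(val, all_vals, ~if_else(.x %in% setdiff(.y, "other"), .x, NA_character_))) %>%
  select(-all_vals) %>%
  # back to wide format
  spread(name, val)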

# A tibble: 40 x 4
      id color political  religion_status
   <int> <chr> <chr>      <chr>          
 1     1 4     republican NA             
 2     2 4     republican NA             
 3     3 1     democrat   secular        
 4     4 1     republican secular        
 5     5 4     NA         NA             
 6     6 2     NA         NA             
 7     7 1     NA         NA             
 8     8 5     democrat   traditional    
 9     9 2     republican NA             
10    10 2     NA         NA             
# ... with 30 more rows
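
Note that color comes back as character in the output above, because gather stacks integer and character columns into one column. A possible alternative that keeps the original column types is sketched below with across() and cur_column(), assuming dplyr 1.0.0 or later. The function clean_columns and the toy objects are hypothetical names, and the mapping is assumed to be a named list of valid values per column.

```r
library(dplyr)

# Replace every value that is not listed as valid for its column with NA
clean_columns <- function(data, cols, valid_values) {
  data %>%
    mutate(across(
      all_of(cols),
      ~ replace(.x, !(.x %in% valid_values[[cur_column()]]), NA)
    ))
}

# Hypothetical toy data and mapping
toy_df <- tibble::tibble(
  id = 1:4,
  color = c(1L, 7L, 3L, 5L),
  political = c("republican", "aibcg6459y", "democrat", "other")
)
toy_valid <- list(
  color = c("1", "2", "3", "4", "5"),
  political = c("republican", "democrat")
)

cleaned <- clean_columns(toy_df, c("color", "political"), toy_valid)
cleaned
```

Here %in% compares the integer color values against the character keys after coercion, so color stays an integer column.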

Solution 1
1 Accepted 2020-12-16 17:50:24
