Replace specific chr values within groups for multiple variables in R

1. Summarize the problem

Hi, I'm relatively new to R, and this is my first question on Stack Overflow, although I've been learning from this site for a while. I found similar questions, but they explain how to remove missing values, work with numerical values, or only work for a small number of IDs.

I have a large data frame (200,000+ rows) where one variable is an alphanumeric ID that represents unique candidates and the other variables represent different characteristics. Some candidates are included multiple times in the file but have different values for the same characteristic. I want to resolve these discrepancies so that I can remove duplicates later. The data structure is similar to this:

library(dplyr)  # re-exports tibble() and provides the verbs used below

df <- tibble(ID = c("123abc", "123abc", "123abc", "456def", "456def", "789ghi"),
             var1 = c("No", "Yes", "No", "No", "No", "No"),
             var2 = c("No", "No", "No", "Yes", "No", "No"),
             var3 = c("No", "No", "No", "No", "No", "Yes"))

My goal is to first create subgroups based on ID, then check within each ID whether there is at least one value of "Yes", and if so change all of that ID's values to "Yes". I want to repeat this for a few variables (var1, var2, var3). This is the result I would like to have:

df <- tibble(ID = c("123abc", "123abc", "123abc", "456def", "456def", "789ghi"),
              var1 = c("Yes", "Yes", "Yes", "No", "No", "No"),
              var2 = c("No", "No", "No", "Yes", "Yes", "No"),
              var3 = c("No", "No", "No", "No", "No", "Yes"))

After this, I will remove duplicate rows to keep only the data that I need.

# With no column arguments, distinct() deduplicates on all columns
df <- distinct(df)

2. Describe what you've tried

I found partial solutions, but I'm having difficulty putting them together. I can group my data by ID using group_by from dplyr, but I'm having issues applying my other functions to the groups:

df <- df %>% group_by(ID)

I can replace "No" with "Yes" using if combined with any, but without the groups it changes all the values in var1:

if(any(df$var1 == "Yes"))
  {  df$var1 = "Yes"  }

The solution I'm trying to create would be similar to "Creating loop for slicing the data, loop through the duplicated positions", using for to loop over the IDs and then over the variables, but without replacing with random values.

I've promoted my comment to an answer to explain more.

First, we need to decide whether to use dplyr::summarise or dplyr::mutate. summarise makes a single row for every group, whereas mutate leaves the data with the same dimensions.

In your example data, all of the rows within each group will be the same after the transformation, so do you really need the duplicates? Perhaps your real data has other variables, so mutate might make sense.

From here, we just need to use dplyr::across to apply the same action to each column. The first argument selects the columns, and the second is the function you want to apply.
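For the summarise route, a minimal sketch could look like the following. It assumes the example data from the question, that the columns only ever hold "Yes" or "No", and that one row per ID is enough; the name collapsed is only illustrative.

```r
library(dplyr)

df <- tibble(ID = c("123abc", "123abc", "123abc", "456def", "456def", "789ghi"),
             var1 = c("No", "Yes", "No", "No", "No", "No"),
             var2 = c("No", "No", "No", "Yes", "No", "No"),
             var3 = c("No", "No", "No", "No", "No", "Yes"))

# One row per ID: "Yes" if any row in the group is "Yes", otherwise "No"
collapsed <- df %>%
  group_by(ID) %>%
  summarise(across(var1:var3, ~ if (any(.x == "Yes")) "Yes" else "No"))
```

Because summarise already returns a single row per group, this variant needs no separate deduplication step.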

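For mutate, we can use ifelse (a base R function, not part of dplyr) to test whether any value in the group is "Yes". If one is, we return "Yes" for the group; otherwise we keep the existing value. Because the test any(. == "Yes") is a single value, ifelse also returns a single value, and mutate recycles it across every row of the group. When no value is "Yes", that single value is the group's first element, which is fine here because each column only contains "Yes" or "No". Inside across, the current column is represented by a dot (.).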

df %>% 
  group_by(ID) %>%
  mutate(across(var1:var3, ~ ifelse(any(. == "Yes"),rep("Yes",length(.)),.)))
# A tibble: 6 x 4
# Groups:   ID [3]
  ID     var1  var2  var3 
  <chr>  <chr> <chr> <chr>
1 123abc Yes   No    No   
2 123abc Yes   No    No   
3 123abc Yes   No    No   
4 456def No    Yes   No   
5 456def No    Yes   No   
6 789ghi No    No    Yes  
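If the columns could hold values other than "Yes" and "No" (an assumption; the example data does not), ifelse with a length-one test would overwrite a group that has no "Yes" with that group's first value. The sketch below uses a plain if/else instead, which returns such a group unchanged. The data frame df2 and its "Maybe" value are hypothetical and only there to show the difference.

```r
library(dplyr)

# Hypothetical data: group "b" has no "Yes" and mixes "Maybe" and "No"
df2 <- tibble(ID = c("a", "a", "b", "b"),
              var1 = c("No", "Yes", "Maybe", "No"))

res <- df2 %>%
  group_by(ID) %>%
  # if/else returns the whole group vector when there is no "Yes"
  mutate(across(var1, ~ if (any(.x == "Yes")) rep("Yes", length(.x)) else .x)) %>%
  ungroup()

# Rows that are now identical can then be dropped
deduped <- distinct(res)
```

Here group "a" becomes all "Yes", while group "b" keeps "Maybe" and "No" as they were.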

If you're willing to use data.table, you can do all of this with lapply. Note that this returns one row per ID, like summarise. This is based on @ricardo-saporta's answer to "Summarizing multiple columns with data.table".

library(tibble)
library(data.table)

df <- tibble(ID = c("123abc", "123abc", "123abc", "456def", "456def", "789ghi"),
  var1 = c("No", "Yes", "No", "No", "No", "No"),
  var2 = c("No", "No", "No", "Yes", "No", "No"),
  var3 = c("No", "No", "No", "No", "No", "Yes"))

setDT(df)

any_yes <- function(x) {
  if (any(x == 'Yes')) {
    return('Yes')
  }
  
  'No'
}

df[, lapply(.SD, any_yes), by = ID]
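If the remaining rows need to be kept, as with mutate, a data.table sketch could update the columns by reference instead. This is not part of the answer above; the names dt, cols, and deduped are illustrative.

```r
library(data.table)

dt <- data.table(ID = c("123abc", "123abc", "123abc", "456def", "456def", "789ghi"),
                 var1 = c("No", "Yes", "No", "No", "No", "No"),
                 var2 = c("No", "No", "No", "Yes", "No", "No"),
                 var3 = c("No", "No", "No", "No", "No", "Yes"))

cols <- c("var1", "var2", "var3")

# Update in place, group by group: a single "Yes" is recycled across the
# group, otherwise the group's values are kept as they are
dt[, (cols) := lapply(.SD, function(x) if (any(x == "Yes")) "Yes" else x),
   by = ID, .SDcols = cols]

# Drop rows that are now exact duplicates
deduped <- unique(dt)
```

The := operator modifies dt itself, so no reassignment is needed; unique() then plays the role that distinct() has in the dplyr version.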

One more way, which I learned from dear @akrun, that removes the need for ifelse:

library(dplyr)

df %>% 
  group_by(ID) %>%
  mutate(across(var1:var3, ~  c('No', 'Yes')[1 + as.logical(sum(. == 'Yes'))]))

#> # A tibble: 6 x 4
#> # Groups:   ID [3]
#>   ID     var1  var2  var3 
#>   <chr>  <chr> <chr> <chr>
#> 1 123abc Yes   No    No   
#> 2 123abc Yes   No    No   
#> 3 123abc Yes   No    No   
#> 4 456def No    Yes   No   
#> 5 456def No    Yes   No   
#> 6 789ghi No    No    Yes

Created on 2021-06-19 by the reprex package (v2.0.0)
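To see why the indexing expression works, it can help to run it step by step on a single group. The sketch below wraps the same expression in a helper; the name pick_label is hypothetical and not part of the answer.

```r
# Same expression as in the across() call, wrapped in a helper
pick_label <- function(x) {
  n_yes <- sum(x == "Yes")        # number of "Yes" values in the group
  has_yes <- as.logical(n_yes)    # 0 becomes FALSE, anything else TRUE
  c("No", "Yes")[1 + has_yes]     # FALSE picks index 1, TRUE picks index 2
}

with_yes <- pick_label(c("No", "Yes", "No"))
without_yes <- pick_label(c("No", "No"))
```

The helper returns a single string per group, which mutate recycles across the group's rows. A group without any "Yes" is set to "No" throughout, so this approach also assumes the columns hold only those two values.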
