
Using grep on multiple columns to create new variable in R

I'm trying to run grep across multiple columns to create a new binary variable in my dataset. I can't share my real dataset, but I've created a sample one to demonstrate my issue:

breakfast <- c("apple orange", "orange banana", "apple")
lunch <- c("orange", "apple orange", "apple banana")
df <- data.frame(breakfast, lunch)

In this example, my goal is to create a new binary variable in this data frame called "apple" that is 1 if either the "breakfast" or "lunch" column contains "apple" and 0 if neither does.

I can achieve this by using nested ifelse statements and grepl:

df$apple <- ifelse(grepl("apple", df$breakfast), 1,
            ifelse(grepl("apple", df$lunch), 1, 0))

In my real dataset though, I need to scan more than just two columns and repeat the process for multiple strings, so I'm hoping to create a function that will run it through the columns for me. What's the best way to do this?

I've found several posts that address similar questions, but many of them match variables that hold a single value rather than concatenated strings (== "apple" rather than contains "apple"). I'm also struggling with how to adapt existing examples to create the binary variable I'm looking for.

A general solution would be to (i) define a vector with all possible fruits

fruits <- c("apple", "orange", "banana", "lemon")

and (ii) run a for loop that checks whether each fruit is present in any of the columns and creates a new 0/1 column for each fruit:

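library(stringr)

# Paste the text columns together once, before any new columns are added
meals <- apply(df[c("breakfast", "lunch")], 1, paste, collapse = " ")

for (fruit in fruits) {
  # str_detect() returns TRUE/FALSE; the unary + converts that to 1/0
  df[[fruit]] <- +str_detect(meals, fruit)
}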
df
      breakfast        lunch apple orange banana lemon
1  apple orange       orange     1      1      0     0
2 orange banana apple orange     1      1      1     0
3         apple apple banana     1      0      1     0 
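
Since the question asks for a function, the loop above could be wrapped along the following lines. This is only a sketch in base R (no stringr), and the helper name `flag_terms` and its arguments `cols`, `terms`, and `whole_word` are made up for illustration; they are not part of the original answer. It assumes the search terms contain no regex metacharacters.

```r
# Hypothetical helper, not from the original answer: adds one 0/1 column
# per search term, set to 1 if any of the given columns contains the term.
flag_terms <- function(data, cols, terms, whole_word = FALSE) {
  for (term in terms) {
    # With whole_word = TRUE, \\b anchors keep "apple" from matching "pineapple"
    pattern <- if (whole_word) paste0("\\b", term, "\\b") else term
    # One logical vector per column: does that column contain the pattern?
    hits <- lapply(data[cols], function(x) grepl(pattern, x))
    # Combine the columns with OR, then convert TRUE/FALSE to 1/0
    data[[term]] <- as.integer(Reduce(`|`, hits))
  }
  data
}

breakfast <- c("apple orange", "orange banana", "apple")
lunch <- c("orange", "apple orange", "apple banana")
df <- data.frame(breakfast, lunch)

flagged <- flag_terms(df, cols = c("breakfast", "lunch"),
                      terms = c("apple", "orange", "banana", "lemon"))
flagged
```

On the sample data this should give the same apple, orange, banana and lemon columns as the output above. Naming the columns explicitly means the helper never scans the flag columns it has just created, and `whole_word = TRUE` may be worth using if some terms can occur inside longer words.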

For more solutions see "Detecting key words across multiple columns and flagging them each in new columns".

