Exact matching of text against a dataframe column in R

Question

I have a vector of words in R:

words = c("Awesome","Loss","Good","Bad")

And I have the following dataframe in R:

df <- data.frame(ID = c(1,2,3),
                 Response = c("Today is an awesome day", 
                              "Yesterday was a bad day,but today it is good",
                              "I have losses today"))

I want to extract the words that match exactly in the Response column and insert them into a new column of the dataframe. The final output should look like this:

ID  Response                                      Match
1   Today is an awesome day                       Awesome
2   Yesterday was a bad day,but today it is good  Bad,Good
3   I have losses today                           NA

I used the following code:

# extract the list of matching words
x <- sapply(words, function(x) grepl(tolower(x), tolower(df$Response)))

# paste the matching words together
df$Words <- apply(x, 1, function(i) paste0(names(i)[i], collapse = ","))

This returns partial matches (for example, "Loss" matches "losses"), not exact ones. Please help.

Answer 1

If you use anchors in your words vector, you can force exact matches: ^ asserts that you're at the start and $ that you're at the end of the string (not of a word), so "^Loss$" matches only a Response that consists of that word alone and no longer matches "losses". So:

words = c("Awesome","^Loss$","Good","Bad")

Then use your code:

x <- sapply(words, function(x) grepl(tolower(x), tolower(df$Response)))
df$Words <- apply(x, 1, function(i) paste0(names(i)[i], collapse = ","))

which gives:

> df
  ID                                     Response    Words
1  1                      Today is an awesome day  Awesome
2  2 Yesterday was a bad day,but today it is good Good,Bad
3  3                          I have losses today

To turn blanks into NA:

df$Words[df$Words == ""] <- NA
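
To see what the anchors actually constrain, here is a small sketch (the responses vector is made up for illustration and is not part of the question's data): ^ and $ tie the pattern to the start and end of the whole string, whereas \\b marks a word boundary inside it.

```r
# Hypothetical inputs, chosen only to contrast the two kinds of anchors
responses <- c("loss", "I have losses today", "a loss today")

# ^ and $ anchor the whole string, so only an element that is exactly "loss" matches
whole_string <- grepl("^loss$", responses)

# \\b marks a word boundary, so the standalone word is also found inside a sentence
whole_word <- grepl("\\bloss\\b", responses)

print(whole_string)
print(whole_word)
```

With sentence-length responses, the ^...$ form therefore never matches, which is why row 3 comes back blank above.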

Answer 2

We can use str_extract_all:

library(stringr)
library(dplyr)
library(purrr)
df %>%
    mutate(Words = map_chr(str_extract_all(Response,
       str_c("(?i)\\b(", str_c(words, collapse = "|"), ")\\b")), toString))
#   ID                                     Response     Words
#1  1                      Today is an awesome day   awesome
#2  2 Yesterday was a bad day,but today it is good bad, good
#3  3                          I have losses today

Data

words <- c("Awesome","Loss","Good","Bad")
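
The stringr result keeps the casing found in the text ("awesome") and leaves an empty string where nothing matches. As a rough base R sketch of the same idea, not part of the answer above, the hits can be mapped back to the spelling used in words and empty results turned into NA. The names responses, pattern, hits and matched are illustrative only, and the sketch assumes the words contain no regex metacharacters.

```r
words <- c("Awesome", "Loss", "Good", "Bad")
responses <- c("Today is an awesome day",
               "Yesterday was a bad day,but today it is good",
               "I have losses today")

# One alternation wrapped in word boundaries: \\b(Awesome|Loss|Good|Bad)\\b
pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")

# Extract every whole-word hit per response, ignoring case
hits <- regmatches(responses, gregexpr(pattern, responses, ignore.case = TRUE))

# Map each hit back to its spelling in `words`; return NA when nothing matched
matched <- vapply(hits, function(h) {
  if (length(h) == 0) return(NA_character_)
  toString(words[match(tolower(h), tolower(words))])
}, character(1))

print(matched)
```

Unlike the grepl-based approaches, this lists matches in the order they appear in the text rather than in the order of words.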

Answer 3

Change the first *apply call to use a two-line function. If the regex becomes "\\bword\\b", it matches the word only when it is surrounded by word boundaries.

x <- sapply(words, function(x) {
  y <- paste0("\\b", x, "\\b")
  grepl(tolower(y), tolower(df$Response))
})

Now run the second apply as posted in the question.

df$Words <- apply(x, 1, function(i) paste0(names(i)[i], collapse = ","))

df
#  ID                                     Response    Words
#1  1                      Today is an awesome day  Awesome
#2  2 Yesterday was a bad day,but today it is good Good,Bad
#3  3                          I have losses today

As for the NA's, I will use the function is.na<-.

is.na(df$Words) <- df$Words == ""

Data.

df <- read.table(text = "
ID           Response
1            'Today is an awesome day'
2            'Yesterday was a bad day,but today it is good'
3            'I have losses today'
", header = TRUE)

words <- c("Awesome","Loss","Good","Bad")
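
The steps in this answer could be bundled into one function, sketched here under a few assumptions: match_words is a hypothetical helper name, the words contain no regex metacharacters, and ignore.case = TRUE replaces the tolower() calls (lowercasing a pattern would also lowercase escapes such as \\B or \\S and change their meaning).

```r
# match_words() is a hypothetical helper, not taken from the answers above
match_words <- function(text, words) {
  # One logical column per word: TRUE where the word occurs as a whole word
  hits <- sapply(words, function(w) {
    grepl(paste0("\\b", w, "\\b"), text, ignore.case = TRUE)
  })
  # sapply() returns a plain vector for a single text; force one row per text
  hits <- matrix(hits, nrow = length(text), dimnames = list(NULL, words))
  # Collapse the matching words for each row, then turn blanks into NA
  out <- apply(hits, 1, function(i) paste(words[i], collapse = ","))
  out[out == ""] <- NA
  out
}

result <- match_words(
  c("Today is an awesome day",
    "Yesterday was a bad day,but today it is good",
    "I have losses today"),
  c("Awesome", "Loss", "Good", "Bad")
)
print(result)
```

With the question's data this would be called as df$Words <- match_words(df$Response, words).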
