Exact matching of text against a dataframe column in R
I have a vector of words in R:
words = c("Awesome","Loss","Good","Bad")
I have the following dataframe in R:
df <- data.frame(ID = c(1,2,3),
                 Response = c("Today is an awesome day",
                              "Yesterday was a bad day,but today it is good",
                              "I have losses today"))
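What I want is for the words that match exactly in the Response column to be extracted and inserted into a new column of the dataframe. The final output should look like this: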
ID  Response                                      Match
1   Today is an awesome day                       Awesome
2   Yesterday was a bad day,but today it is good  Bad,Good
3   I have losses today                           NA
I used the following code:
x <- sapply(words, function(x) grepl(tolower(x), tolower(df$Response)))
df$Words <- apply(x, 1, function(i) paste0(names(i)[i], collapse = ","))
It does produce matches, but they are not exact. Please help.
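If you use anchors in the words vector, the match becomes exact: ^ asserts the position at the start of the string and $ asserts the position at the end of the string. So: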
words = c("Awesome","^Loss$","Good","Bad")
Then use your code:
x <- sapply(words, function(x) grepl(tolower(x), tolower(df$Response)))
df$Words <- apply(x, 1, function(i) paste0(names(i)[i], collapse = ","))
This gives:
> df
ID Response Words
1 1 Today is an awesome day Awesome
2 2 Yesterday was a bad day,but today it is good Good,Bad
3 3 I have losses today
To turn the blank into NA:
df$Words[df$Words == ""] <- NA
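One caveat may be worth checking before relying on anchors: ^ and $ refer to the whole string rather than to a single word, so an anchored pattern can miss a standalone word inside a longer sentence. The sketch below compares the anchored pattern with a word-boundary pattern; the second sentence is a made-up example and is not part of the question's data.

```r
# Illustrative sentences; the second one is hypothetical (not in the original data)
s <- c("I have losses today", "I have a loss today")

# Anchored to the whole string: TRUE only if the entire string is "loss"
anchored <- grepl("^loss$", s, ignore.case = TRUE)

# Word boundaries: TRUE if "loss" appears as a standalone word anywhere
bounded <- grepl("\\bloss\\b", s, ignore.case = TRUE)

print(anchored)
print(bounded)
```

Here anchored should be FALSE for both sentences, while bounded should be TRUE only for the second one.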
We can use str_extract_all:
library(stringr)
library(dplyr)
library(purrr)
df %>%
  mutate(Words = map_chr(str_extract_all(Response,
    str_c("(?i)\\b(", str_c(words, collapse = "|"), ")\\b")), toString))
# ID Response Words
#1 1 Today is an awesome day awesome
#2 2 Yesterday was a bad day,but today it is good bad, good
#3 3 I have losses today
words <- c("Awesome","Loss","Good","Bad")
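The extracted matches above keep the casing found in the text ("awesome") and follow the order in which they occur in each response. If the spelling from the words vector and an NA for non-matching rows are wanted, as in the question's desired output, something like the following base R sketch might work. It swaps str_extract_all for regmatches() and gregexpr(), which is a substitution on my part and not the approach used in the answer.

```r
words <- c("Awesome", "Loss", "Good", "Bad")
df <- data.frame(ID = c(1, 2, 3),
                 Response = c("Today is an awesome day",
                              "Yesterday was a bad day,but today it is good",
                              "I have losses today"),
                 stringsAsFactors = FALSE)

# One alternation pattern with word boundaries, matched case-insensitively
pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
hits <- regmatches(df$Response,
                   gregexpr(pattern, df$Response, ignore.case = TRUE))

# Map each hit back to the spelling in `words`; NA when nothing matched
df$Words <- vapply(hits, function(h) {
  if (length(h) == 0) return(NA_character_)
  paste(words[match(tolower(h), tolower(words))], collapse = ",")
}, character(1), USE.NAMES = FALSE)

print(df)
```

Note that this assumes the entries of words contain no regular-expression metacharacters; words containing such characters would need escaping first.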
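Change the first *apply function into a two-line function. If the regular expression becomes "\\bword\\b", it captures the word only when it is surrounded by word boundaries.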
x <- sapply(words, function(x) {
  y <- paste0("\\b", x, "\\b")
  grepl(tolower(y), tolower(df$Response))
})
Now run the second apply call posted in the question.
df$Words <- apply(x, 1, function(i) paste0(names(i)[i], collapse = ","))
df
# ID Response Words
#1 1 Today is an awesome day Awesome
#2 2 Yesterday was a bad day,but today it is good Good,Bad
#3 3 I have losses today
As for the NA, I would use the function is.na<-.
is.na(df$Words) <- df$Words == ""
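For reuse, the steps of this answer could be wrapped into a single function. This is only a sketch: the name match_words and its arguments are hypothetical, and it uses ignore.case = TRUE in place of the tolower() calls, which should behave the same for these inputs.

```r
# Hypothetical helper; the name and signature are assumptions for illustration
match_words <- function(responses, words) {
  # One logical vector per word, flagging responses that contain it as a whole word
  hits <- vapply(words, function(w) {
    grepl(paste0("\\b", w, "\\b"), responses, ignore.case = TRUE)
  }, logical(length(responses)))
  # Force a responses-by-words matrix (vapply returns a plain vector for one response)
  hits <- matrix(hits, nrow = length(responses), dimnames = list(NULL, words))
  # Collapse the matched words per response, in the order given by `words`
  out <- apply(hits, 1, function(i) paste0(words[i], collapse = ","))
  # Empty strings mean no match; mark them as NA
  is.na(out) <- out == ""
  out
}

responses <- c("Today is an awesome day",
               "Yesterday was a bad day,but today it is good",
               "I have losses today")
res <- match_words(responses, c("Awesome", "Loss", "Good", "Bad"))
print(res)
```

The result could then be assigned with df$Words <- match_words(df$Response, words).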
Data.
df <- read.table(text = "
ID Response
1 'Today is an awesome day'
2 'Yesterday was a bad day,but today it is good'
3 'I have losses today'
", header = TRUE)
words <- c("Awesome","Loss","Good","Bad")