Add column with percentage of matching words in two different columns (by row) in R
I have a tbl_df and want to see the percentage of matching words between two strings.
The data looks like this:
# A tibble 3 x 2
X Y
<chr> <chr>
1 "mary smith" "mary smith"
2 "mary smith" "john smith"
3 "mike williams" "jack johnson"
Desired output (words matched within each row, in any order):
# A tibble 3 x 3
X Y Z
<chr> <chr> <dbl>
1 "mary smith" "mary smith" 1.0
2 "mary smith" "john smith" 0.50
3 "mike williams" "jack johnson" 0.0
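A base R option is to split both columns on spaces with strsplit, take the length of the common words (intersect), and divide by the number of words in X (length):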
df1$Z <- mapply(function(x, y) length(intersect(x, y))/length(x),
strsplit(df1$X, " "), strsplit(df1$Y, " "))
df1$Z
#[1] 1.0 0.5 0.0
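The same split/intersect/divide logic can be wrapped in a small helper. The sketch below is only one possible variant: the function name word_match is hypothetical, and lower-casing, splitting on runs of whitespace, and counting each distinct word once are assumptions that go beyond the answer above.

```r
# Hypothetical helper (not part of the original answer):
# share of the distinct words in `x` that also occur in `y`.
word_match <- function(x, y) {
  # Assumption: matching is case-insensitive and tolerant of extra spaces
  wx <- unique(strsplit(tolower(trimws(x)), "\\s+")[[1]])
  wy <- unique(strsplit(tolower(trimws(y)), "\\s+")[[1]])
  # Avoid 0 / 0 when `x` contains no words
  if (length(wx) == 0) return(NA_real_)
  length(intersect(wx, wy)) / length(wx)
}

X <- c("mary smith", "mary smith", "mike williams")
Y <- c("mary smith", "john smith", "jack johnson")

# Apply the helper row by row
Z <- mapply(word_match, X, Y, USE.NAMES = FALSE)
Z
```

Because the helper de-duplicates words first, a string such as "mary mary" compared with "mary" scores 1, whereas dividing by the raw length(x) would give 0.5. Which behaviour is preferable depends on the data.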
Or, in the tidyverse, we can use purrr's map2 family and apply the same logic:
library(tidyverse)
df1 %>%
  # map2_dbl() returns a numeric vector, so Z is a <dbl> column rather than a list column
  mutate(Z = map2_dbl(strsplit(X, " "), strsplit(Y, " "), ~
    length(intersect(.x, .y)) / length(.x)))
# X Y Z
#1 mary smith mary smith 1.0
#2 mary smith john smith 0.5
#3 mike williams jack johnson 0.0
# Data used in the examples above
df1 <- structure(list(X = c("mary smith", "mary smith", "mike williams"
), Y = c("mary smith", "john smith", "jack johnson")), .Names = c("X",
"Y"), class = "data.frame", row.names = c("1", "2", "3"))
Here is a tidyverse option using stringr::str_split:
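library(dplyr)
library(purrr)    # provides map2_dbl()
library(stringr)
# Note: .x == .y compares words position by position, so word order matters here
df %>%
  mutate(Z = map2_dbl(str_split(X, " "), str_split(Y, " "), ~ sum(.x == .y) / length(.x)))
# X Y Z
#1 mary smith mary smith 1.0
#2 mary smith john smith 0.5
#3 mike williams jack johnson 0.0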
Or using stringi::stri_extract_all_words:
library(purrr)    # provides map2_dbl()
library(stringi)
df %>%
  mutate(Z = map2_dbl(stri_extract_all_words(X), stri_extract_all_words(Y), ~ sum(.x == .y) / length(.x)))
df <- read.table(text =
' X Y
"mary smith" "mary smith"
"mary smith" "john smith"
"mike williams" "jack johnson"', header = TRUE, stringsAsFactors = FALSE)
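Since the question asks for matches "in any order", it may be worth checking how a position-by-position comparison with == differs from a set-based comparison with intersect. A minimal sketch, using a made-up pair of strings that is not in the original data:

```r
# Illustrative input (an assumption, not from the question's data):
# the same two words in reversed order
x <- strsplit("mary smith", " ")[[1]]
y <- strsplit("smith mary", " ")[[1]]

# Compares word 1 with word 1, word 2 with word 2, and so on
positional <- sum(x == y) / length(x)

# Ignores word order: counts words of x that appear anywhere in y
unordered <- length(intersect(x, y)) / length(x)

positional  # 0
unordered   # 1
```

Also note that == recycles the shorter vector when the two strings have different numbers of words, so the intersect-based version is likely the safer choice for unordered matching.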
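Try stringsim() from the stringdist package. Note that it measures character-level similarity between the two strings rather than the share of matching words: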
library(dplyr)       # provides tibble(), mutate() and %>%
library(stringdist)
tbl <- tibble(x = c("mary smith", "mary smith", "mike williams"),
y = c("mary smith", "john smith", "jack johnson"))
# lv = levenshtein distance
tbl %>% mutate(z = stringsim(x, y, method ='lv'))
# jw = jaro-winkler
tbl %>% mutate(z = stringsim(x, y, method ='jw'))
## > tbl %>% mutate(z = stringsim(x, y, method ='lv'))
## # A tibble: 3 x 3
## x y z
## <chr> <chr> <dbl>
## 1 mary smith mary smith 1.00
## 2 mary smith john smith 0.600
## 3 mike williams jack johnson 0.0769
## > tbl %>% mutate(z = stringsim(x, y, method ='jw'))
## # A tibble: 3 x 3
## x y z
## <chr> <chr> <dbl>
## 1 mary smith mary smith 1.00
## 2 mary smith john smith 0.733
## 3 mike williams jack johnson 0.494
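To see where the 'lv' scores come from: for Levenshtein distance, the similarity is 1 minus the edit distance divided by the length of the longer string. The sketch below reproduces that arithmetic with base R's adist() instead of stringdist, as a rough cross-check rather than a replacement; the variable names are illustrative.

```r
x <- c("mary smith", "mary smith", "mike williams")
y <- c("mary smith", "john smith", "jack johnson")

# adist() returns the full matrix of Levenshtein distances;
# the diagonal holds the row-wise pairs (x[i], y[i])
d <- diag(adist(x, y))

# Similarity = 1 - distance / length of the longer string
sim <- 1 - d / pmax(nchar(x), nchar(y))
sim
```

For the second row, "mary" and "john" differ in all four characters, so the distance is 4 and the similarity is 1 - 4/10 = 0.6. This is a character-level score, so it rewards similar spellings instead of counting whole matching words.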