Add a column in R with the percentage of matching words between two columns (row-wise)

Question

I have a tbl_df and want to see the percentage of matching words between two strings.

The data looks like this:

# A tibble: 3 x 2
       X                 Y
     <chr>             <chr>
1 "mary smith"      "mary smith"
2 "mary smith"      "john smith"
3 "mike williams"   "jack johnson"

Desired output (words matched row by row, in any order):

# A tibble: 3 x 3
       X               Y           Z 
     <chr>           <chr>        <dbl>
1 "mary smith"    "mary smith"     1.0 
2 "mary smith"    "john smith"     0.50 
3 "mike williams" "jack johnson"   0.0

Answer 1

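A base R option is to split both columns on spaces, take the length of the words they have in common (intersect), and divide by the length of the words in X.

df1$Z <- mapply(function(x, y)  length(intersect(x, y))/length(x), 
            strsplit(df1$X, " "), strsplit(df1$Y, " "))
df1$Z
#[1] 1.0 0.5 0.0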

Or, in the tidyverse, we can apply the same logic with map2_dbl (plain map2 would return a list column rather than a numeric one):

library(tidyverse)
df1 %>% 
  mutate(Z = map2_dbl(strsplit(X, " "), strsplit(Y, " "), ~ 
                       length(intersect(.x, .y))/length(.x)))
#              X            Y   Z
#1    mary smith   mary smith 1.0
#2    mary smith   john smith 0.5
#3 mike williams jack johnson 0.0
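
The same logic can be wrapped in a small helper, which makes the choice of denominator explicit. This is only a minimal sketch and not part of the original answer: the name word_match_pct is hypothetical, and lowercasing and de-duplicating the words are assumptions about what should count as a match.

```r
# Hypothetical helper: share of the distinct words in x that also occur in y.
word_match_pct <- function(x, y) {
  # Split on single spaces, as in the answer above. tolower() and unique()
  # are assumptions added for illustration, not part of the original logic.
  wx <- lapply(strsplit(tolower(x), " ", fixed = TRUE), unique)
  wy <- lapply(strsplit(tolower(y), " ", fixed = TRUE), unique)
  # Row-wise: number of shared words divided by the number of words in x.
  mapply(function(a, b) length(intersect(a, b)) / length(a), wx, wy)
}

df1 <- data.frame(
  X = c("mary smith", "mary smith", "mike williams"),
  Y = c("mary smith", "john smith", "jack johnson"),
  stringsAsFactors = FALSE
)
df1$Z <- word_match_pct(df1$X, df1$Y)
df1$Z
```

Dividing by length(a) measures how much of X is found in Y; if a symmetric score is wanted, dividing by the larger of the two word counts would be one alternative.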

Data

df1 <- structure(list(X = c("mary smith", "mary smith", "mike williams"
), Y = c("mary smith", "john smith", "jack johnson")), .Names = c("X", 
"Y"), class = "data.frame", row.names = c("1", "2", "3"))

Answer 2

Here is a tidyverse option using stringr::str_split

library(dplyr)
library(purrr)    # provides map2_dbl
library(stringr)
df %>%
    mutate(Z = map2_dbl(str_split(X, " "), str_split(Y, " "), ~sum(.x == .y) / length(.x)))
#              X            Y   Z
#1    mary smith   mary smith 1.0
#2    mary smith   john smith 0.5
#3 mike williams jack johnson 0.0

Or using stringi::stri_extract_all_words

library(stringi)
df %>%
    mutate(Z = map2_dbl(stri_extract_all_words(X), stri_extract_all_words(Y), ~sum(.x == .y) / length(.x)))
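
One caveat is worth illustrating: .x == .y compares words position by position, while the question asks for matches in any order. The sketch below uses a made-up pair of strings (not from the question's data) and base R only to show where the two approaches diverge.

```r
# Made-up example: same words, different order.
x <- strsplit("mary smith", " ", fixed = TRUE)[[1]]
y <- strsplit("smith mary", " ", fixed = TRUE)[[1]]

# Position-by-position comparison, as in sum(.x == .y) / length(.x)
positional <- sum(x == y) / length(x)

# Order-independent comparison, as in length(intersect(.x, .y)) / length(.x)
any_order <- length(intersect(x, y)) / length(x)

# positional is 0 and any_order is 1 for this pair
c(positional = positional, any_order = any_order)
```

When the two strings contain different numbers of words, == recycles the shorter vector (with a warning if the lengths are not multiples of each other), so the intersect form is probably the safer choice for this question.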

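Sample data

# stringsAsFactors = FALSE keeps X and Y as character columns in R < 4.0
df <- read.table(text =
    '       X                 Y
 "mary smith"      "mary smith"
 "mary smith"      "john smith"
 "mike williams"   "jack johnson"', header = TRUE, stringsAsFactors = FALSE)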

Answer 3

Try stringsim() from the stringdist package:

library(dplyr)       # tibble(), mutate(), %>%
library(stringdist)

tbl <- tibble(x = c("mary smith", "mary smith", "mike williams"),
              y = c("mary smith", "john smith", "jack johnson"))

# lv = levenshtein distance
tbl %>% mutate(z = stringsim(x, y, method ='lv'))

# jw =  jaro-winkler 
tbl %>% mutate(z = stringsim(x, y, method ='jw'))

## > tbl %>% mutate(z = stringsim(x, y, method ='lv'))
## # A tibble: 3 x 3
##  x             y                 z
##  <chr>         <chr>         <dbl>
## 1 mary smith    mary smith   1.00  
## 2 mary smith    john smith   0.600 
## 3 mike williams jack johnson 0.0769

## > tbl %>% mutate(z = stringsim(x, y, method ='jw'))
## # A tibble: 3 x 3
##   x             y                z
##  <chr>         <chr>        <dbl>
## 1 mary smith    mary smith   1.00 
## 2 mary smith    john smith   0.733
## 3 mike williams jack johnson 0.494
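
Note that these scores measure character-level similarity rather than the share of matching words, which is why row 2 gets 0.6 instead of 0.5. As a rough illustration of where the 'lv' numbers come from, the sketch below recomputes them with base R's adist(). It assumes the similarity is 1 minus the Levenshtein distance divided by the length of the longer string; that normalization is an assumption here, so check the stringdist documentation before relying on it.

```r
x <- c("mary smith", "mary smith", "mike williams")
y <- c("mary smith", "john smith", "jack johnson")

# Row-wise Levenshtein distance. adist() on two vectors returns the full
# distance matrix, so compare one pair at a time.
lv_dist <- mapply(function(a, b) as.numeric(adist(a, b)), x, y, USE.NAMES = FALSE)

# Assumed normalization: divide by the length of the longer string.
lv_sim <- 1 - lv_dist / pmax(nchar(x), nchar(y))
round(lv_sim, 4)
```

If the assumption holds, the rounded values should line up with the z column in the 'lv' output above.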
