
Add column with percentage of matching words in two different columns (by row) in R

I have a tbl_df and want to see the percentage of matching words between two strings.

The data looks like this:

# A tibble 3 x 2
       X                 Y
     <chr>             <chr>
1 "mary smith"      "mary smith"
2 "mary smith"      "john smith"
3 "mike williams"   "jack johnson"

Desired output (words matched within each row, in any order):

# A tibble 3 x 3 
       X               Y           Z 
     <chr>           <chr>        <dbl>
1 "mary smith"    "mary smith"     1.0 
2 "mary smith"    "john smith"     0.50 
3 "mike williams" "jack johnson"   0.0

A base R option is to split both columns on spaces, take the length of the common words (via intersect), and divide by the length of the words in X:

df1$Z <- mapply(function(x, y)  length(intersect(x, y))/length(x), 
            strsplit(df1$X, " "), strsplit(df1$Y, " "))
df1$Z
#[1] 1.0 0.5 0.0

Or, in the tidyverse, we can use map2 and apply the same logic:

library(tidyverse)
# map2_dbl() returns a numeric vector, so Z is a <dbl> column rather than a list column
df1 %>% 
  mutate(Z = map2_dbl(strsplit(X, " "), strsplit(Y, " "), ~ 
                       length(intersect(.x, .y))/length(.x)))
#              X            Y   Z
#1    mary smith   mary smith 1.0
#2    mary smith   john smith 0.5
#3 mike williams jack johnson 0.0
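The intersect logic above can also be wrapped in a small reusable function. This is only a sketch: the name `word_match_pct` is hypothetical (it does not come from the answer), and returning NA when X contains no words is an assumption about how that edge case should be handled.

```r
# Sample data, same values as in the answer
df1 <- data.frame(
  X = c("mary smith", "mary smith", "mike williams"),
  Y = c("mary smith", "john smith", "jack johnson"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: share of the words in x that also occur in y,
# regardless of word order
word_match_pct <- function(x, y) {
  mapply(
    function(a, b) {
      # Assumption: with no words in x the proportion is undefined, so return NA
      if (length(a) == 0) return(NA_real_)
      length(intersect(a, b)) / length(a)
    },
    strsplit(x, " ", fixed = TRUE),
    strsplit(y, " ", fixed = TRUE)
  )
}

df1$Z <- word_match_pct(df1$X, df1$Y)
```

Because the comparison goes through intersect, a reordered pair such as "smith mary" and "mary smith" still scores 1.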

Data

df1 <- structure(list(X = c("mary smith", "mary smith", "mike williams"
), Y = c("mary smith", "john smith", "jack johnson")), .Names = c("X", 
"Y"), class = "data.frame", row.names = c("1", "2", "3"))

Here is a tidyverse option using stringr::str_split:

library(dplyr)
library(stringr)
library(purrr)   # provides map2_dbl()
df %>%
    mutate(Z = map2_dbl(str_split(X, " "), str_split(Y, " "), ~sum(.x == .y) / length(.x)))
#              X            Y   Z
#1    mary smith   mary smith 1.0
#2    mary smith   john smith 0.5
#3 mike williams jack johnson 0.0

Or using stringi::stri_extract_all_words:

library(stringi)
library(purrr)   # provides map2_dbl()
df %>%
    mutate(Z = map2_dbl(stri_extract_all_words(X), stri_extract_all_words(Y), ~sum(.x == .y) / length(.x)))
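One thing to keep in mind with the `sum(.x == .y)` variants is that they compare words position by position. On the sample data this gives the same result as intersect, but it may not when the same words appear in a different order, which the question explicitly allows. A minimal base R sketch of the difference, using a made-up reordered pair that is not part of the original data:

```r
# Made-up pair: same words, different order
x <- strsplit("mary smith", " ", fixed = TRUE)[[1]]
y <- strsplit("smith mary", " ", fixed = TRUE)[[1]]

# Positional comparison: a word only counts if it sits in the same slot
positional <- sum(x == y) / length(x)

# Set-based comparison: a word counts if it occurs anywhere in y
order_free <- length(intersect(x, y)) / length(x)
```

Here `positional` is 0 and `order_free` is 1. If the two strings have different numbers of words, `==` will also recycle the shorter vector, so the set-based form is probably the safer choice when order should not matter.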

Sample data

df <- read.table(text =
    '       X                 Y
 "mary smith"      "mary smith"
 "mary smith"      "john smith"
 "mike williams"   "jack johnson"', header = TRUE, stringsAsFactors = FALSE)

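Try stringsim() from the stringdist package. Note that it scores similarity at the character level rather than counting matching words: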

library(stringdist)
library(dplyr)   # provides tibble(), mutate() and %>%

tbl <- tibble(x = c("mary smith", "mary smith", "mike williams"),
              y = c("mary smith", "john smith", "jack johnson"))

# lv = levenshtein distance
tbl %>% mutate(z = stringsim(x, y, method ='lv'))

# jw =  jaro-winkler 
tbl %>% mutate(z = stringsim(x, y, method ='jw'))

## > tbl %>% mutate(z = stringsim(x, y, method ='lv'))
## # A tibble: 3 x 3
##  x             y                 z
##  <chr>         <chr>         <dbl>
## 1 mary smith    mary smith   1.00  
## 2 mary smith    john smith   0.600 
## 3 mike williams jack johnson 0.0769

## > tbl %>% mutate(z = stringsim(x, y, method ='jw'))
## # A tibble: 3 x 3
##   x             y                z
##  <chr>         <chr>        <dbl>
## 1 mary smith    mary smith   1.00 
## 2 mary smith    john smith   0.733
## 3 mike williams jack johnson 0.494
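To see where the 'lv' scores come from, the sketch below recomputes them in base R. It assumes that stringsim() defines Levenshtein similarity as 1 minus the edit distance divided by the length of the longer string, which is how I read the stringdist documentation; utils::adist() stands in for the package's own distance routine, and the name `lv_sim` is made up for this example.

```r
x <- c("mary smith", "mary smith", "mike williams")
y <- c("mary smith", "john smith", "jack johnson")

# adist() returns a full distance matrix, so compute one x/y pair at a time
d <- mapply(function(a, b) as.numeric(utils::adist(a, b)), x, y,
            USE.NAMES = FALSE)

# Assumed definition: similarity = 1 - distance / length of the longer string
lv_sim <- 1 - d / pmax(nchar(x), nchar(y))
```

For the second row, turning "mary" into "john" takes four substitutions over ten characters, giving 1 - 4/10 = 0.6, which agrees with the stringsim() output above.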
