繁体   English   中英

从 R 中的字符串向量匹配单词

[英]Matching words from vectors of strings in R

我正在尝试通过将混乱的站点名称列表与批准的列表进行匹配来清理数据库。

例如,首选站点名称可能是“Cotswold Water Park Pit 28”,但该站点已输入数据库:“Pit 28”、“28”、“CWP Pit 28”和“Cotswold 28”。

数据看起来像这样:

approved <- c("Cotswold Water Park Pit 28", "Cotswold Water Park Pit 14", "Robinswood Hill")

messy <- c("Pit 28", "28", "CWP Pit 28", "Cotswold 28", "14", "Robinswood")

我正在寻找一种方法来匹配每个元素中的单词/数字(非空格字符的簇),与每个元素中的messy /数字相approved 理想情况下,我会得到这样的结果:

     Cotswold Water Park Pit 28 Cotswold Water Park Pit 14 Robinswood Hill
[1,] "Pit 28"                   "Pit 28"                   "Robinswood"   
[2,] "28"                       "CWP Pit 28"               NA             
[3,] "CWP Pit 28"               "14"                       NA             
[4,] "Cotswold 28"              NA                         NA   

approved的元素形成列名,并且任何来自messy的包含匹配单词/数字的元素出现在该列的单元格中。 我知道会有一些错误的匹配。 这很好,我可以稍后手动过滤它们,并且可能会从模式匹配中排除“森林”和“山丘”等常用词。

通过使用regex将每个元素拆分成messy的状态,我已经能够通过上述示例数据获得我想要的结果,但是我正在处理站点名称列表中的单词/数字列表,我不得不使用嵌套循环或sapply以将它们与已批准的元素匹配,因为像grepgreplstr_detect这样的函数只允许一种模式。 由于数据库很大,当我将它应用于整个事物时,这需要很长时间。 我真正想要的是一个 function :

match(any word in approved[1], any word in messy[1])

要么给我一个TRUE FALSE output 要么提取 messy messy[1]如果它匹配会很棒!

也许您正在寻找adist

x <- adist(messy, approved, fixed=FALSE, ignore.case = TRUE)
y <- t(adist(approved, messy, fixed=FALSE, ignore.case = TRUE))
i <- x == apply(x, 1, min)
y[!i]  <- NA
colnames(y) <- approved
i <- apply(y == apply(y, 1, min, na.rm=TRUE), 2, function(i) messy[i & !is.na(i)])
do.call(cbind, lapply(i, function(x) x[seq_len(max(lengths(i)))]))
#     Cotswold Water Park Pit 28 Cotswold Water Park Pit 14 Robinswood Hill
#[1,] "Pit 28"                   "14"                       "Robinswood"   
#[2,] "28"                       NA                         NA             
#[3,] "CWP Pit 28"               NA                         NA             
#[4,] "Cotswold 28"              NA                         NA             

基本 R 选项将是:

result <- sapply(approved, function(x) grep(gsub('\\s+', '|', x), messy, value = TRUE))
result
#$`Cotswold Water Park Pit 28`
#[1] "Pit 28"      "28"          "CWP Pit 28"  "Cotswold 28"

#$`Cotswold Water Park Pit 14`
#[1] "Pit 28"      "CWP Pit 28"  "Cotswold 28" "14"         

#$`Robinswood Hill`
#[1] "Robinswood"

这里的逻辑是我们在approved的每个空白处插入 pipe ( | ) 符号,如果任何单词匹配,则返回messy的单词。

要获得与所示格式相同的 output,我们可以执行以下操作:

sapply(result, `[`, 1:max(lengths(result)))

#     Cotswold Water Park Pit 28 Cotswold Water Park Pit 14 Robinswood Hill
#[1,] "Pit 28"                   "Pit 28"                   "Robinswood"   
#[2,] "28"                       "CWP Pit 28"               NA             
#[3,] "CWP Pit 28"               "Cotswold 28"              NA             
#[4,] "Cotswold 28"              "14"                       NA   

tidyverse/tidytext 解决方案

首先将它们变成数据框

require(tidyverse) 
require(tidytext)


## create dataframe for approved 

approved <- c("Cotswold Water Park Pit 28", "Cotswold Water Park Pit 14", "Robinswood Hill")


## create dataframe for messy 

messy <- c("Pit 28", "28", "CWP Pit 28", "Cotswold 28", "14", "Robinswood")

然后使用 tidytext 将它们拆分为 1 个单词 = 1 行,我喜欢在行数发生变化时添加 ID...

## split into words 

approved_df <- 
tibble(approved = approved) %>%  
  rownames_to_column('approved_id') %>% 
  unnest_tokens(words, approved, 'words', drop = FALSE)

approved_df %>%  head 

# A tibble: 6 x 3
# approved_id approved                   words   
# <chr>       <chr>                      <chr>   
# 1 1           Cotswold Water Park Pit 28 cotswold
# 2 1           Cotswold Water Park Pit 28 water   
# 3 1           Cotswold Water Park Pit 28 park    
# 4 1           Cotswold Water Park Pit 28 pit     
# 5 1           Cotswold Water Park Pit 28 28      
# 6 2           Cotswold Water Park Pit 14 cotswold
    
messy_df <- 
tibble(messy = messy) %>%  
  rownames_to_column('messy_id') %>% 
  unnest_tokens(words, messy, 'words', drop = FALSE)

messy_df %>%  head          
# # A tibble: 6 x 3
# messy_id messy      words
# <chr>    <chr>      <chr>
# 1 1        Pit 28     pit  
# 2 1        Pit 28     28   
# 3 2        28         28   
# 4 3        CWP Pit 28 cwp  
# 5 3        CWP Pit 28 pit  
# 6 3        CWP Pit 28 28   

最后,在单词级别加入两个数据框,计算重叠中有多少单词,然后为每个“混乱”字符串分配一个“已批准”

     ## join the data sets and rank by the number of words in the overlap
  
  messy_df %>%  left_join(approved_df) %>%  
    group_by(messy, messy_id, approved, approved_id) %>%  
    summarise(n_row = n()) %>%  
    ungroup %>%  
    group_by(messy, messy_id) %>%  
    mutate(approved_rank = rank(desc(n_row))) %>%  
    ungroup %>%  
    filter(approved_rank == 1) %>%  
    arrange(messy_id)



  # Joining, by = "words"
  # # A tibble: 6 x 6
  # messy       messy_id approved                   approved_id n_row approved_rank
  # <chr>       <chr>    <chr>                      <chr>       <int>         <dbl>
  # 1 Pit 28      1        Cotswold Water Park Pit 28 1               2             1
  # 2 28          2        Cotswold Water Park Pit 28 1               1             1
  # 3 CWP Pit 28  3        Cotswold Water Park Pit 28 1               2             1
  # 4 Cotswold 28 4        Cotswold Water Park Pit 28 1               2             1
  # 5 14          5        Cotswold Water Park Pit 14 2               1             1
  # 6 Robinswood  6        Robinswood Hill            3               1             1

这是一个高度灵活的 regex_join 解决方案

library( fuzzyjoin )
library( data.table )
#make data.frames
messy.df <- data.frame( messy ); approved.df <- data.frame( approved )
#create regexes
messy.df$regex <- gsub( " ", "|", messy.df$messy )
#regex join
ans <- regex_full_join( approved.df, messy.df, by = c("approved" = "regex") )
#cast to wide
dcast( setDT(ans), messy~approved, value.var = "messy")[, -1]

#      Cotswold Water Park Pit 14 Cotswold Water Park Pit 28 Robinswood Hill
#   1:                         14                       <NA>            <NA>
#   2:                       <NA>                         28            <NA>
#   3:                 CWP Pit 28                 CWP Pit 28            <NA>
#   4:                Cotswold 28                Cotswold 28            <NA>
#   5:                     Pit 28                     Pit 28            <NA>
#   6:                       <NA>                       <NA>      Robinswood

这是使用stringi的一种可能性(它比stringr更快,并且通常比基本 R 正则表达式操作更快。当您具有可变长度时,此解决方案返回一个应该比矩阵更有效的列表。

library(stringi)
messy_ors <- stri_replace_all(messy, " ", "|")  
lapply(approved, function(x) messy[stri_detect(x, regex = messy_ors)]) 

$`Cotswold Water Park Pit 28`
[1] "Pit 28"      "28"          "CWP Pit 28"  "Cotswold 28"

$`Cotswold Water Park Pit 14`
[1] "Pit 28"      "CWP Pit 28"  "Cotswold 28" "14"         

$`Robinswood Hill`
[1] "Robinswood"

如果你真的需要一个矩阵,你可以将 output 转换为:

n <- max(lengths(out))
sapply(out, function(x) x[1:n])

我不确定我在下面的尝试是否符合您的目的

res <- within(
  expand.grid(messy, approved),
  matched <- do.call(
    function(...) lengths(mapply(intersect, ...)) > 0,
    unname(expand.grid(strsplit(messy, " "), strsplit(approved, " ")))
  )
)

给予

          Var1                       Var2 matched
1       Pit 28 Cotswold Water Park Pit 28    TRUE
2           28 Cotswold Water Park Pit 28    TRUE
3   CWP Pit 28 Cotswold Water Park Pit 28    TRUE
4  Cotswold 28 Cotswold Water Park Pit 28    TRUE
5           14 Cotswold Water Park Pit 28   FALSE
6   Robinswood Cotswold Water Park Pit 28   FALSE
7       Pit 28 Cotswold Water Park Pit 14    TRUE
8           28 Cotswold Water Park Pit 14   FALSE
9   CWP Pit 28 Cotswold Water Park Pit 14    TRUE
10 Cotswold 28 Cotswold Water Park Pit 14    TRUE
11          14 Cotswold Water Park Pit 14    TRUE
12  Robinswood Cotswold Water Park Pit 14   FALSE
13      Pit 28            Robinswood Hill   FALSE
14          28            Robinswood Hill   FALSE
15  CWP Pit 28            Robinswood Hill   FALSE
16 Cotswold 28            Robinswood Hill   FALSE
17          14            Robinswood Hill   FALSE
18  Robinswood            Robinswood Hill    TRUE

如果您想获得帖子中显示的 output,您可以在res上进一步玩一些技巧,例如,

res2 <- do.call(
  cbind,
  lapply(
    u <- with(subset(res, matched), split(Var1, Var2)),
    function(x) `length<-`(as.vector(x), max(lengths(u)))
  )
)

这样

> res2
     Cotswold Water Park Pit 28 Cotswold Water Park Pit 14 Robinswood Hill
[1,] "Pit 28"                   "Pit 28"                   "Robinswood"
[2,] "28"                       "CWP Pit 28"               NA
[3,] "CWP Pit 28"               "Cotswold 28"              NA
[4,] "Cotswold 28"              "14"                       NA

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM