String matching in R: finding the best match

Question

I have two vectors of words.

Corpus<- c('animalada', 'fe', 'fernandez', 'ladrillo')

Lexicon<- c('animal', 'animalada', 'fe', 'fernandez', 'ladr', 'ladrillo')

I need to find the best match between the lexicon and the corpus. I have tried many methods; this is one of them.

library(stringr)

match<- paste(Lexicon,collapse= '|^') # I use the stemming method (snowball), so the words in Lexicon are word roots

test<- str_extract_all(Corpus, match, simplify= T)

test

[,1]
[1,] "animal"
[2,] "fe"
[3,] "fe"
[4,] "ladr"

However, the matches should be:

[1,] "animalada"
[2,] "fe"
[3,] "fernandez"
[4,] "ladrillo"

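Instead, each word is matched to the first lexicon entry, in alphabetical order, that fits it. By the way, these vectors are a sample of a much larger list that I have.

I have not tried regex() because I am not sure how it works. Perhaps the solution lies there.

Could you help me solve this problem? Thank you for your help.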

Answer 1

You can simply use the match() function.

Index <- match(Corpus, Lexicon)

Index
[1] 2 3 4 6

Lexicon[Index]
[1] "animalada"  "fe"   "fernandez"  "ladrillo"
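As a small sketch of what this call does: base R's match() returns, for each element of its first argument, the position of the first exactly equal element of the second, or NA when there is none. The word 'fernandezes' below is a hypothetical entry, not part of the question's data, added only to illustrate the NA case.

```r
Corpus  <- c('animalada', 'fe', 'fernandez', 'ladrillo')
Lexicon <- c('animal', 'animalada', 'fe', 'fernandez', 'ladr', 'ladrillo')

# Exact, whole-string lookup: position of each Corpus word in Lexicon
Index <- match(Corpus, Lexicon)
matched <- Lexicon[Index]

# 'fernandezes' is a hypothetical word that is not in Lexicon.
# match() does not fall back to a prefix or stem; it returns NA.
missing_idx <- match('fernandezes', Lexicon)
```

So this approach fits the sample vectors because every corpus word appears verbatim in the lexicon; it would not pick a stem for a word that is absent from it.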

Answer 2

You can sort Lexicon by the number of characters in each pattern, in descending order, so that the best match comes first:

match<- paste(Lexicon[order(-nchar(Lexicon))], collapse = '|^')

test<- str_extract_all(Corpus, match, simplify= T)

test
#     [,1]       
#[1,] "animalada"
#[2,] "fe"       
#[3,] "fernandez"
#[4,] "ladrillo"
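The sorting matters because a backtracking regex engine tries the alternatives of a|b|c from left to right and keeps the first one that succeeds. The following is a minimal sketch of the same idea in base R, not the answer's exact code: longest_prefix is a hypothetical helper name, regexpr() with perl = TRUE stands in for stringr, and the lexicon is assumed to contain no regex metacharacters. It also wraps the alternatives in ^(?:...) so that every entry is anchored, whereas collapse = '|^' leaves the first entry without a ^.

```r
Corpus  <- c('animalada', 'fe', 'fernandez', 'ladrillo')
Lexicon <- c('animal', 'animalada', 'fe', 'fernandez', 'ladr', 'ladrillo')

# Hypothetical helper: longest Lexicon entry that is a prefix of each word
longest_prefix <- function(words, lexicon) {
  # Longest entries first, so the engine tries them before shorter ones
  sorted <- lexicon[order(-nchar(lexicon))]
  # A single leading ^ plus a non-capturing group anchors every alternative
  pattern <- paste0('^(?:', paste(sorted, collapse = '|'), ')')
  hit <- regexpr(pattern, words, perl = TRUE)
  # Words without any match stay NA
  out <- rep(NA_character_, length(words))
  out[hit > 0] <- regmatches(words, hit)
  out
}

result <- longest_prefix(Corpus, Lexicon)
# "animalada" "fe" "fernandez" "ladrillo"
```

A word such as 'animales' (hypothetical) would fall back to the stem 'animal', and a word with no lexicon entry at its start would give NA.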

Answer 3

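I tried both approaches, and the correct one is the one suggested by @Psidorm. If you use the match() function, it finds a match in any part of the word, not necessarily at the beginning. For example: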

Corpus<- c('tambien')
Lexicon<- c('bien')
match(Corpus,Lexicon)

The result is "tambien", but this is incorrect.
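This is easy to check in a fresh session. As far as I know, base R's match() only reports whole-string equality, so it returns NA for this pair. A hit in the middle of a word is what a regular-expression search gives when an alternative has no ^ anchor, and paste(..., collapse = '|^') leaves the first entry unanchored. The sketch below uses base regexpr() with perl = TRUE as a stand-in for stringr; that substitution is an assumption made to keep the example free of dependencies.

```r
Corpus  <- c('tambien')
Lexicon <- c('bien')

# match() compares whole strings, so 'tambien' against 'bien' gives no hit
exact <- match(Corpus, Lexicon)

# The pasted pattern has no '^' on its first alternative,
# so a regex search can match in the middle of the word
pattern <- paste(Lexicon, collapse = '|^')
hit <- regexpr(pattern, Corpus, perl = TRUE)
inside <- regmatches(Corpus, hit)
```

Here exact is NA and inside is "bien", which suggests the mid-word match comes from the unanchored pattern rather than from match() itself.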

Thanks again for your help!

3 solutions

Solution 1
1 2017-09-23 01:59:20

Solution 2
0 2017-09-23 01:54:24

Solution 3
0 2017-09-27 03:16:36
