String keyword matching

Question

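With Levenshtein, "how are you", "hw r u", "how are u" and "hw ar you" can all be compared as the same.

Is there any way I can achieve the same thing when the input is a longer phrase?

Phrase

Hi, my name is John Doe. I live in New York. What is your name?

Phrase

My name is Bruce. What is your name

Key phrase

What is your name

Response

My name is Batman.

I get input from the user, and I have a table that lists possible requests and their responses. For example, the user will ask for a name. Is there a way to check whether a sentence contains a key phrase (such as "What is your name") and, if it is found, return the matching response?

Like this: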

$phrase = ' hi, my name is john doe. I live in new york. What is your name?';

//I know this one will work
if (strpos($phrase,"What is your name") !== false) {
    return $response;
}

//but what if the user mistype it 
if (strpos($phrase,"Wht's your name") !== false) {
    return $response;
}

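Is there a way I can achieve this? Levenshtein only works well when the input string is not much longer than the string it is compared with.

Like

Hi, what is your name

My name is Batman.

But if the input is as long as this

Hi, my name is John Doe. I live in New York. What is your name?

it does not work well. If the table also contains shorter phrases, it picks one of those because its distance is smaller, and returns the wrong response.

I was thinking that another way would be to check for some key phrases. Any ideas on how to achieve this?

I was working on something like the following, but maybe there is a better way: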

$samplePhrase = 'hi, im spongebob, i work at krabby patty. i love patties. Whts your name my friend';

$keyPhrase = 'What is your name';

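1. Get the first character of keyPhrase. That would be "W".
2. Iterate over the $samplePhrase characters and compare each one with the first character of keyPhrase: h, i, (space), i, m, (space), s, p, etc.
3. If keyPhrase.char = samplePhrase.currentChar:
4. Get keyPhrase.length.
5. Get the index of samplePhrase.currentChar.
6. Get the substring of samplePhrase that starts at the currentChar index and is keyPhrase.length characters long.
7. The first candidate would be "work at krabby pa".
8. Compare "work at krabby pa" with $keyPhrase ("What is your name") using the Levenshtein distance.
9. To check it more thoroughly, use similar_text.
10. If they are not equal and the distance is too large, repeat the process.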
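
The steps above could be sketched roughly as follows. This is only an illustration under a few assumptions: best_window() is a hypothetical helper name, and unlike the steps above it tries every offset instead of only those where the first character matches. Because the window is exactly as long as the key phrase, typos that drop characters (such as "Whts" for "What is") leave part of the neighbouring text inside the window, which raises the distance.

```php
<?php
// Hypothetical helper: slide a window the length of the key phrase across
// the sample and keep the window with the smallest Levenshtein distance.
function best_window(string $sample, string $key): array
{
    $sample = strtolower($sample);
    $key    = strtolower($key);
    $len    = strlen($key);
    $best   = ['distance' => PHP_INT_MAX, 'offset' => -1, 'text' => ''];

    // Last offset at which a full-length window still fits.
    $last = max(0, strlen($sample) - $len);
    for ($i = 0; $i <= $last; $i++) {
        $window   = substr($sample, $i, $len);
        $distance = levenshtein($key, $window);
        if ($distance < $best['distance']) {
            $best = ['distance' => $distance, 'offset' => $i, 'text' => $window];
        }
    }
    return $best;
}

$samplePhrase = 'hi, im spongebob, i work at krabby patty. i love patties. Whts your name my friend';
$keyPhrase    = 'What is your name';

$best = best_window($samplePhrase, $keyPhrase);
printf("closest window: '%s' (distance %d)\n", $best['text'], $best['distance']);
```

The caller still has to decide how large a distance counts as a match; that threshold is not something the sketch can choose for you.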

Answer 1

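My suggestion would be to generate a list of n-grams from the input phrase and calculate the edit distance between each n-gram and the key phrase.

Example: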

key phrase: "What is your name"
phrase 1: "hi, my name is john doe. I live in new york. What is your name?"
phrase 2: "My name is Bruce. wht's your name"

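A possible matching n-gram will be between 3 and 4 words long, so we create all 3-grams and 4-grams for each phrase. We should also normalize the strings by removing punctuation and lowercasing everything.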

phrase 1 3-grams:
"hi my name", "my name is", "name is john", "is john doe", "john doe i", "doe i live"... "what is your", "is your name"
phrase 1 4-grams:
"hi my name is", "my name is john", "name is john doe", "is john doe i"... "what is your name"

phrase 2 3-grams:
"my name is", "name is bruce", "is bruce wht's", "bruce wht's your", "wht's your name"
phrase 2 4-grams:
"my name is bruce", "name is bruce wht's", "is bruce wht's your", "bruce wht's your name"

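Next, you can compute the Levenshtein distance for each n-gram, which should solve the use case you presented above. If you need to normalize each word further, you can use a phonetic encoder such as Double Metaphone or NYSIIS. However, I ran a test with all the "common" phonetic encoders and in your case they showed no significant improvement; phonetic encoders are better suited to names.

I have limited experience with PHP, but here is a code example: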

<?php
function extract_ngrams($phrase, $min_words, $max_words) {
    echo "Calculating N-Grams for phrase: $phrase\n";
    $ngrams = array();
    $words  = str_word_count(strtolower($phrase), 1);
    $word_count = count($words);

    for ($i = 0; $i <= $word_count - $min_words; $i++) {
        for ($j = $min_words; $j <= $max_words && ($j + $i) <= $word_count; $j++) {
            $ngrams[] = implode(' ',array_slice($words, $i, $j));
        }
    }
    return array_unique($ngrams);
}

function contains_key_phrase($ngrams, $key) {
    foreach ($ngrams as $ngram) {
        if (levenshtein($key, $ngram) < 5) {
            echo "found match: $ngram\n";
            return true;
        }
    }
    return false;
}

$key_phrase = "what is your name";
$phrases = array(
        "hi, my name is john doe. I live in new york. What is your name?",
        "My name is Bruce. wht's your name"
        );
$min_words = 3;
$max_words = 4;

foreach ($phrases as $phrase) {
    $ngrams = extract_ngrams($phrase, $min_words, $max_words);
    if (contains_key_phrase($ngrams,$key_phrase)) {
        echo "Phrase [$phrase] contains the key phrase [$key_phrase]\n";
    }
}
?>

The output looks like this:

Calculating N-Grams for phrase: hi, my name is john doe. I live in new york. What is your name?
found match: what is your name
Phrase [hi, my name is john doe. I live in new york. What is your name?] contains the key phrase [what is your name]
Calculating N-Grams for phrase: My name is Bruce. wht's your name
found match: wht's your name
Phrase [My name is Bruce. wht's your name] contains the key phrase [what is your name]
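
Since the question mentions a table of requests and responses, the same idea could be extended to pick the closest entry instead of returning on the first hit. The following is only a sketch: word_ngrams(), find_response(), the sample table and the default threshold of 4 are all assumptions made for illustration.

```php
<?php
// Build all word n-grams between $min and $max words long (same idea as
// extract_ngrams() above, repeated so this sketch runs on its own).
function word_ngrams(string $phrase, int $min, int $max): array
{
    $words  = str_word_count(strtolower($phrase), 1);
    $count  = count($words);
    $ngrams = [];
    for ($i = 0; $i < $count; $i++) {
        for ($n = $min; $n <= $max && $i + $n <= $count; $n++) {
            $ngrams[] = implode(' ', array_slice($words, $i, $n));
        }
    }
    return array_unique($ngrams);
}

// Return the response whose key phrase is closest to any n-gram of the
// input, or null when nothing is within $maxDistance edits.
function find_response(string $input, array $table, int $maxDistance = 4): ?string
{
    $bestResponse = null;
    $bestDistance = $maxDistance + 1;
    foreach ($table as $key => $response) {
        $keyWords = str_word_count($key);
        // Also try n-grams one word shorter than the key, so a contraction
        // such as "wht's" can stand in for two words.
        $ngrams = word_ngrams($input, max(1, $keyWords - 1), $keyWords);
        foreach ($ngrams as $ngram) {
            $distance = levenshtein($key, $ngram);
            if ($distance < $bestDistance) {
                $bestDistance = $distance;
                $bestResponse = $response;
            }
        }
    }
    return $bestResponse;
}

// Hypothetical request/response table (keys are lowercase key phrases).
$table = [
    'what is your name' => 'My name is Batman.',
    'how are you'       => 'I am fine.',
];

echo find_response("My name is Bruce. wht's your name", $table), "\n";
```

Keeping the smallest distance across the whole table, rather than the first n-gram under the threshold, is one way to reduce the risk of a short key phrase winning by accident.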

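Edit: I noticed some suggestions to add phonetic encoding to each word in the generated n-grams. I am not sure phonetic encoding is the best solution to this problem, because the encoders are mostly geared towards stemming names (American, German or French, depending on the algorithm) and are not very good at stemming plain words.

I actually wrote a test in Java to verify this (the encoders are more readily available there). Here is the output: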

===========================
Created new phonetic matcher
    Engine: Caverphone2
    Key Phrase: what is your name
    Encoded Key Phrase: WT11111111 AS11111111 YA11111111 NM11111111
Found match: [What is your name?] Encoded: WT11111111 AS11111111 YA11111111 NM11111111
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Phrase: [My name is Bruce. wht's your name] MATCH: false
===========================
Created new phonetic matcher
    Engine: DoubleMetaphone
    Key Phrase: what is your name
    Encoded Key Phrase: AT AS AR NM
Found match: [What is your] Encoded: AT AS AR
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Found match: [wht's your name] Encoded: ATS AR NM
Phrase: [My name is Bruce. wht's your name] MATCH: true
===========================
Created new phonetic matcher
    Engine: Nysiis
    Key Phrase: what is your name
    Encoded Key Phrase: WAT I YAR NAN
Found match: [What is your name?] Encoded: WAT I YAR NAN
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Found match: [wht's your name] Encoded: WT YAR NAN
Phrase: [My name is Bruce. wht's your name] MATCH: true
===========================
Created new phonetic matcher
    Engine: Soundex
    Key Phrase: what is your name
    Encoded Key Phrase: W300 I200 Y600 N500
Found match: [What is your name?] Encoded: W300 I200 Y600 N500
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Phrase: [My name is Bruce. wht's your name] MATCH: false
===========================
Created new phonetic matcher
    Engine: RefinedSoundex
    Key Phrase: what is your name
    Encoded Key Phrase: W06 I03 Y09 N8080
Found match: [What is your name?] Encoded: W06 I03 Y09 N8080
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Found match: [wht's your name] Encoded: W063 Y09 N8080
Phrase: [My name is Bruce. wht's your name] MATCH: true

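I used a Levenshtein distance of 4 when running these tests, but I am pretty sure you will find multiple edge cases where a phonetic encoder fails to match correctly. Looking at the examples, you can see that, because of the stemming done by the encoders, you are actually more likely to get false positives when you use them this way. Keep in mind that these algorithms were originally intended to find people in the census who have the same name, not to decide which English words "sound" the same.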

Answer 2

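What you are trying to achieve is a fairly complex natural language processing task, and it usually requires parsing.

What I would suggest is to create a sentence tokenizer that splits the phrase into sentences. Then tokenize each sentence by splitting on whitespace and punctuation, and probably also rewrite some abbreviations into a more standard form.

After that, you can create custom logic that traverses each sentence's token list looking for a specific meaning. For example, ['...', 'what', '...', '...', 'your', 'name', '...', '...', '?'] can also mean "what is your name". The sentence could be "So, what is your name?" or "What might your name be?"

I am adding code as an example. I am not saying you should use something this simple. The code below uses NlpTools, a natural language processing library in PHP (I am involved in the library, so feel free to assume I am biased).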

 <?php

 include('vendor/autoload.php');

 use \NlpTools\Tokenizers\ClassifierBasedTokenizer;
 use \NlpTools\Classifiers\Classifier;
 use \NlpTools\Tokenizers\WhitespaceTokenizer;
 use \NlpTools\Tokenizers\WhitespaceAndPunctuationTokenizer;
 use \NlpTools\Documents\Document;

 class EndOfSentence implements Classifier
 {
     public function classify(array $classes, Document $d)
     {
         list($token, $before, $after) = $d->getDocumentData();

         $lastchar = substr($token, -1);
         $dotcnt = count(explode('.',$token))-1;

         if (count($after)==0)
             return 'EOW';

         // for some abbreviations
         if ($dotcnt>1)
             return 'O';

         if (in_array($lastchar, array(".","?","!")))
             return 'EOW';

         // default: this token does not end a sentence
         return 'O';
     }
 }

 function normalize($s) {
     // get this somewhere static
     $hash_table = array(
         'whats'=>'what is',
         'whts'=>'what is',
         'what\'s'=>'what is',
         '\'s'=>'is',
         'n\'t'=>'not',
         'ur'=>'your'
         // .... more ....
     );

     $s = mb_strtolower($s,'utf-8');
     if (isset($hash_table[$s]))
         return $hash_table[$s];
     return $s;
 }

 $whitespace_tok = new WhitespaceTokenizer();
 $punct_tok = new WhitespaceAndPunctuationTokenizer();
 $sentence_tok = new ClassifierBasedTokenizer(
     new EndOfSentence(),
     $whitespace_tok
 );

 $text = 'hi, my name is john doe. I live in new york. What\'s your name? whts ur name';

 foreach ($sentence_tok->tokenize($text) as $sentence) {
     $words = $whitespace_tok->tokenize($sentence);
     $words = array_map(
         'normalize',
         $words
     );
     $words = call_user_func_array(
         'array_merge',
         array_map(
             array($punct_tok,'tokenize'),
             $words
         )
     );

     // decide what this sequence of tokens is
     print_r($words);
 }
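
The "decide what this sequence of tokens is" step could, in its simplest form, be an ordered check that certain words appear in the token list with anything in between, as in the ['...', 'what', '...', 'your', 'name', '...'] example above. A minimal sketch, assuming the tokens are already lowercased and normalized; tokens_match() is a hypothetical helper and not part of NlpTools:

```php
<?php
// Hypothetical helper: true when every word of $pattern occurs in $tokens
// in the same order, with any number of other tokens in between.
function tokens_match(array $tokens, array $pattern): bool
{
    $next  = 0; // index of the pattern word we are still looking for
    $count = count($pattern);
    foreach ($tokens as $token) {
        if ($next < $count && $token === $pattern[$next]) {
            $next++;
        }
    }
    return $next === $count;
}

$pattern = ['what', 'your', 'name'];
var_dump(tokens_match(['so', ',', 'what', 'is', 'your', 'name', '?'], $pattern)); // bool(true)
var_dump(tokens_match(['my', 'name', 'is', 'bruce'], $pattern));                  // bool(false)
```

A check this loose will also accept sentences that merely contain those words in order, so real use would probably need stricter rules per intent.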

Answer 3

First expand all the shorthand forms, for example replace "wht's" with "What is":

$txt=$_POST['txt'];
$txt=str_ireplace("hw r u","how are You",$txt);
// A space before and after the shorthand is required; otherwise every
// occurrence of "hw" is replaced, even inside a longer word.
$txt=str_ireplace(" hw "," how ",$txt);
$txt=str_ireplace(" r "," are ",$txt);
$txt=str_ireplace(" u "," you ",$txt);
$txt=str_ireplace(" wht's "," What is ",$txt);

Likewise, add as many replacements as you need. Then simply check the text for each possible question and get its position:

if (strpos($txt,"What is your name") !== false) {// "!== false" is needed: strpos() returns 0 when the match is at the start
    return $response;
}
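
As a variation on the replacements above, a regular expression with word boundaries avoids having to pad every shorthand with spaces, so it also works at the start or end of the text and next to punctuation. This is a sketch only: expand_shorthand() and the table contents are assumptions, not part of the answer.

```php
<?php
// Hypothetical shorthand table; extend it as needed.
$shorthand = [
    'hw'    => 'how',
    'r'     => 'are',
    'u'     => 'you',
    "wht's" => 'what is',
    'whts'  => 'what is',
];

function expand_shorthand(string $text, array $shorthand): string
{
    foreach ($shorthand as $short => $full) {
        // \b anchors the match at word boundaries, so "hw" inside a longer
        // word is left alone; the i flag makes the match case-insensitive.
        $pattern = '/\b' . preg_quote($short, '/') . '\b/i';
        $text    = preg_replace($pattern, $full, $text);
    }
    return $text;
}

$txt = expand_shorthand("Hi, hw r u? Wht's your name", $shorthand);
echo $txt, "\n"; // Hi, how are you? what is your name

// stripos() is case-insensitive; compare with !== false because a match
// at position 0 is a valid result.
var_dump(stripos($txt, 'what is your name') !== false); // bool(true)
```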

Answer 4

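You might consider using the soundex function to convert the input string into a phonetically equivalent encoding, and then proceed with your search.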
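
A minimal sketch of that idea, assuming each word is encoded separately with PHP's built-in soundex() and the encoded key phrase is then searched for in the encoded input. soundex_phrase() is a hypothetical helper name. As the Soundex results in Answer 1 suggest, contractions such as "wht's" may still fail to match this way.

```php
<?php
// Encode every word of a phrase with PHP's built-in soundex().
function soundex_phrase(string $phrase): string
{
    $words = str_word_count(strtolower($phrase), 1);
    return implode(' ', array_map('soundex', $words));
}

$key   = soundex_phrase('what is your name');
$input = soundex_phrase('hi, wat is yor name');

echo $key, "\n";   // W300 I200 Y600 N500
echo $input, "\n"; // H000 W300 I200 Y600 N500

// Search for the encoded key phrase inside the encoded input.
var_dump(strpos($input, $key) !== false); // bool(true)
```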

4 answers

Answer 1
1 (accepted) 2013-09-19 21:34:55

Answer 2
1 2013-09-20 20:06:20

Answer 3
0

Answer 4
0 2013-09-17 06:42:44
