

String key phrase matching

With levenshtein, "how are you", "hw r u", "how are u" and "hw ar you" can be compared as the same, but consider phrases like these:




hi, my name is john doe. I live in new york. What is your name?


My name is Bruce. wht's your name





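I'm getting the input from a user. I have a table that lists possible requests together with their responses. For example, the user will ask about "its name". Is there a way to check whether a sentence contains a key phrase such as "What is your name" and, if it is found, return the matching response?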


$phrase = ' hi, my name is john doe. I live in new york. What is your name?';

// I know this one will work
if (strpos($phrase, "What is your name") !== false) {
    return $response;
}

// but what if the user mistypes it?
if (strpos($phrase, "Wht's your name") !== false) {
    return $response;
}

Is there a way I can achieve this? levenshtein only works well when the input string is not much longer than the string it is compared against.





hi, my name is john doe. I live in new york. What is your name?

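It does not work well on input like that. If there is a shorter phrase, levenshtein finds a smaller distance to that shorter phrase and the wrong response is returned.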
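To make the length problem concrete, here is a small sketch using PHP's built-in levenshtein(). The variable names are only illustrative. A short mistyped input stays close to the key phrase, while a long input that contains the key phrase verbatim ends up far away, because every surrounding character counts as an edit.

```php
<?php
$key_phrase  = 'what is your name';
$short_input = 'wht is your name'; // one missing letter
$long_input  = 'hi, my name is john doe. i live in new york. what is your name?';

// Short input: a single deletion separates it from the key phrase.
$short_distance = levenshtein($short_input, $key_phrase);

// Long input: the key phrase is a substring, so the distance is exactly
// the number of extra characters that would have to be deleted.
$long_distance = levenshtein($long_input, $key_phrase);

echo "short input distance: $short_distance\n";
echo "long input distance:  $long_distance\n";
echo "extra characters:     " . (strlen($long_input) - strlen($key_phrase)) . "\n";
```

Any fixed threshold that accepts the short typo will reject the long sentence, which is why the comparison has to be made against parts of the input rather than the whole string.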

I was thinking that another approach would be to check for some key phrases. Any ideas on how to achieve this?


$samplePhrase = 'hi, im spongebob, i work at krabby patty. i love patties. Whts your name my friend';

$keyPhrase = 'What is your name';
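  1. Get the first character of keyPhrase. That would be "W".
  2. Iterate over the characters of $samplePhrase and compare each one with the first character of keyPhrase:
  3. h, i, ,i, m, ,s, p, etc.
  4. If keyPhrase.char = samplePhrase.currentChar:
  5. get keyPhrase.length,
  6. get the index of samplePhrase.currentChar,
  7. take the substring of samplePhrase that starts at the currentChar index and is keyPhrase.length long.
  8. The first candidate would be "work at krabby pa".
  9. Compare "work at krabby pa" with $keyPhrase ("What is your name") using levenshtein distance.
  10. To check it more thoroughly, also use similar_text.
  11. If they are not equal and the distance is too large, repeat the process.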
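The steps above could be sketched roughly as follows. This is only an illustration: find_fuzzy_window and its $max_distance parameter are hypothetical names, the comparison is made case-insensitive so that "w" in "work" matches the "W" of the key phrase, and the similar_text check from step 10 is left out.

```php
<?php
/**
 * Hypothetical helper: slide a window of the key phrase's length over the
 * sample, starting only where the first character matches, and return the
 * first window whose Levenshtein distance to the key phrase is small enough.
 */
function find_fuzzy_window(string $sample, string $key, int $max_distance): ?string
{
    $sample = strtolower($sample); // compare case-insensitively
    $key = strtolower($key);
    $key_length = strlen($key);
    if ($key_length === 0) {
        return null;
    }

    for ($i = 0, $n = strlen($sample); $i < $n; $i++) {
        if ($sample[$i] !== $key[0]) {
            continue; // steps 2-4: wait for the first character of the key
        }
        $window = substr($sample, $i, $key_length); // steps 5-7
        if (levenshtein($key, $window) <= $max_distance) { // step 9
            return $window;
        }
        // step 11: distance too large, keep scanning
    }
    return null;
}

$samplePhrase = 'hi, im spongebob, i work at krabby patty. i love patties. Whts your name my friend';
$keyPhrase = 'What is your name';
var_dump(find_fuzzy_window($samplePhrase, $keyPhrase, 6));
```

With a threshold of 6 the "work at krabby pa" window is rejected and the window that starts at "whts" is returned. Because the window has a fixed character length, a shortened phrase drags in trailing words ("my"), which is one reason a word-based split can behave better.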



key phrase: "What is your name"
phrase 1: "hi, my name is john doe. I live in new york. What is your name?"
phrase 2: "My name is Bruce. wht's your name"


phrase 1 3-grams:
"hi my name", "my name is", "name is john", "is john doe", "john doe I", "doe I live"... "what is your", "is your name"
phrase 1 4-grams:
"hi my name is", "my name is john", "name is john doe", "is john doe I"... "what is your name"

phrase 2 3-grams:
"my name is", "name is bruce", "is bruce wht's", "bruce wht's your", "wht's your name"
phrase 2 4-grams:
"my name is bruce", "name is bruce wht's", "is bruce wht's your", "bruce wht's your name"

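Next you can compute the levenshtein distance on each n-gram, which should solve the use case you presented above. If you need to normalize each word further, you can use a phonetic encoder such as Double Metaphone or NYSIIS. However, I ran a test with all the "common" phonetic encoders and in your case they showed no significant improvement; phonetic encoders are better suited to names.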


function extract_ngrams($phrase, $min_words, $max_words) {
    echo "Calculating N-Grams for phrase: $phrase\n";
    $ngrams = array();
    // lower-case and split into words (apostrophes and hyphens stay inside words)
    $words  = str_word_count(strtolower($phrase), 1);
    $word_count = count($words);

    // collect every run of $min_words..$max_words consecutive words
    for ($i = 0; $i <= $word_count - $min_words; $i++) {
        for ($j = $min_words; $j <= $max_words && ($j + $i) <= $word_count; $j++) {
            $ngrams[] = implode(' ', array_slice($words, $i, $j));
        }
    }
    return array_unique($ngrams);
}

function contains_key_phrase($ngrams, $key) {
    foreach ($ngrams as $ngram) {
        // accept the first n-gram within 4 edits of the key phrase
        if (levenshtein($key, $ngram) < 5) {
            echo "found match: $ngram\n";
            return true;
        }
    }
    return false;
}

$key_phrase = "what is your name";
$phrases = array(
        "hi, my name is john doe. I live in new york. What is your name?",
        "My name is Bruce. wht's your name"
);
$min_words = 3;
$max_words = 4;

foreach ($phrases as $phrase) {
    $ngrams = extract_ngrams($phrase, $min_words, $max_words);
    if (contains_key_phrase($ngrams, $key_phrase)) {
        echo "Phrase [$phrase] contains the key phrase [$key_phrase]\n";
    }
}

Calculating N-Grams for phrase: hi, my name is john doe. I live in new york. What is your name?
found match: what is your name
Phrase [hi, my name is john doe. I live in new york. What is your name?] contains the key phrase [what is your name]
Calculating N-Grams for phrase: My name is Bruce. wht's your name
found match: wht's your name
Phrase [My name is Bruce. wht's your name] contains the key phrase [what is your name]

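EDIT: I noticed some suggestions to add phonetic encoding to each word in the generated n-grams. I'm not sure phonetic encoding is the best answer to this problem, since these encoders are mostly geared towards stemming names (American, German or French, depending on the algorithm) and are not very good at stemming plain words.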


Created new phonetic matcher
    Engine: Caverphone2
    Key Phrase: what is your name
    Encoded Key Phrase: WT11111111 AS11111111 YA11111111 NM11111111
Found match: [What is your name?] Encoded: WT11111111 AS11111111 YA11111111 NM11111111
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Phrase: [My name is Bruce. wht's your name] MATCH: false
Created new phonetic matcher
    Engine: DoubleMetaphone
    Key Phrase: what is your name
    Encoded Key Phrase: AT AS AR NM
Found match: [What is your] Encoded: AT AS AR
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Found match: [wht's your name] Encoded: ATS AR NM
Phrase: [My name is Bruce. wht's your name] MATCH: true
Created new phonetic matcher
    Engine: Nysiis
    Key Phrase: what is your name
    Encoded Key Phrase: WAT I YAR NAN
Found match: [What is your name?] Encoded: WAT I YAR NAN
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Found match: [wht's your name] Encoded: WT YAR NAN
Phrase: [My name is Bruce. wht's your name] MATCH: true
Created new phonetic matcher
    Engine: Soundex
    Key Phrase: what is your name
    Encoded Key Phrase: W300 I200 Y600 N500
Found match: [What is your name?] Encoded: W300 I200 Y600 N500
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Phrase: [My name is Bruce. wht's your name] MATCH: false
Created new phonetic matcher
    Engine: RefinedSoundex
    Key Phrase: what is your name
    Encoded Key Phrase: W06 I03 Y09 N8080
Found match: [What is your name?] Encoded: W06 I03 Y09 N8080
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Found match: [wht's your name] Encoded: W063 Y09 N8080
Phrase: [My name is Bruce. wht's your name] MATCH: true

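I used a levenshtein distance of 4 when running these tests, but I am pretty sure you can find multiple edge cases where using the phonetic encoders will fail to match correctly. Looking at the examples, you can see that because of the stemming done by the encoders you are actually more likely to get false positives when using them this way. Keep in mind that these algorithms were originally intended to find people in a population census who have the same name, not to decide which English words "sound" the same.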


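What I would suggest is to create a sentence tokenizer that splits the phrase into sentences. Then tokenize each sentence, splitting on whitespace and punctuation, and probably also rewrite some abbreviations into a more standard form.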

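Then you can create custom logic that traverses the token list of each sentence looking for a specific meaning. For example, ['...','what','...','...','your','name','...','...','?'] can also mean "what is your name". The sentence could be "So, what is your name really?" or "What could your name be?"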

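I am adding code as an example. I am not saying you should use something this simple. The code below uses NlpTools, a natural language processing library in PHP (I am involved in the library, so feel free to assume I am biased).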



 include('vendor/autoload.php'); // assumption: NlpTools was installed with Composer

 use \NlpTools\Tokenizers\ClassifierBasedTokenizer;
 use \NlpTools\Classifiers\Classifier;
 use \NlpTools\Tokenizers\WhitespaceTokenizer;
 use \NlpTools\Tokenizers\WhitespaceAndPunctuationTokenizer;
 use \NlpTools\Documents\Document;

 // Decides for every whitespace token whether it ends a sentence ('EOW') or not ('O')
 class EndOfSentence implements Classifier
 {
     public function classify(array $classes, Document $d)
     {
         list($token, $before, $after) = $d->getDocumentData();

         $lastchar = substr($token, -1);
         $dotcnt = count(explode('.',$token))-1;

         // the last token always closes a sentence
         if (count($after)==0)
             return 'EOW';

         // for some abbreviations
         if ($dotcnt>1)
             return 'O';

         if (in_array($lastchar, array(".","?","!")))
             return 'EOW';

         return 'O';
     }
 }

 function normalize($s) {
     // get this somewhere static
     $hash_table = array(
         'whats'=>'what is',
         'whts'=>'what is',
         'what\'s'=>'what is',
         // .... more ....
     );

     $s = mb_strtolower($s,'utf-8');
     if (isset($hash_table[$s]))
         return $hash_table[$s];
     return $s;
 }

 $whitespace_tok = new WhitespaceTokenizer();
 $punct_tok = new WhitespaceAndPunctuationTokenizer();
 $sentence_tok = new ClassifierBasedTokenizer(
     new EndOfSentence(),
     $whitespace_tok // reconstructed argument: the tokenizer whose tokens get classified
 );

 $text = 'hi, my name is john doe. I live in new york. What\'s your name? whts ur name';

 foreach ($sentence_tok->tokenize($text) as $sentence) {
     $words = $whitespace_tok->tokenize($sentence);
     // reconstructed arguments: rewrite abbreviations word by word
     $words = array_map('normalize', $words);
     // reconstructed arguments: split each word on punctuation and flatten the result
     $words = call_user_func_array(
         'array_merge',
         array_map(array($punct_tok, 'tokenize'), $words)
     );

     // decide what this sequence of tokens is
 }
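One hedged way to "decide what this sequence of tokens is" is an in-order match: the pattern tokens must all appear in the sentence, in the same order, with anything in between. The helper name tokens_contain_in_order is hypothetical and the function does not depend on NlpTools; it only expects a flat array of normalized tokens like the one built above.

```php
<?php
/**
 * Hypothetical helper: true when every pattern token occurs in $tokens
 * in the given order, allowing arbitrary tokens in between.
 */
function tokens_contain_in_order(array $tokens, array $pattern): bool
{
    $pattern = array_values($pattern);
    $needed = count($pattern);
    if ($needed === 0) {
        return true; // an empty pattern matches anything
    }

    $next = 0; // index of the pattern token we are waiting for
    foreach ($tokens as $token) {
        if ($token === $pattern[$next]) {
            $next++;
            if ($next === $needed) {
                return true;
            }
        }
    }
    return false;
}

$sentence = array('so', ',', 'what', 'is', 'your', 'name', 'really', '?');
var_dump(tokens_contain_in_order($sentence, array('what', 'your', 'name')));
```

A check this loose also accepts sentences such as "what did your dog name it", so in practice it would probably need a limit on the gap between tokens or a few stricter patterns per meaning.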


$txt=str_ireplace("hw r u","how are You",$txt);
$txt=str_ireplace(" hw "," how ",$txt); // a space before and after the phrase is required, otherwise every occurrence of "hw" is replaced, even inside a word
$txt=str_ireplace(" r "," are ",$txt);
$txt=str_ireplace(" u "," you ",$txt);
$txt=str_ireplace(" wht's "," What is ",$txt);


if (strpos($phrase,"What is your name") !== false) { // the strict "!== false" is needed: strpos returns 0 for a match at the very start of the string
    return $response;
}
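The space-padded replacements above miss an abbreviation at the very start or end of the input, or one that is followed by punctuation. A possible variation, shown only as a sketch, uses preg_replace with \b word boundaries instead. The function name expand_abbreviations and the small replacement table are assumptions for illustration.

```php
<?php
/**
 * Hypothetical helper: expand a few chat abbreviations, matching whole
 * words only (case-insensitively) by means of \b word boundaries.
 */
function expand_abbreviations(string $txt): string
{
    $replacements = array(
        '/\bhw\b/i'      => 'how',
        '/\br\b/i'       => 'are',
        '/\bu\b/i'       => 'you',
        '/\bwht\'?s\b/i' => 'what is',
    );

    // Patterns are applied one after another, in array order.
    return preg_replace(array_keys($replacements), array_values($replacements), $txt);
}

echo expand_abbreviations('hw r u'), "\n";
echo expand_abbreviations("Wht's your name"), "\n";
```

The normalized text can then be searched for the key phrase case-insensitively, for example with stripos.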

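You might consider using the soundex function to convert your input string into a phonetically equivalent form, and then proceed with your search.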
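As a rough sketch of that idea, each word can be run through PHP's built-in soundex() and the resulting codes compared instead of the raw text. The helper name soundex_key is hypothetical.

```php
<?php
/**
 * Hypothetical helper: encode every word of a phrase with soundex()
 * and join the codes with spaces.
 */
function soundex_key(string $phrase): string
{
    // str_word_count(..., 1) returns the list of words in the phrase
    $words = str_word_count(strtolower($phrase), 1);
    return implode(' ', array_map('soundex', $words));
}

echo soundex_key('what is your name'), "\n"; // W300 I200 Y600 N500
echo soundex_key('wht is yur name'), "\n";   // W300 I200 Y600 N500
```

Dropped vowels are absorbed this way, but a contraction such as "wht's" yields a different number of words and codes than "what is", so it would still need to be expanded first.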

