
String key phrase matching

With Levenshtein distance, "how are you", "hw ru", "how are u", and "hw ar you" can all be treated as the same phrase.

Is there any way I can achieve the following?

Suppose I have phrases like these.

phrase

hi, my name is john doe. I live in new york. What is your name?

phrase

My name is Bruce. wht's your name

key phrase

What is your name

response

my name is batman.

I'm getting the input from the user. I have a table with a list of possible requests and their responses. For example, the user will ask about "its name". Is there a way I can check whether a sentence contains a key phrase like "What is your name" and, if it is found, return the matching response?

For example:

$phrase = ' hi, my name is john doe. I live in new york. What is your name?';

// I know this one will work
if (strpos($phrase, "What is your name") !== false) {
    return $response;
}

// but what if the user mistypes it?
if (strpos($phrase, "Wht's your name") !== false) {
    return $response;
}

Is there a way to achieve this? Levenshtein works well only when the input string is not much longer than the key phrase it is compared with.

For example:

hi,wht's your name

my name is batman.

But if the input is long:

hi, my name is john doe. I live in new york. What is your name?

it does not work well. If the table contains shorter phrases, it picks a shorter phrase that happens to have a smaller distance and returns the wrong response.

I was thinking another way around this is to check for certain key phrases. Any ideas on how to achieve this?

I was working on something like this, but I think there may be a better and more proper way:

$samplePhrase = 'hi, im spongebob, i work at krabby patty. i love patties. Whts your name my friend';

$keyPhrase = 'What is your name';
  1. Get the first character of keyPhrase. That would be 'W'.
  2. Iterate through the $samplePhrase characters and compare each one to the first character of keyPhrase:
  3. h,i, ,i,m, ,s,p etc.
  4. If keyPhrase.char = samplePhrase.currentChar:
  5. get keyPhrase.length
  6. get the samplePhrase.currentChar index
  7. get the substring of samplePhrase that starts at the currentChar index and has length keyPhrase.length
  8. The first substring it gets would be "work at krabby pa".
  9. Compare "work at krabby pa" to $keyPhrase ('What is your name') using Levenshtein distance,
  10. and, to check it better, use similar_text.
  11. If they are not equal and the distance is too big, repeat the process.
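The steps above can be sketched roughly as follows. This is only an illustration of the sliding-window idea: the function name find_closest_window is hypothetical, the comparison is made case-insensitive (an assumption, since the sample uses a lowercase "work" against an uppercase 'W'), and the similar_text check from step 10 is left out.

```php
<?php
// Hypothetical helper sketching the sliding-window idea from the list above.
// It returns the closest window instead of applying a threshold, so the
// caller can decide how large a distance is still acceptable.
function find_closest_window($sample, $key) {
    $sample  = strtolower($sample);
    $key     = strtolower($key);
    $key_len = strlen($key);
    $best    = null;
    if ($key_len === 0) {
        return $best;
    }

    for ($i = 0; $i < strlen($sample); $i++) {
        // Steps 1-4: only start a window where the first characters agree.
        if ($sample[$i] !== $key[0]) {
            continue;
        }
        // Steps 5-7: cut a window as long as the key phrase.
        $window = substr($sample, $i, $key_len);
        // Step 9: compare the window with the key phrase.
        $distance = levenshtein($key, $window);
        if ($best === null || $distance < $best['distance']) {
            $best = array('offset' => $i, 'window' => $window, 'distance' => $distance);
        }
    }
    return $best;
}

$samplePhrase = 'hi, im spongebob, i work at krabby patty. i love patties. Whts your name my friend';
$keyPhrase    = 'What is your name';
print_r(find_closest_window($samplePhrase, $keyPhrase));
```

On this sample the closest window is "whts your name my" at distance 6: three characters are missing from "whts", and the fixed-length window picks up three trailing characters in exchange. A window measured in characters therefore tends to inflate the distance whenever the typo changes the length of the phrase.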

My suggestion would be to generate a list of n-grams from the input phrase and calculate the edit distance between each n-gram and the key phrase.

Example:

key phrase: "What is your name"
phrase 1: "hi, my name is john doe. I live in new york. What is your name?"
phrase 2: "My name is Bruce. wht's your name"

A possible matching n-gram would be between 3 and 4 words long, so we create all 3-grams and 4-grams for each phrase. We should also normalize the string by removing punctuation and lowercasing everything.

phrase 1 3-grams:
"hi my name", "my name is", "name is john", "is john doe", "john doe i", "doe i live"... "what is your", "is your name"
phrase 1 4-grams:
"hi my name is", "my name is john", "name is john doe", "is john doe i"... "what is your name"

phrase 2 3-grams:
"my name is", "name is bruce", "is bruce wht's", "bruce wht's your", "wht's your name"
phrase 2 4-grams:
"my name is bruce", "name is bruce wht's", "is bruce wht's your", "bruce wht's your name"

Next, you can compute the Levenshtein distance for each n-gram; this should solve the use case you presented above. If you need to normalize each word further, you can use phonetic encoders such as Double Metaphone or NYSIIS. However, I ran a test with all the "common" phonetic encoders and in your case they did not show a significant improvement; phonetic encoders are more suitable for names.

I have limited experience with PHP, but here is a code example:

<?php
function extract_ngrams($phrase, $min_words, $max_words) {
    echo "Calculating N-Grams for phrase: $phrase\n";
    $ngrams = array();
    $words  = str_word_count(strtolower($phrase), 1);
    $word_count = count($words);

    for ($i = 0; $i <= $word_count - $min_words; $i++) {
        for ($j = $min_words; $j <= $max_words && ($j + $i) <= $word_count; $j++) {
            $ngrams[] = implode(' ',array_slice($words, $i, $j));
        }
    }
    return array_unique($ngrams);
}

function contains_key_phrase($ngrams, $key) {
    foreach ($ngrams as $ngram) {
        if (levenshtein($key, $ngram) < 5) {
            echo "found match: $ngram\n";
            return true;
        }
    }
    return false;
}

$key_phrase = "what is your name";
$phrases = array(
        "hi, my name is john doe. I live in new york. What is your name?",
        "My name is Bruce. wht's your name"
        );
$min_words = 3;
$max_words = 4;

foreach ($phrases as $phrase) {
    $ngrams = extract_ngrams($phrase, $min_words, $max_words);
    if (contains_key_phrase($ngrams,$key_phrase)) {
        echo "Phrase [$phrase] contains the key phrase [$key_phrase]\n";
    }
}
?>

And the output is something like this:

Calculating N-Grams for phrase: hi, my name is john doe. I live in new york. What is your name?
found match: what is your name
Phrase [hi, my name is john doe. I live in new york. What is your name?] contains the key phrase [what is your name]
Calculating N-Grams for phrase: My name is Bruce. wht's your name
found match: wht's your name
Phrase [My name is Bruce. wht's your name] contains the key phrase [what is your name]
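Since the question mentions a table of requests and responses, the same idea can be extended to pick the closest key phrase rather than the first one under the threshold. The sketch below is only one possible arrangement: the function names, the second table row, the default threshold of 4 and the choice of n-gram sizes (one word fewer to one word more than the key phrase) are all assumptions made for illustration.

```php
<?php
// Split a phrase into lowercase word n-grams of $min to $max words.
function word_ngrams($phrase, $min, $max) {
    $words  = str_word_count(strtolower($phrase), 1);
    $count  = count($words);
    $ngrams = array();
    for ($i = 0; $i < $count; $i++) {
        for ($len = $min; $len <= $max && $i + $len <= $count; $len++) {
            $ngrams[] = implode(' ', array_slice($words, $i, $len));
        }
    }
    return $ngrams;
}

// Return the response of the key phrase with the closest n-gram,
// or null when no n-gram is within $max_distance edits of any key phrase.
function best_response($input, array $table, $max_distance = 4) {
    $best_response = null;
    $best_distance = $max_distance + 1;
    foreach ($table as $key => $response) {
        $key       = strtolower($key);
        $key_words = str_word_count($key);
        $ngrams    = word_ngrams($input, max(1, $key_words - 1), $key_words + 1);
        foreach ($ngrams as $ngram) {
            $distance = levenshtein($key, $ngram);
            if ($distance < $best_distance) {
                $best_distance = $distance;
                $best_response = $response;
            }
        }
    }
    return $best_response;
}

$table = array(
    'what is your name' => 'my name is batman.',
    'where do you live' => 'i live in gotham.', // hypothetical extra row
);
echo best_response("My name is Bruce. wht's your name", $table), "\n";
```

Because every key phrase competes on its smallest distance, a short key phrase no longer wins just because the whole input is long.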

EDIT: I noticed some suggestions to add phonetic encoding to each word in the generated n-gram. I'm not sure phonetic encoding is the best answer to this problem, as the encoders are mostly tuned to stemming names (American, German or French, depending on the algorithm) and are not very good at stemming plain words.

I actually wrote a test in Java to validate this (as the encoders are more readily available there). Here is the output:

===========================
Created new phonetic matcher
    Engine: Caverphone2
    Key Phrase: what is your name
    Encoded Key Phrase: WT11111111 AS11111111 YA11111111 NM11111111
Found match: [What is your name?] Encoded: WT11111111 AS11111111 YA11111111 NM11111111
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Phrase: [My name is Bruce. wht's your name] MATCH: false
===========================
Created new phonetic matcher
    Engine: DoubleMetaphone
    Key Phrase: what is your name
    Encoded Key Phrase: AT AS AR NM
Found match: [What is your] Encoded: AT AS AR
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Found match: [wht's your name] Encoded: ATS AR NM
Phrase: [My name is Bruce. wht's your name] MATCH: true
===========================
Created new phonetic matcher
    Engine: Nysiis
    Key Phrase: what is your name
    Encoded Key Phrase: WAT I YAR NAN
Found match: [What is your name?] Encoded: WAT I YAR NAN
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Found match: [wht's your name] Encoded: WT YAR NAN
Phrase: [My name is Bruce. wht's your name] MATCH: true
===========================
Created new phonetic matcher
    Engine: Soundex
    Key Phrase: what is your name
    Encoded Key Phrase: W300 I200 Y600 N500
Found match: [What is your name?] Encoded: W300 I200 Y600 N500
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Phrase: [My name is Bruce. wht's your name] MATCH: false
===========================
Created new phonetic matcher
    Engine: RefinedSoundex
    Key Phrase: what is your name
    Encoded Key Phrase: W06 I03 Y09 N8080
Found match: [What is your name?] Encoded: W06 I03 Y09 N8080
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Found match: [wht's your name] Encoded: W063 Y09 N8080
Phrase: [My name is Bruce. wht's your name] MATCH: true

I used a Levenshtein distance of 4 when running these tests, but I am pretty sure you can find multiple edge cases where using a phonetic encoder fails to match correctly. Looking at the example, you can see that because of the stemming done by the encoders you are actually more likely to get false positives when using them this way. Keep in mind that these algorithms were originally intended to find people in a population census who have the same name, not to tell which English words "sound" the same.

What you are trying to achieve is quite a complex natural language processing task, and it usually requires parsing, among other things.

What I am going to suggest is to create a sentence tokenizer that splits the phrase into sentences. Then tokenize each sentence, splitting on whitespace and punctuation and probably also rewriting some abbreviations to a more standard form.

Then you can create custom logic that traverses the token list of each sentence looking for a specific meaning. For example, ['...','what','...','...','your','name','...','...','?'] can also mean "what is your name". The sentence could be "So, what is your name really?" or "What could your name be?"
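One very small form of such custom logic is an ordered, gap-tolerant token match. The sketch below is an assumption about what that logic could look like, not part of NlpTools: simple_tokens and tokens_in_order are hypothetical helpers, and the tokenizer is a plain regular-expression split rather than a real sentence tokenizer.

```php
<?php
// Hypothetical tokenizer: lowercase, then split on anything that is not
// a letter or an apostrophe.
function simple_tokens($sentence) {
    return preg_split("/[^a-z']+/", strtolower($sentence), -1, PREG_SPLIT_NO_EMPTY);
}

// True when every pattern token occurs in $tokens in the same order,
// with any number of other tokens allowed in between.
function tokens_in_order(array $tokens, array $pattern) {
    $next = 0;
    foreach ($tokens as $token) {
        if ($next < count($pattern) && $token === $pattern[$next]) {
            $next++;
        }
    }
    return $next === count($pattern);
}

$pattern = array('what', 'your', 'name');
var_dump(tokens_in_order(simple_tokens('So, what is your name really?'), $pattern));
var_dump(tokens_in_order(simple_tokens('What could your name be?'), $pattern));
var_dump(tokens_in_order(simple_tokens('Your name is what?'), $pattern));
```

The first two calls report a match and the third does not, because the tokens appear in a different order. A rule this loose will also accept sentences with a different meaning, so it is best treated as a first filter.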

I am adding code as an example. I am not saying you should use something this simple. The code below uses NlpTools, a natural language processing library in PHP (I am involved in the library, so feel free to assume I am biased).

 <?php

 include('vendor/autoload.php');

 use \NlpTools\Tokenizers\ClassifierBasedTokenizer;
 use \NlpTools\Classifiers\Classifier;
 use \NlpTools\Tokenizers\WhitespaceTokenizer;
 use \NlpTools\Tokenizers\WhitespaceAndPunctuationTokenizer;
 use \NlpTools\Documents\Document;

 class EndOfSentence implements Classifier
 {
     public function classify(array $classes, Document $d)
     {
         list($token, $before, $after) = $d->getDocumentData();

         $lastchar = substr($token, -1);
         $dotcnt = count(explode('.',$token))-1;

         if (count($after)==0)
             return 'EOW';

         // for some abbreviations
         if ($dotcnt>1)
             return 'O';

         if (in_array($lastchar, array(".","?","!")))
             return 'EOW';

         // every other token stays inside the current sentence
         return 'O';
     }
 }

 function normalize($s) {
     // get this somewhere static
     $hash_table = array(
         'whats'=>'what is',
         'whts'=>'what is',
         'what\'s'=>'what is',
         '\'s'=>'is',
         'n\'t'=>'not',
         'ur'=>'your'
         // .... more ....
     );

     $s = mb_strtolower($s,'utf-8');
     if (isset($hash_table[$s]))
         return $hash_table[$s];
     return $s;
 }

 $whitespace_tok = new WhitespaceTokenizer();
 $punct_tok = new WhitespaceAndPunctuationTokenizer();
 $sentence_tok = new ClassifierBasedTokenizer(
     new EndOfSentence(),
     $whitespace_tok
 );

 $text = 'hi, my name is john doe. I live in new york. What\'s your name? whts ur name';

 foreach ($sentence_tok->tokenize($text) as $sentence) {
     $words = $whitespace_tok->tokenize($sentence);
     $words = array_map(
         'normalize',
         $words
     );
     $words = call_user_func_array(
         'array_merge',
         array_map(
             array($punct_tok,'tokenize'),
             $words
         )
     );

     // decide what this sequence of tokens is
     print_r($words);
 }

First of all, expand all shorthand, for example "wht's" instead of "what is":

$txt=$_POST['txt'];
$txt=str_ireplace("hw r u","how are You",$txt);
$txt=str_ireplace(" hw "," how ",$txt);//a space before and after the phrase is required, otherwise it will replace every occurrence of hw (even inside a word that contains hw).
$txt=str_ireplace(" r "," are ",$txt);
$txt=str_ireplace(" u "," you ",$txt);
$txt=str_ireplace(" wht's "," What is ",$txt);

Similarly, add as many phrases as you want. Now just check for all possible questions in this text and get their position:

if (stripos($txt,"What is your name") !== false) {//the strict "!== false" is required: a match at position 0 is falsy
    return $response;
}
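The space-delimited replacements miss shorthand at the very start or end of the input, or right next to punctuation. A possible alternative is to use regular-expression word boundaries, sketched below. The function name expand_shorthand and the contents of the map are assumptions for illustration only.

```php
<?php
// Hypothetical helper: expand shorthand only where it stands as a whole word.
function expand_shorthand($text) {
    // Example entries only; extend the map as needed.
    $map = array(
        'hw'    => 'how',
        'r'     => 'are',
        'u'     => 'you',
        "wht's" => 'what is',
    );
    foreach ($map as $short => $full) {
        // \b matches at a word boundary, so "hw" inside a longer word is left alone.
        $pattern = '/\b' . preg_quote($short, '/') . '\b/i';
        $text = preg_replace($pattern, $full, $text);
    }
    return $text;
}

$txt = expand_shorthand("Wht's your name");
if (stripos($txt, 'what is your name') !== false) {
    echo "matched: $txt\n";
}
```

Word boundaries also make the order of the entries less fragile, since a replacement such as "you" is not picked up again by the rule for "u".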

You may think of using the soundex function to convert the input string into a phonetically equivalent spelling, and then proceed with your search.
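A minimal sketch of that idea, assuming each word is encoded separately with PHP's built-in soundex() and the key phrase is then searched for in the encoded input. The helper names soundex_phrase and contains_phonetic are hypothetical.

```php
<?php
// Encode every word of a phrase with PHP's built-in soundex().
function soundex_phrase($phrase) {
    $words = str_word_count(strtolower($phrase), 1);
    return implode(' ', array_map('soundex', $words));
}

// True when the encoded key phrase occurs inside the encoded input.
function contains_phonetic($input, $key) {
    return strpos(soundex_phrase($input), soundex_phrase($key)) !== false;
}

$key = 'what is your name';
echo soundex_phrase($key), "\n"; // W300 I200 Y600 N500
var_dump(contains_phonetic('hi, my name is john doe. I live in new york. What is your name?', $key));
var_dump(contains_phonetic("My name is Bruce. wht's your name", $key));
```

The second phrase is not found, because "wht's" encodes as a single word while the key phrase has separate codes for "what" and "is". This is consistent with the Soundex result in the test output earlier in the thread, so an exact search on phonetic codes is likely to need an edit-distance step or shorthand expansion as well.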
