String key phrase matching
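With levenshtein, "how are you", "hw ru", "how are u", and "hw ar you" can be compared as the same. Is there any way I can achieve this if I have a longer phrase, like the following?
Phrase
hi, my name is john doe. I live in new york. What is your name?
Phrase
My name is Bruce. wht's your name
Key phrase
What is your name
Response
My name is batman.
I'm getting the input from a user. I have a table with a list of possible requests and responses. For example, the user will ask for "its name". Is there a way to check whether the sentence contains a key phrase (such as "What is your name") and, if it is found, return the possible response?
Like: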
$phrase = 'hi, my name is john doe. I live in new york. What is your name?';
// I know this one will work
if (strpos($phrase, "What is your name") !== false) {
    return $response;
}
// but what if the user mistypes it
if (strpos($phrase, "Wht's your name") !== false) {
    return $response;
}
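Is there a way I can achieve this? levenshtein works perfectly only when the input string is not much longer than the string it is compared with.
Like:
hi, what is your name
my name is batman.
But if the input is as long as this:
hi, my name is john doe. I live in new york. What is your name?
it does not work well. If there are shorter key phrases, it picks the shorter phrase with the smaller distance and returns the wrong response.
I was thinking that another way would be to check for certain key phrases. Any ideas on how to achieve this?
I was working on something like the following, but maybe there is a better way: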
$samplePhrase = 'hi, im spongebob, i work at krabby patty. i love patties. Whts your name my friend';
$keyPhrase = 'What is your name';
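Get the first character of $keyPhrase. That would be "W".
Iterate over the characters of $samplePhrase and compare each one with the first character of $keyPhrase: h, i, (space), i, m, (space), s, p, etc.
If keyPhrase.char = samplePhrase.currentChar, take the substring of $samplePhrase that starts at the current character and is as long as $keyPhrase.
On the first match this gives "work at krabby pa".
Compare "work at krabby pa" with $keyPhrase ("What is your name") using levenshtein.

My suggestion would be to generate a list of n-grams from the phrase and calculate the edit distance between each n-gram and the key phrase.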
Example:
key phrase: "What is your name"
phrase 1: "hi, my name is john doe. I live in new york. What is your name?"
phrase 2: "My name is Bruce. wht's your name"
Possible matching n-grams are between 3 and 4 words long, so we create all 3-grams and 4-grams for each phrase. We should also normalize the strings by removing punctuation and lowercasing everything.
phrase 1 3-grams:
"hi my name", "my name is", "name is john", "is john doe", "john doe i", "doe i live"... "what is your", "is your name"
phrase 1 4-grams:
"hi my name is", "my name is john", "name is john doe", "is john doe i"... "what is your name"
phrase 2 3-grams:
"my name is", "name is bruce", "is bruce wht's", "bruce wht's your", "wht's your name"
phrase 2 4-grams:
"my name is bruce", "name is bruce wht's", "is bruce wht's your", "bruce wht's your name"
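Next, you can compute the levenshtein distance for each n-gram; this should solve the use case you presented above. If you need to normalize each word further, you can use phonetic encoders such as Double Metaphone or NYSIIS. However, I ran a test with all the "common" phonetic encoders and in your case they did not show a significant improvement; phonetic encoders are better suited to names.
My experience with PHP is limited, but here is a code example: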
<?php
// Build every n-gram of $min_words to $max_words words from the phrase.
function extract_ngrams($phrase, $min_words, $max_words) {
    echo "Calculating N-Grams for phrase: $phrase\n";
    $ngrams = array();
    // Lowercase and split into words; this drops most punctuation
    // (str_word_count keeps apostrophes and hyphens inside words).
    $words = str_word_count(strtolower($phrase), 1);
    $word_count = count($words);
    for ($i = 0; $i <= $word_count - $min_words; $i++) {
        for ($j = $min_words; $j <= $max_words && ($j + $i) <= $word_count; $j++) {
            $ngrams[] = implode(' ', array_slice($words, $i, $j));
        }
    }
    return array_unique($ngrams);
}

// Return true as soon as one n-gram is within 4 edits of the key phrase.
function contains_key_phrase($ngrams, $key) {
    foreach ($ngrams as $ngram) {
        if (levenshtein($key, $ngram) < 5) {
            echo "found match: $ngram\n";
            return true;
        }
    }
    return false;
}

$key_phrase = "what is your name";
$phrases = array(
    "hi, my name is john doe. I live in new york. What is your name?",
    "My name is Bruce. wht's your name"
);
$min_words = 3;
$max_words = 4;

foreach ($phrases as $phrase) {
    $ngrams = extract_ngrams($phrase, $min_words, $max_words);
    if (contains_key_phrase($ngrams, $key_phrase)) {
        echo "Phrase [$phrase] contains the key phrase [$key_phrase]\n";
    }
}
?>
The output looks like this:
Calculating N-Grams for phrase: hi, my name is john doe. I live in new york. What is your name?
found match: what is your name
Phrase [hi, my name is john doe. I live in new york. What is your name?] contains the key phrase [what is your name]
Calculating N-Grams for phrase: My name is Bruce. wht's your name
found match: wht's your name
Phrase [My name is Bruce. wht's your name] contains the key phrase [what is your name]
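A fixed cut-off of 5 edits is fairly generous for a 17-character key phrase. As a rough sketch (not part of the original answer), one could instead keep the closest n-gram and accept it only when its distance is small relative to the length of the key phrase. The names word_ngrams and best_ngram_match and the 0.3 ratio are assumptions made for illustration.

```php
<?php
// Hypothetical helper: all word n-grams of $min..$max words, lowercased.
function word_ngrams(string $phrase, int $min, int $max): array {
    $words = str_word_count(strtolower($phrase), 1);
    $count = count($words);
    $ngrams = [];
    for ($i = 0; $i + $min <= $count; $i++) {
        for ($n = $min; $n <= $max && $i + $n <= $count; $n++) {
            $ngrams[] = implode(' ', array_slice($words, $i, $n));
        }
    }
    return array_unique($ngrams);
}

// Hypothetical helper: returns [ngram, distance] for the closest n-gram,
// or null when even the closest one is more than $max_ratio * strlen($key) edits away.
function best_ngram_match(string $phrase, string $key, float $max_ratio = 0.3): ?array {
    $key = strtolower($key);
    $n = str_word_count($key);          // number of words in the key phrase
    $best = null;
    foreach (word_ngrams($phrase, max(1, $n - 1), $n) as $ngram) {
        $d = levenshtein($key, $ngram);
        if ($best === null || $d < $best[1]) {
            $best = [$ngram, $d];
        }
    }
    if ($best === null || $best[1] > $max_ratio * strlen($key)) {
        return null;
    }
    return $best;
}

$match = best_ngram_match("My name is Bruce. wht's your name", "What is your name");
$miss  = best_ngram_match("I live in new york", "What is your name");
```

Returning the closest n-gram together with its distance also makes it possible to rank several key phrases against the same input instead of stopping at the first one under the threshold.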
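Edit: I noticed some suggestions to add phonetic encoding to each word in the generated n-grams. I'm not sure phonetic encoders are the best answer to this problem, as they are mostly geared toward stemming names (American, German, or French, depending on the algorithm) and are not very good at stemming plain words.
I actually wrote a test to validate this in Java (as the encoders are more readily available there); here is the output: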
===========================
Created new phonetic matcher
Engine: Caverphone2
Key Phrase: what is your name
Encoded Key Phrase: WT11111111 AS11111111 YA11111111 NM11111111
Found match: [What is your name?] Encoded: WT11111111 AS11111111 YA11111111 NM11111111
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Phrase: [My name is Bruce. wht's your name] MATCH: false
===========================
Created new phonetic matcher
Engine: DoubleMetaphone
Key Phrase: what is your name
Encoded Key Phrase: AT AS AR NM
Found match: [What is your] Encoded: AT AS AR
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Found match: [wht's your name] Encoded: ATS AR NM
Phrase: [My name is Bruce. wht's your name] MATCH: true
===========================
Created new phonetic matcher
Engine: Nysiis
Key Phrase: what is your name
Encoded Key Phrase: WAT I YAR NAN
Found match: [What is your name?] Encoded: WAT I YAR NAN
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Found match: [wht's your name] Encoded: WT YAR NAN
Phrase: [My name is Bruce. wht's your name] MATCH: true
===========================
Created new phonetic matcher
Engine: Soundex
Key Phrase: what is your name
Encoded Key Phrase: W300 I200 Y600 N500
Found match: [What is your name?] Encoded: W300 I200 Y600 N500
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Phrase: [My name is Bruce. wht's your name] MATCH: false
===========================
Created new phonetic matcher
Engine: RefinedSoundex
Key Phrase: what is your name
Encoded Key Phrase: W06 I03 Y09 N8080
Found match: [What is your name?] Encoded: W06 I03 Y09 N8080
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Found match: [wht's your name] Encoded: W063 Y09 N8080
Phrase: [My name is Bruce. wht's your name] MATCH: true
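I used a levenshtein distance of 4 when running these tests, but I'm pretty sure you will find multiple edge cases where using the phonetic encoders fails to match correctly. Looking at the examples, you can see that, because of the stemming done by the encoders, you are actually more likely to get false positives when using them this way. Keep in mind that these algorithms were originally intended to find people in the population census who have the same name, not to decide which English words "sound" the same.

What you are trying to achieve is quite a complex natural language processing task, and it usually requires parsing, among other things.
What I am going to suggest is to create a sentence tokenizer that splits the phrase into sentences. Then tokenize each sentence by splitting on whitespace and punctuation, and probably also rewrite some abbreviations into a more standard form.
Then you can create custom logic that traverses the token list of each sentence looking for a specific meaning. For example, ['...','what','...','...','your','name','...','...','?'] can also mean "what is your name". The sentence could be "So, what is your name?" or "What might your name be?"
I am adding code as an example. I am not saying you should use something this simple. The code below uses NlpTools, a natural language processing library in PHP (I am involved in the library, so feel free to assume I am biased).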
<?php
include('vendor/autoload.php');

use \NlpTools\Tokenizers\ClassifierBasedTokenizer;
use \NlpTools\Classifiers\Classifier;
use \NlpTools\Tokenizers\WhitespaceTokenizer;
use \NlpTools\Tokenizers\WhitespaceAndPunctuationTokenizer;
use \NlpTools\Documents\Document;

// Decides for each token whether it ends a sentence ('EOW') or not ('O').
class EndOfSentence implements Classifier
{
    public function classify(array $classes, Document $d)
    {
        list($token, $before, $after) = $d->getDocumentData();
        $lastchar = substr($token, -1);
        $dotcnt = count(explode('.', $token)) - 1; // number of dots in the token
        // the last token always closes a sentence
        if (count($after) == 0) {
            return 'EOW';
        }
        // for some abbreviations
        if ($dotcnt > 1) {
            return 'O';
        }
        if (in_array($lastchar, array(".", "?", "!"))) {
            return 'EOW';
        }
        // any other token is inside a sentence
        return 'O';
    }
}

// Rewrite a single token to a more standard form.
function normalize($s) {
    // get this somewhere static
    $hash_table = array(
        'whats' => 'what is',
        'whts' => 'what is',
        'what\'s' => 'what is',
        '\'s' => 'is',
        'n\'t' => 'not',
        'ur' => 'your'
        // .... more ....
    );
    $s = mb_strtolower($s, 'utf-8');
    if (isset($hash_table[$s])) {
        return $hash_table[$s];
    }
    return $s;
}

$whitespace_tok = new WhitespaceTokenizer();
$punct_tok = new WhitespaceAndPunctuationTokenizer();
$sentence_tok = new ClassifierBasedTokenizer(
    new EndOfSentence(),
    $whitespace_tok
);

$text = 'hi, my name is john doe. I live in new york. What\'s your name? whts ur name';
foreach ($sentence_tok->tokenize($text) as $sentence) {
    // split on whitespace first so that abbreviations can be normalized
    $words = $whitespace_tok->tokenize($sentence);
    $words = array_map(
        'normalize',
        $words
    );
    // then split each normalized word on punctuation and flatten the result
    $words = call_user_func_array(
        'array_merge',
        array_map(
            array($punct_tok, 'tokenize'),
            $words
        )
    );
    // decide what this sequence of tokens is
    print_r($words);
}
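The "decide what this sequence of tokens is" step is left open above. One very simple way to sketch it, assuming the tokens have already been normalized as in that code, is to check whether the key words occur in order with arbitrary tokens in between, which is the ['...','what','...','your','name','...'] idea. The function name contains_in_order is hypothetical and is not part of NlpTools.

```php
<?php
// Hypothetical helper: true if every word of $pattern appears in $tokens
// in the same order, with any number of other tokens in between.
function contains_in_order(array $tokens, array $pattern): bool {
    $next = 0;                    // index of the pattern word still being looked for
    $needed = count($pattern);
    foreach ($tokens as $token) {
        if ($next < $needed && $token === $pattern[$next]) {
            $next++;
        }
    }
    return $next === $needed;
}

// Tokens as they might come out of the normalization step (assumed input).
$question = array('so', ',', 'what', 'is', 'your', 'name', '?');
$statement = array('my', 'name', 'is', 'bruce');
$pattern = array('what', 'your', 'name');

var_dump(contains_in_order($question, $pattern));  // bool(true)
var_dump(contains_in_order($statement, $pattern)); // bool(false)
```

This only checks word order, so it will also accept unrelated sentences that happen to contain the same words in sequence; real use would likely need stricter rules per pattern.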

First, fix all the shorthand forms. For example:
$txt = $_POST['txt'];
$txt = str_ireplace("hw r u", "how are You", $txt);
// A space before and after the shorthand is required; otherwise every
// occurrence of "hw" is replaced, even inside a word.
$txt = str_ireplace(" hw ", " how ", $txt);
$txt = str_ireplace(" r ", " are ", $txt);
$txt = str_ireplace(" u ", " you ", $txt);
$txt = str_ireplace(" wht's ", " What is ", $txt);
Similarly, add as many shorthand forms as you need. Now just check the text for all the possible questions and get their position:
// strpos/stripos return 0 for a match at the very start of the string,
// so the result must be compared with !== false
if (stripos($txt, "What is your name") !== false) {
    return $response;
}
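Padding each shorthand with spaces misses it at the start or end of the text and next to punctuation. A possible variation, sketched here with an assumed replacement table and a hypothetical function name expand_shorthand, is to match on word boundaries with a regular expression instead.

```php
<?php
// Hypothetical helper: expand shorthand using case-insensitive whole-word matches.
function expand_shorthand(string $txt): string {
    // Assumed table; extend it with whatever shorthand your users type.
    $map = array(
        "wht's"  => 'what is',
        "what's" => 'what is',
        'hw'     => 'how',
        'r'      => 'are',
        'u'      => 'you',
    );
    foreach ($map as $short => $long) {
        // \b stops "hw", "r" or "u" from matching inside a longer word
        $pattern = '/\b' . preg_quote($short, '/') . '\b/i';
        $txt = preg_replace($pattern, $long, $txt);
    }
    return $txt;
}

$txt = expand_shorthand("hw r u? Wht's your name");
// stripos is case-insensitive; compare with !== false because a match
// at position 0 would otherwise be treated as "not found"
$found = stripos($txt, 'what is your name') !== false;
```

A lookup table like this still only covers the misspellings someone has thought of in advance, so it works best combined with a fuzzy comparison on whatever remains.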

You might consider using the soundex function to convert the input string into its phonetic equivalent and then carry on with your search.
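As a rough illustration of that idea, PHP's built-in soundex() can be applied word by word before searching. The helper name soundex_phrase is hypothetical, and the sample input is made up. As the encoder test output earlier in this thread suggests, phonetic codes can both miss some misspellings and produce false positives, so treat this only as a sketch.

```php
<?php
// Hypothetical helper: replace each word of the phrase with its soundex code.
function soundex_phrase(string $phrase): string {
    $words = str_word_count(strtolower($phrase), 1);
    return implode(' ', array_map('soundex', $words));
}

$key    = soundex_phrase('what is your name');
$sample = soundex_phrase('hi, wat is yor name');

// Plain substring search on the encoded strings.
$found = strpos($sample, $key) !== false;
```

Because the comparison happens on the encoded strings, it could be combined with the n-gram and levenshtein approach from the first answer instead of an exact substring search.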