
Find similar words in an array and eliminate them

$a[] = "paris";
$a[] = "london";
$a[] = "paris";
$a[] = "london tour";
$a[] = "london tours";
$a[] = "london";
$a[] = "londonn";

foreach ($a as $name) {
    echo $name;
    echo '<br>';
}

Output:

paris
london
paris
london tour
london tours
london
londonn

I can eliminate identical words with array_unique:

foreach (array_unique($a) as $name) {
    echo $name;
    echo '<br>';
}

Output:

paris
london
london tour
london tours
londonn

I want to take this further and eliminate similar words. For example, if there is a "london", I want to eliminate "londonn".

So the output will be:

paris
london
london tour

I tried similar_text($name, $name, $percent) but it did not help.

Here is what I tried with my limited knowledge:

foreach (array_unique($a) as $name) {
    $test = $a;
    foreach ($test as $test1) {
        similar_text($name, $test1, $percent);
        if ($percent > 90) {
            echo $name;
            echo '<br>';
        }
    }
}

Output:

paris
paris
london
london
london
london tour
london tour
london tours
london tours
londonn
londonn
londonn

The source of the words is a search list:

$a[] = "$popular_search";

The main problem seems to be the way you use the two nested loops. Here's a very explicit example, without anything fancy, showing how you could do this:

$a[] = "paris";
$a[] = "london";
$a[] = "paris";
$a[] = "london tour";
$a[] = "london tours";
$a[] = "london";
$a[] = "londonn";

$b = [];
foreach($a as $outerName) {
    // start optimistic, no similar string found
    $isUnique = true;
    foreach($b as $innerName) {
        // check whether the string already has a similar entry
        similar_text($outerName, $innerName, $percent);
        if ($percent > 90) {
            $isUnique = false;
            break;
        }
    }
    if ($isUnique) {
        $b[] = $outerName;
    }
}

print_r($b);

Working example

The output is:

Array
(
    [0] => paris
    [1] => london
    [2] => london tour
)

How does it work? There's an outer loop that simply goes through all the strings in array $a. Inside that loop, an inner loop goes through the strings in $b that have already been identified as unique enough. If a string from $a is similar enough to a string in $b, we skip it. That's all.
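The same two loops can be wrapped in a reusable function with a configurable threshold. This is only a sketch: the name filterSimilar and its $threshold parameter are made up for illustration and are not part of PHP.

```php
<?php
// Hypothetical helper: keeps the first occurrence of each group of similar strings.
function filterSimilar(array $words, float $threshold = 90.0): array
{
    $kept = [];
    foreach ($words as $candidate) {
        // Start optimistic: assume nothing similar has been kept yet
        $isUnique = true;
        foreach ($kept as $existing) {
            // similar_text() writes the similarity percentage into $percent
            similar_text($candidate, $existing, $percent);
            if ($percent > $threshold) {
                $isUnique = false;
                break;
            }
        }
        if ($isUnique) {
            $kept[] = $candidate;
        }
    }
    return $kept;
}

$words = ["paris", "london", "paris", "london tour", "london tours", "london", "londonn"];
$unique = filterSimilar($words);
print_r($unique); // paris, london, london tour
```

Note that the result depends on the order of the input: the first string of a similar group wins, so if "londonn" appeared before "london", "londonn" would be the one that is kept. Lowering the threshold makes the filter more aggressive; at 65, for instance, "london tour" (about 71% similar to "london") would be dropped as well.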

You can use the $percent argument that similar_text() fills in by reference. It holds the percentage of similarity between the two inputs.
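To get a feel for the numbers, here is a small sketch using words from the question. The variable names are arbitrary. It also shows why comparing a string with itself, as in similar_text($name, $name, $percent), cannot help: the result is always 100.

```php
<?php
// A string compared with itself is always 100% similar
similar_text("london", "london", $same);

// The return value is the number of matching characters;
// the percentage is written into the third argument
$matched = similar_text("london", "londonn", $close);

// No characters in common
similar_text("london", "paris", $far);

printf("%.2f %.2f %.2f (matched chars: %d)\n", $same, $close, $far, $matched);
// prints: 100.00 92.31 0.00 (matched chars: 6)
```

The percentage is the number of matching characters times two, divided by the combined length of both strings, so "london" against "londonn" gives 12 / 13, roughly 92%.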

For a word game I implemented, I used this approach, and to 'match' the word(s), testing for a percentage of >= 60 to 80 seemed to work for 'most' of my test cases. It depends how picky you want it to be!

In my case, to make it more accurate, I actually converted the test words to metaphones first:

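// Written as a method of a helper class; drop "public static" to use it as a plain function
public static function testMetaphone($s1 = "", $s2 = "", $phonemes = 4)
{
    // Nothing to compare if either input is empty
    if (empty($s1) || empty($s2)) {
        return false;
    }

    // Compare the phonetic keys instead of the raw strings
    $m1 = metaphone($s1, $phonemes);
    $m2 = metaphone($s2, $phonemes);
    $sim = similar_text($m1, $m2, $perc);
    $logMessage = "M1: {$m1}, M2: {$m2}, Similarity: $sim ($perc %) - Original text: {$s1} | {$s2}";
    // Log is assumed to be a logging facade from your framework; remove this line if you have none
    Log::info("testMetaphone: " . $logMessage);
    // Test accuracy
    return $perc >= 85;
}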

Usage:

$answerCheck = testMetaphone("Toyota", "Totota", 6);

See it in action: https://3v4l.org/KceXD - The above fails if the threshold is 85% but passes if it is 60%. So, again, you may need to play with that to find where YOU are happy with its accuracy.

In your case you can loop over the array and compare each element with every other element using this function, keep track of each word checked and how many similar entries there are, and then delete the 'duplicates' accordingly.
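A rough sketch of that loop, combining the metaphone comparison with a keep-the-first-occurrence filter, might look like this. The helper names soundsSimilar and filterBySound are hypothetical, the logging is left out, and metaphone() is called without a phoneme limit (an assumption that differs from the default of 4 used above).

```php
<?php
// Hypothetical helper: true if the metaphone codes of both strings are similar enough.
function soundsSimilar(string $s1, string $s2, float $threshold = 85.0): bool
{
    if ($s1 === '' || $s2 === '') {
        return false;
    }
    // Compare the phonetic codes; no phoneme limit is passed to metaphone()
    similar_text(metaphone($s1), metaphone($s2), $percent);
    return $percent >= $threshold;
}

// Hypothetical helper: keeps the first word of each group of similar-sounding words.
function filterBySound(array $words, float $threshold = 85.0): array
{
    $kept = [];
    foreach ($words as $candidate) {
        $isDuplicate = false;
        foreach ($kept as $existing) {
            if (soundsSimilar($candidate, $existing, $threshold)) {
                $isDuplicate = true;
                break;
            }
        }
        if (!$isDuplicate) {
            $kept[] = $candidate;
        }
    }
    return $kept;
}

$bySound = filterBySound(["paris", "london", "paris", "london tour", "london tours", "london", "londonn"]);
print_r($bySound);
```

The phoneme limit matters here: a low limit truncates the codes, so longer phrases that share a prefix, such as "london" and "london tour", may end up looking identical. As with the plain similar_text() approach, the threshold needs some experimenting.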
