Group array items by similarity of key values

Question

Suppose I have an array like this:

$data[0]['name'] = 'product 1 brandX';
$data[0]['id_product'] = '77777777';
$data[1]['name'] = 'brandX product 1';
$data[1]['id_product'] = '77777777';
$data[2]['name'] = 'brandX product 1 RED';
$data[2]['id_product'] = '77777777';
$data[3]['name'] = 'product 1 brandX';
$data[3]['id_product'] = '';
$data[4]['name'] = 'product 2 brandY';
$data[4]['id_product'] = '8888888';
$data[5]['name'] = 'product 2 brandY RED';
$data[5]['id_product'] = '';

I am trying to group the entries by similarity (of name or id_product).

This would be the expected final array:

$uniques[0]['name'] = 'product 1 brandX'; // The shortest name for the product
$uniques[0]['count'] = 4; // Entries that contain all the words of the shortest name or share its id_product
$uniques[1]['name'] = 'product 2 brandY';
$uniques[1]['count'] = 2;

This is what I have tried so far:

foreach ($data as $t) {
    if (!isset($uniques[$t['id_product']]['name']) || mb_strlen($uniques[$t['id_product']]['name']) > mb_strlen($t['name'])) {
        $uniques[$t['id_product']]['name'] = $t['name'];
        $uniques[$t['id_product']]['count']++;
    }
}

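But I can't rely on id_product alone, because sometimes the same product appears once with an id and once without. I also need to check the name, but I couldn't get that to work.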

Answer 1

I don't think this will solve your problem, but it may get you moving again:

    $data = [];

    $data[0]['name']       = 'product 1 brandX';
    $data[0]['id_product'] = '77777777';
    $data[1]['name']       = 'brandX product 1';
    $data[1]['id_product'] = '77777777';
    $data[2]['name']       = 'brandX product 1 RED';
    $data[2]['id_product'] = '77777777';
    $data[3]['name']       = 'product 1 brandX';
    $data[3]['id_product'] = '';
    $data[4]['name']       = 'product 2 brandY';
    $data[4]['id_product'] = '8888888';
    $data[5]['name']       = 'product 2 brandY RED';
    $data[5]['id_product'] = '';

    $data = collect($data);

    $tallies = [
        'brand_x' => 0,
        'brand_y' => 0,
        'other'   => 0
    ];

    $unique = $data->unique(function ($item) use (&$tallies){
        switch(true){
            case(strpos($item['name'], 'brandX') !== false):
                $tallies['brand_x']++;

                return 'product X';
                break;

            case(strpos($item['name'], 'brandY') !== false):
                $tallies['brand_y']++;

                return 'product Y';
                break;

            default:
                $tallies['other']++;

                return 'other';
                break;
        }
    });


    print_r($unique);
    print_r($tallies);
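For readers who are not using Laravel (collect() is Laravel's collection helper), the tally part of this answer can be sketched in plain PHP. The function name tallyByBrand and the $brands parameter are assumptions made for illustration. Like the answer itself, this only counts names that contain a known brand string; it does not group by product.

```php
<?php
// Hypothetical helper: count how many names contain each known brand string.
function tallyByBrand(array $data, array $brands)
{
    $tallies = array_fill_keys($brands, 0);
    $tallies['other'] = 0;

    foreach ($data as $item) {
        $matched = 'other';
        foreach ($brands as $brand) {
            // Case-sensitive substring check, as in the answer above.
            if (strpos($item['name'], $brand) !== false) {
                $matched = $brand;
                break;
            }
        }
        $tallies[$matched]++;
    }

    return $tallies;
}

$data = [
    ['name' => 'product 1 brandX',     'id_product' => '77777777'],
    ['name' => 'brandX product 1',     'id_product' => '77777777'],
    ['name' => 'brandX product 1 RED', 'id_product' => '77777777'],
    ['name' => 'product 1 brandX',     'id_product' => ''],
    ['name' => 'product 2 brandY',     'id_product' => '8888888'],
    ['name' => 'product 2 brandY RED', 'id_product' => ''],
];

$tallies = tallyByBrand($data, ['brandX', 'brandY']);
print_r($tallies); // brandX => 4, brandY => 2, other => 0
```

This depends on knowing the brand strings in advance, so it is probably only useful when the list of brands is small and fixed.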

Answer 2

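I think the best way to solve this is to use a unique product_id. But if you want to build a unique key by looking for similarities in the name field, you can use preg_split to turn each name into an array of words and then array_diff to find the words that differ. If the diff between two names contains fewer than 2 words, they are treated as the same product. I wrote this function, which returns the similar name found among the keys of $arr, or false if none is found: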

function get_similare_key($arr, $name) {

    $names = preg_split("/\s+/", $name); 

    // look for a similar key in $arr
    foreach( $arr as $key => $value ) {

        $key_names = preg_split("/\s+/", $key); 
        $diff = array_diff($key_names, $names); 
        if ( count($diff) <= 1 ) { 
            return $key;
        }

    }

    return false;

}

Here is a working demo.
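To show how this function might be used to build the name/count pairs the question asks for, here is a minimal sketch. The name find_similar_key is hypothetical: it is a condensed copy of get_similare_key, renamed so the snippet runs on its own. Each group keeps the first name it sees, so the stored name is the shortest one only when shorter names come before their longer variants, as they do in the sample data.

```php
<?php
// Condensed copy of get_similare_key() from the answer (hypothetical name).
function find_similar_key(array $arr, $name)
{
    $names = preg_split('/\s+/', $name);
    foreach ($arr as $key => $value) {
        // Words of the stored key that are missing from $name.
        $diff = array_diff(preg_split('/\s+/', $key), $names);
        if (count($diff) <= 1) {
            return $key;
        }
    }
    return false;
}

$data = [
    ['name' => 'product 1 brandX',     'id_product' => '77777777'],
    ['name' => 'brandX product 1',     'id_product' => '77777777'],
    ['name' => 'brandX product 1 RED', 'id_product' => '77777777'],
    ['name' => 'product 1 brandX',     'id_product' => ''],
    ['name' => 'product 2 brandY',     'id_product' => '8888888'],
    ['name' => 'product 2 brandY RED', 'id_product' => ''],
];

$uniques = [];
foreach ($data as $item) {
    $key = find_similar_key($uniques, $item['name']);
    if ($key === false) {
        // First time this product is seen: start a new group keyed by its name.
        $uniques[$item['name']] = ['name' => $item['name'], 'count' => 1];
    } else {
        $uniques[$key]['count']++;
    }
}
$uniques = array_values($uniques);
print_r($uniques); // 'product 1 brandX' => 4, 'product 2 brandY' => 2
```

Note that the one-word tolerance cuts both ways: 'product 1 brandX' and 'product 2 brandX' differ by a single word, so they would land in the same group. The comparison is also case-sensitive and ignores id_product entirely.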

Answer 3

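My answer is based on two assumptions about how products should be grouped:

Although id_product may be missing, where it is present it is correct and sufficient to match two products; and
For two product names to match, the longer name (the one with more words) must contain all the words of the shorter name (the one with fewer words).

Based on these assumptions, here is a function that determines whether two individual products match (that is, whether they should be grouped together), along with a helper function that extracts the words from a name: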

function productsMatch(array $product1, array $product2)
{
    if (
        !empty($product1['id_product'])
        && !empty($product2['id_product'])
        && $product1['id_product'] === $product2['id_product']
    ) {
        // match based on id_product
        return true;
    }
    $words1 = getWordsFromProduct($product1);
    $words2 = getWordsFromProduct($product2);
    $min_word_count = min(count($words1), count($words2));
    $match_word_count = count(array_intersect_key($words1, $words2));
    if ($min_word_count >= 1 && $match_word_count === $min_word_count) {
        // match based on name similarity
        return true;
    }
    // no match
    return false;
}

function getWordsFromProduct(array $product)
{
    $name = mb_strtolower($product['name']);
    preg_match_all('/\S+/', $name, $matches);
    $words = array_flip($matches[0]);
    return $words;
}
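The name comparison relies on array_flip turning each word list into a set (words become array keys), so array_intersect_key can count the shared words. A minimal standalone sketch of just that step follows; the function name namesMatch is hypothetical, and mb_strtolower assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper: true when every word of the name with fewer distinct
// words also occurs in the other name (case-insensitive, order ignored).
function namesMatch($name1, $name2)
{
    // array_flip makes each word a key, which also removes duplicate words.
    $words1 = array_flip(preg_split('/\s+/', mb_strtolower($name1), -1, PREG_SPLIT_NO_EMPTY));
    $words2 = array_flip(preg_split('/\s+/', mb_strtolower($name2), -1, PREG_SPLIT_NO_EMPTY));

    $min    = min(count($words1), count($words2));
    $shared = count(array_intersect_key($words1, $words2));

    // An empty name never matches anything.
    return $min >= 1 && $shared === $min;
}

var_dump(namesMatch('product 1 brandX', 'brandX product 1 RED')); // bool(true)
var_dump(namesMatch('product 1 brandX', 'product 2 brandY'));     // bool(false)
var_dump(namesMatch('Product 1 BrandX', 'product 1 brandx'));     // bool(true)
```

Because the words are compared as a set, word order and repeated words make no difference to the result.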

This function can then be used to group the products:

function groupProducts(array $data)
{
    $groups = array();
    foreach ($data as $product1) {
        foreach ($groups as $key => $products) {
            foreach ($products as $product2) {
                if (productsMatch($product1, $product2)) {
                    $groups[$key][] = $product1;
                    continue 3; // foreach ($data as $product1)

                }
            }
        }
        $groups[] = array($product1);
    }
    return $groups;
}

The following function can then be used to extract the shortest name and the count for each group:

function uniqueProducts(array $groups)
{
    $uniques = array();
    foreach ($groups as $products) {
        $shortest_name = '';
        $shortest_length = PHP_INT_MAX;
        $count = 0;
        foreach ($products as $product) {
            $length = mb_strlen($product['name']);
            if ($length < $shortest_length) {
                $shortest_name = $product['name'];
                $shortest_length = $length;
            }
            $count++;
        }
        $uniques[] = array(
            'name' => $shortest_name,
            'count' => $count,
        );
    }
    return $uniques;
}

So, combining all 4 functions, you can get the uniques as follows (tested with PHP 5.6):

$data[0]['name'] = 'product 1 brandX';
$data[0]['id_product'] = '77777777';
$data[1]['name'] = 'brandX product 1';
$data[1]['id_product'] = '77777777';
$data[2]['name'] = 'brandX product 1 RED';
$data[2]['id_product'] = '77777777';
$data[3]['name'] = 'product 1 brandX';
$data[3]['id_product'] = '';
$data[4]['name'] = 'product 2 brandY';
$data[4]['id_product'] = '8888888';
$data[5]['name'] = 'product 2 brandY RED';
$data[5]['id_product'] = '';

$groups = groupProducts($data);
$uniques = uniqueProducts($groups);
var_dump($uniques);

This gives the output:

array(2) {
  [0]=>
  array(2) {
    ["name"]=>
    string(16) "product 1 brandX"
    ["count"]=>
    int(4)
  }
  [1]=>
  array(2) {
    ["name"]=>
    string(16) "product 2 brandY"
    ["count"]=>
    int(2)
  }
}

3 answers

Answer 1
0 2018-01-09 19:24:17

Answer 2
0 2018-01-09 20:02:04

Answer 3
0 Accepted 2018-01-11 16:46:23