Grouping array based on the key value similarity

Suppose I have an array like this:

$data[0]['name'] = 'product 1 brandX';
$data[0]['id_product'] = '77777777';
$data[1]['name'] = 'brandX product 1';
$data[1]['id_product'] = '77777777';
$data[2]['name'] = 'brandX product 1 RED';
$data[2]['id_product'] = '77777777';
$data[3]['name'] = 'product 1 brandX';
$data[3]['id_product'] = '';
$data[4]['name'] = 'product 2 brandY';
$data[4]['id_product'] = '8888888';
$data[5]['name'] = 'product 2 brandY RED';
$data[5]['id_product'] = '';

I'm trying to group the entries by similarity (of name or id_product).

This is the final array I expect:

$uniques[0]['name'] = 'product 1 brandX'; //The smallest name for the product
$uniques[0]['count'] = 4; //Entry which has all the words of the smallest name or the same id_product
$uniques[1]['name'] = 'product 2 brandY';
$uniques[1]['count'] = 2;

This is what I have tried so far:

foreach ($data as $t) {
    if (!isset($uniques[$t['id_product']]['name']) || mb_strlen($uniques[$t['id_product']]['name']) > mb_strlen($t['name'])) {
        $uniques[$t['id_product']]['name'] = $t['name'];
        $uniques[$t['id_product']]['count']++;
    }
}

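But I can't rely on id_product alone, because sometimes two entries are the same product but one has the id and the other doesn't. I need to check the name as well, and I haven't managed to get that working.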

I don't think this will solve your problem, but it might get you moving again:

    $data = [];

    $data[0]['name']       = 'product 1 brandX';
    $data[0]['id_product'] = '77777777';
    $data[1]['name']       = 'brandX product 1';
    $data[1]['id_product'] = '77777777';
    $data[2]['name']       = 'brandX product 1 RED';
    $data[2]['id_product'] = '77777777';
    $data[3]['name']       = 'product 1 brandX';
    $data[3]['id_product'] = '';
    $data[4]['name']       = 'product 2 brandY';
    $data[4]['id_product'] = '8888888';
    $data[5]['name']       = 'product 2 brandY RED';
    $data[5]['id_product'] = '';

    $data = collect($data); // collect() is Laravel's collection helper

    $tallies = [
        'brand_x' => 0,
        'brand_y' => 0,
        'other'   => 0
    ];

    $unique = $data->unique(function ($item) use (&$tallies){
        switch(true){
            case(strpos($item['name'], 'brandX') !== false):
                $tallies['brand_x']++;

                return 'product X';
                break;

            case(strpos($item['name'], 'brandY') !== false):
                $tallies['brand_y']++;

                return 'product Y';
                break;

            default:
                $tallies['other']++;

                return 'other';
                break;
        }
    });


    print_r($unique);
    print_r($tallies);
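
The snippet above depends on Laravel's `collect()`. As a rough sketch without the framework, and assuming the intent is to keep the first item seen per brand while counting the items per brand, the same idea can be written as a plain loop. The `$bucket` variable and the array-literal form of `$data` are illustrative choices, not part of the original answer.

```php
<?php
// Same sample data as above, written as an array literal.
$data = [
    ['name' => 'product 1 brandX',     'id_product' => '77777777'],
    ['name' => 'brandX product 1',     'id_product' => '77777777'],
    ['name' => 'brandX product 1 RED', 'id_product' => '77777777'],
    ['name' => 'product 1 brandX',     'id_product' => ''],
    ['name' => 'product 2 brandY',     'id_product' => '8888888'],
    ['name' => 'product 2 brandY RED', 'id_product' => ''],
];

$tallies = ['brand_x' => 0, 'brand_y' => 0, 'other' => 0];
$unique  = []; // first item seen for each bucket

foreach ($data as $item) {
    // Pick a bucket from the brand substring found in the name.
    if (strpos($item['name'], 'brandX') !== false) {
        $bucket = 'brand_x';
    } elseif (strpos($item['name'], 'brandY') !== false) {
        $bucket = 'brand_y';
    } else {
        $bucket = 'other';
    }

    $tallies[$bucket]++;

    // Keep only the first item of each bucket.
    if (!isset($unique[$bucket])) {
        $unique[$bucket] = $item;
    }
}

print_r($unique);
print_r($tallies);
```

Like the original, this only works when the brand strings are known in advance, so it does not generalize to arbitrary product names.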

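I think the best way to solve this is to use a unique product_id, but if you want to build the unique key by looking for similarity in the name field, you can use preg_split to turn each name into an array of words and then array_diff to find the words that differ. If the difference between two names is fewer than 2 words, they are treated as the same product. I wrote this function, which returns the similar name found among the keys of $arr, or false if none is found: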

function get_similare_key($arr, $name) {

    $names = preg_split("/\s+/", $name); 

    // look for a key in $arr that is similar to $name
    foreach( $arr as $key => $value ) {

        $key_names = preg_split("/\s+/", $key); 
        $diff = array_diff($key_names, $names); 
        if ( count($diff) <= 1 ) { 
            return $key;
        }

    }

    return false;

}

Here is a working demo.
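
As a sketch of how the function might be used to produce the counts from the question, the loop below keeps a `$counts` array keyed by name. The `$counts` and `$names` variables are assumptions made for illustration; the function body is a compact copy of the one above so that the snippet runs on its own.

```php
<?php
// Copy of the function above so this sketch is self-contained.
if (!function_exists('get_similare_key')) {
    function get_similare_key($arr, $name) {
        $names = preg_split("/\s+/", $name);
        foreach ($arr as $key => $value) {
            $key_names = preg_split("/\s+/", $key);
            // Match when at most one word of the key is missing from $name.
            if (count(array_diff($key_names, $names)) <= 1) {
                return $key;
            }
        }
        return false;
    }
}

$names = [
    'product 1 brandX',
    'brandX product 1',
    'brandX product 1 RED',
    'product 1 brandX',
    'product 2 brandY',
    'product 2 brandY RED',
];

$counts = []; // name => number of similar entries
foreach ($names as $name) {
    $key = get_similare_key($counts, $name);
    if ($key === false) {
        $counts[$name] = 1;   // first time a name like this is seen
    } else {
        $counts[$key]++;      // similar to an existing key
    }
}

print_r($counts);
// ['product 1 brandX' => 4, 'product 2 brandY' => 2]
```

Two caveats with this approach: the first name seen becomes the key, so it is not necessarily the shortest one, and because one differing word is tolerated, names such as 'product 1 brandX' and 'product 2 brandX' would be merged into one group.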

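My answer is based on two assumptions about how the products should be grouped:

  1. Although id_product may be missing, where it is present it is correct and sufficient to match two products;

  2. For two product names to match, the longer name (the one with more words) must contain all the words of the shorter name (the one with fewer words).

Based on these assumptions, here is a function that determines whether two individual products match (that is, should be grouped together), plus a helper function that extracts the words from a name: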

function productsMatch(array $product1, array $product2)
{
    if (
        !empty($product1['id_product'])
        && !empty($product2['id_product'])
        && $product1['id_product'] === $product2['id_product']
    ) {
        // match based on id_product
        return true;
    }
    $words1 = getWordsFromProduct($product1);
    $words2 = getWordsFromProduct($product2);
    $min_word_count = min(count($words1), count($words2));
    $match_word_count = count(array_intersect_key($words1, $words2));
    if ($min_word_count >= 1 && $match_word_count === $min_word_count) {
        // match based on name similarity
        return true;
    }
    // no match
    return false;
}

function getWordsFromProduct(array $product)
{
    $name = mb_strtolower($product['name']);
    preg_match_all('/\S+/', $name, $matches);
    $words = array_flip($matches[0]);
    return $words;
}
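
To illustrate why `array_flip` combined with `array_intersect_key` acts as a word-set comparison, here is a small sketch. `wordSet` is a hypothetical stand-in for `getWordsFromProduct` that takes a plain string, and it uses `strtolower` on the assumption that the names are ASCII (the answer's helper uses `mb_strtolower`).

```php
<?php
// Hypothetical helper for illustration: the words of a name become array keys,
// so repeated words collapse and word order no longer matters.
function wordSet($name)
{
    preg_match_all('/\S+/', strtolower($name), $matches);
    return array_flip($matches[0]);
}

$short = wordSet('product 1 brandX');
$long  = wordSet('brandX product 1 RED');
$other = wordSet('product 2 brandY');

// Number of words the shorter name shares with each candidate.
$sharedWithLong  = count(array_intersect_key($short, $long));
$sharedWithOther = count(array_intersect_key($short, $other));

// The names match when every word of the shorter name is shared.
var_dump($sharedWithLong === count($short));  // bool(true)
var_dump($sharedWithOther === count($short)); // bool(false)
```

Using the words as keys keeps the comparison independent of word order and letter case, which is what lets 'product 1 brandX' match 'brandX product 1 RED'.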

This function can then be used to group the products:

function groupProducts(array $data)
{
    $groups = array();
    foreach ($data as $product1) {
        foreach ($groups as $key => $products) {
            foreach ($products as $product2) {
                if (productsMatch($product1, $product2)) {
                    $groups[$key][] = $product1;
                    continue 3; // foreach ($data as $product1)
                }
            }
        }
        $groups[] = array($product1);
    }
    return $groups;
}

This function can then be used to extract the shortest name and the count for each group:

function uniqueProducts(array $groups)
{
    $uniques = array();
    foreach ($groups as $products) {
        $shortest_name = '';
        $shortest_length = PHP_INT_MAX;
        $count = 0;
        foreach ($products as $product) {
            $length = mb_strlen($product['name']);
            if ($length < $shortest_length) {
                $shortest_name = $product['name'];
                $shortest_length = $length;
            }
            $count++;
        }
        $uniques[] = array(
            'name' => $shortest_name,
            'count' => $count,
        );
    }
    return $uniques;
}

So, combining all 4 functions, you can get the uniques as follows (tested with PHP 5.6):

$data[0]['name'] = 'product 1 brandX';
$data[0]['id_product'] = '77777777';
$data[1]['name'] = 'brandX product 1';
$data[1]['id_product'] = '77777777';
$data[2]['name'] = 'brandX product 1 RED';
$data[2]['id_product'] = '77777777';
$data[3]['name'] = 'product 1 brandX';
$data[3]['id_product'] = '';
$data[4]['name'] = 'product 2 brandY';
$data[4]['id_product'] = '8888888';
$data[5]['name'] = 'product 2 brandY RED';
$data[5]['id_product'] = '';

$groups = groupProducts($data);
$uniques = uniqueProducts($groups);
var_dump($uniques); 

This gives the output:

array(2) {
  [0]=>
  array(2) {
    ["name"]=>
    string(16) "product 1 brandX"
    ["count"]=>
    int(4)
  }
  [1]=>
  array(2) {
    ["name"]=>
    string(16) "product 2 brandY"
    ["count"]=>
    int(2)
  }
}
