
Python: Optimise the function to find frequent itemsets of size k given candidate itemsets

I have written a function to find the frequency of itemsets of size k given candidate itemsets. The dataset contains more than 16000 transactions. Can someone please help me optimise this function? In its current form it takes about 45 minutes to execute with minSupport=1.

Sample dataset

[dataset image in the original post]

Algorithm 0 (See other algorithms below)

I implemented a boost of your algorithm using Numba. Numba is a JIT compiler that translates Python code into highly optimized machine code via LLVM. For many algorithms Numba achieves a speed boost of 50-200x.

To use Numba you have to install it through pip install numba. Note that at the time this answer was written, Numba only supported Python <= 3.8; support for 3.9 had not yet been released.
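
As a quick, hypothetical sketch of the njit workflow (a toy function, not the OP's code), decorating a plain Python loop is all it takes:

import numba, numpy as np

@numba.njit(cache = True)
def count_matches(arr, target):
    # A plain Python loop; Numba compiles it to machine code on the first call.
    n = 0
    for x in arr:
        if x == target:
            n += 1
    return n

data = np.random.randint(0, 100, size = 1000000)
count_matches(data, 42)         # the first call triggers JIT compilation
print(count_matches(data, 42))  # later calls run at native speed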

I have rewritten your code a bit to satisfy Numba's compilation requirements; my code should behave identically to yours, but please run some tests.

My Numba-optimized code should give you a very good speedup!

I also created some short artificial example input data for testing.

Try it online!

import numba, numpy as np, pandas as pd

@numba.njit(cache = True)
def selectLkNm(dataSet, Ck, minSupport):
    dict_data = {}
    transactions = dataSet.shape[0]
    for items in Ck:
        count = 0
        while count < transactions:
            if items not in dict_data:
                dict_data[items] = 0
            for item in items:
                # Linear scan of the current transaction for this item.
                for e in dataSet[count, :]:
                    if item == e:
                        break
                else:
                    # Item not found in this transaction: give up on this itemset.
                    break
            else:
                # Every item was found: this transaction supports the itemset.
                dict_data[items] += 1
            count += 1
    Lk = {}
    for k, v in dict_data.items():
        if v >= minSupport:
            Lk[k] = v
    return Lk

def selectLk(dataSet, Ck, minSupport):
    # Numba's njit can't consume the Python set directly, so copy Ck into a typed list.
    tCk = numba.typed.List()
    for e in Ck:
        tCk.append(e)
    return selectLkNm(dataSet.values, tCk, minSupport)

dataset = pd.DataFrame([[100,160,100,160],[170,180,190,200],[100,160,190,200]])
C1 = set()
C1.add((100, 160))
C1.add((170, 180))
C1.add((190, 200))
Lk = selectLk(dataset, C1, 2)
print(Lk)

Output:

{(100, 160): 2, (190, 200): 2}

Algorithm 1 (See other algorithms below)

I improved Algorithm 0 (above) by sorting your data. This gives a good speedup if you have many values inside your Ck or if each tuple inside Ck is quite long.

Try it online!

import numba, numpy as np, pandas as pd

@numba.njit(cache = True)
def selectLkNm(dataSet, Ck, minSupport):
    assert dataSet.ndim == 2
    # Sort each transaction row once so items can be located by binary search.
    dataSet2 = np.empty_like(dataSet)
    for i in range(dataSet.shape[0]):
        dataSet2[i] = np.sort(dataSet[i])
    dataSet = dataSet2
    dict_data = {}
    transactions = dataSet.shape[0]
    for items in Ck:
        count = 0
        while count < transactions:
            if items not in dict_data:
                dict_data[items] = 0
            for item in items:
                # Binary search instead of a linear scan.
                ix = np.searchsorted(dataSet[count, :], item)
                if not (ix < dataSet.shape[1] and dataSet[count, ix] == item):
                    break
            else:
                # Every item was found: this transaction supports the itemset.
                dict_data[items] += 1
            count += 1
    Lk = {}
    for k, v in dict_data.items():
        if v >= minSupport:
            Lk[k] = v
    return Lk

def selectLk(dataSet, Ck, minSupport):
    # Numba's njit can't consume the Python set directly, so copy Ck into a typed list.
    tCk = numba.typed.List()
    for e in Ck:
        tCk.append(e)
    return selectLkNm(dataSet.values, tCk, minSupport)

dataset = pd.DataFrame([[100,160,100,160],[170,180,190,200],[100,160,190,200]])
C1 = set()
C1.add((100, 160))
C1.add((170, 180))
C1.add((190, 200))
Lk = selectLk(dataset, C1, 2)
print(Lk)

Output:

{(100, 160): 2, (190, 200): 2}

Algorithm 2 (See other algorithms below)

If you're not allowed to use Numba, then I suggest the following improvements to your algorithm. I pre-sort your dataset so that each item is searched not in O(N) time but in O(log N) time, which is much, much faster.
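
The idea on toy data (an illustrative snippet only, not part of the solution): once a transaction row is sorted, membership of an item can be tested with np.searchsorted, i.e. a binary search, instead of a linear scan:

import numpy as np

row = np.sort(np.array([100, 160, 100, 160]))  # a sorted transaction row
item = 160
ix = np.searchsorted(row, item)                # O(log N) binary search
found = ix < len(row) and row[ix] == item      # check the insertion point actually hits the item
print(found)                                   # True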

I see in your code that you used a pandas dataframe, which means you have pandas installed, and if you installed pandas then you definitely have NumPy, so I decided to use it. You can't have pandas dataframes without NumPy.

Try it online!

import numpy as np, pandas as pd, collections

def selectLk(dataSet,Ck,minSupport):
    # Sort each transaction row once so items can be located by binary search.
    dataSet = np.sort(dataSet.values, axis = 1)
    dict_data = collections.defaultdict(int)
    transactions = dataSet.shape[0]
    for items in Ck:
        count = 0
        while count < transactions:
            for item in items:
                ix = np.searchsorted(dataSet[count, :], item)
                if not (ix < dataSet.shape[1] and dataSet[count, ix] == item):
                    break
            else:
                # Every item was found: this transaction supports the itemset.
                dict_data[items] += 1
            count += 1
    Lk = {k : v for k, v in dict_data.items() if v >= minSupport}
    return Lk
    
dataset = pd.DataFrame([[100,160,100,160],[170,180,190,200],[100,160,190,200]])
C1 = set()
C1.add((100, 160))
C1.add((170, 180))
C1.add((190, 200))
Lk = selectLk(dataset, C1, 2)
print(Lk)

Output:

{(100, 160): 2, (190, 200): 2}

Algorithm 3

I just had the idea that the sorting part of Algorithm 2 may not be the bottleneck; the transactions while loop is probably the bottleneck instead.
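
One quick way to test this hypothesis (a hypothetical timing sketch, with made-up shapes matching the question's roughly 16000 transactions) is to time the sorting step alone and compare it with the total runtime of selectLk:

import time, numpy as np

values = np.random.randint(0, 1000, size = (16000, 26))
t0 = time.perf_counter()
sorted_values = np.sort(values, axis = 1)  # the sorting part of Algorithm 2
t1 = time.perf_counter()
print(f"sorting took {t1 - t0:.4f}s")      # compare against the total runtime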

So to improve the situation I decided to implement and use a faster algorithm with a 2D searchsorted version (there is no built-in 2D version in NumPy, so it had to be implemented separately). It has no long pure-Python loops; most of the time is spent inside NumPy functions.
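
To illustrate the row-offset trick behind this searchsorted2d (a toy sketch with made-up numbers): each sorted row is shifted into its own disjoint value range, so one flat 1D searchsorted call answers a per-row query for every row at once.

import numpy as np

a = np.array([[10, 20, 30],
              [15, 25, 35]])          # each row already sorted
s = np.array([0, 100])                # offsets large enough to keep rows disjoint
a_scaled = (a + s[:, None]).ravel()   # [ 10  20  30 115 125 135]
b = np.array([20, 25])                # one query value per row
ix = np.searchsorted(a_scaled, b + s) - np.arange(2) * a.shape[1]
print(ix)                             # [1 1] -> index of each query within its own row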

Please try this Algorithm 3 to see whether it is faster; it should only be faster if the inner while loop, rather than the sorting, was the bottleneck.

Try it online!

import numpy as np, pandas as pd, collections

def selectLk(dataSet, Ck, minSupport):
    def searchsorted2d(a, bs):
        # Shift each sorted row of `a` into its own disjoint value range so that a
        # single flat 1D searchsorted call answers one query per row for all rows at once.
        s = np.r_[0, (np.maximum(a.max(1) - a.min(1) + 1, bs.ravel().max(0)) + 1).cumsum()[:-1]]
        a_scaled = (a + s[:, None]).ravel()
        def sub(b):
            b_scaled = b + s
            # Map the flat indices back to per-row indices.
            return np.searchsorted(a_scaled, b_scaled) - np.arange(len(s)) * a.shape[1]
        return sub

    assert dataSet.values.ndim == 2, dataSet.values.ndim
    dataSet = np.sort(dataSet.values, axis = 1)
    dict_data = collections.defaultdict(int)
    Ck = np.array(list(Ck))
    assert Ck.ndim == 2, Ck.ndim
    ss = searchsorted2d(dataSet, Ck)
    for items in Ck:
        cnts = np.zeros((dataSet.shape[0],), dtype = np.int64)
        for item in items:
            # Query every transaction row for this item in one vectorized call.
            bs = item.repeat(dataSet.shape[0])
            ixs = np.minimum(ss(bs), dataSet.shape[1] - 1)
            cnts[...] += (dataSet[(np.arange(dataSet.shape[0]), ixs)] == bs).astype(np.uint8)
        # A transaction supports the itemset only if every item was found in it.
        dict_data[tuple(items)] += int((cnts == len(items)).sum())
    return {k : v for k, v in dict_data.items() if v >= minSupport}
    
dataset = pd.DataFrame([[100,160,100,160],[170,180,190,200],[100,160,190,200]])
C1 = set()
C1.add((100, 160))
C1.add((170, 180))
C1.add((190, 200))
Lk = selectLk(dataset, C1, 2)
print(Lk)

Output:

{(100, 160): 2, (190, 200): 2}

I have changed the order of execution of your code. However, since I do not have access to your actual input data, it is difficult to check whether the optimized code produces the expected outputs and how much speedup you gained.

Algorithm 0

import pandas as pd
import numpy as np
from collections import defaultdict

def selectLk(dataSet,Ck,minSupport):
    dict_data = defaultdict(int)
    # Iterate transactions in the outer loop and candidate itemsets in the inner one.
    for _, row in dataSet.iterrows():
        for items in Ck:
            # all(...) is True (adds 1) only when every item occurs in the row.
            dict_data[items] += all(item in row.values for item in items)
    Lk = { k : v for k,v in dict_data.items() if v > minSupport}
    return Lk

if __name__ == '__main__':
    data = list(range(0, 1000, 10))
    df_data = {}
    for i in range(26):
        sample = np.random.choice(data, size=16000, replace=True)
        df_data[f"d{i}"] = sample
    dataset = pd.DataFrame(df_data)
    C1 = set()
    C1.add((100, 160))
    C1.add((170, 180))
    C1.add((190, 200))
    Lk1 = selectLk(dataset, C1, 1)
    dataset = pd.DataFrame([[100,160,100,160],[170,180,190,200],[100,160,190,200]])
    Lk2 = selectLk(dataset, C1, 1)
    print(Lk1)
    print(Lk2)

Algorithm 1

Algorithm 1 utilizes numpy.equal.outer, which creates a boolean mask of any matching elements in the Ck tuples. Then, the .all() operation is applied.
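
As a toy illustration of the shapes involved (hypothetical data, not the OP's dataset): comparing an (N, M) array of transactions against a k-item tuple yields an (N, M, k) boolean mask, which is then reduced per row.

import numpy as np

rows = np.array([[100, 160, 100, 160],
                 [170, 180, 190, 200]])
items = (100, 160)
mask = np.equal.outer(rows, items)  # shape (2, 4, 2): element-by-item comparisons
present = mask.any(axis=1)          # shape (2, 2): does each item occur in each row?
both = present.all(axis=1)          # shape (2,): do all items occur in the row?
print(both)                         # [ True False]

The full function counts the rows where that combined mask is True: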

def selectLk(dataSet, Ck, minSupport):
    dict_data = defaultdict(int)
    dataSet_np = dataSet.to_numpy(copy=False)
    for items in Ck:
        # Keep the rows in which every item of the tuple occurs at least once;
        # the number of such rows is the itemset's support count.
        dict_data[items] = dataSet[np.equal.outer(dataSet_np, items).any(axis=1).all(axis=1)].shape[0]
    Lk = { k : v for k, v in dict_data.items() if v > minSupport}
    return Lk

Result:

{(190, 200): 811, (170, 180): 797, (100, 160): 798}
{(190, 200): 2, (100, 160): 2}
