Python: Optimise the function to find frequent itemsets of size k given candidate itemsets
I have written a function that counts the frequency of itemsets of size k, given a set of candidate itemsets. The dataset contains more than 16,000 transactions. Can someone please help me optimise this function? In its current form it takes about 45 minutes to run with minSupport=1.
Sample dataset
Algorithm 0 (see other algorithms below)
I implemented a faster version of your algorithm using Numba. Numba is a JIT compiler that translates a subset of Python and NumPy code into optimized machine code through LLVM. For many algorithms Numba can achieve a speedup of 50-200x.

To use Numba, install it with pip install numba. Note that at the time of writing Numba supported only Python <= 3.8; a release for 3.9 was not yet available.

I have rewritten your code slightly to satisfy Numba's compilation requirements. My code should behave identically to yours, but please run some tests to confirm. The Numba-optimized code should give you a very good speedup. I also created a short artificial input example for testing.

    import numba, numpy as np, pandas as pd

    @numba.njit(cache = True)
    def selectLkNm(dataSet,Ck,minSupport):
        dict_data = {}
        transactions = dataSet.shape[0]
        for items in Ck:
            count = 0
            while count < transactions:
                if items not in dict_data:
                    dict_data[items] = 0
                for item in items:
                    # Linear scan of the current transaction row for this item
                    for e in dataSet[count, :]:
                        if item == e:
                            break
                    else:
                        # Item not found in the row: the itemset is not supported here
                        break
                else:
                    # Every item was found in the row
                    dict_data[items] += 1
                count += 1
        # Keep only itemsets that reach the minimum support
        Lk = {}
        for k, v in dict_data.items():
            if v >= minSupport:
                Lk[k] = v
        return Lk

    def selectLk(dataSet, Ck, minSupport):
        # Copy the candidates into a Numba typed list before calling the compiled function
        tCk = numba.typed.List()
        for e in Ck:
            tCk.append(e)
        return selectLkNm(dataSet.values, tCk, minSupport)

    dataset = pd.DataFrame([[100,160,100,160],[170,180,190,200],[100,160,190,200]])
    C1 = set()
    C1.add((100, 160))
    C1.add((170, 180))
    C1.add((190, 200))
    Lk = selectLk(dataset, C1, 2)
    print(Lk)

Output:
{(100, 160): 2, (190, 200): 2}
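
Because the answer asks for the rewrite to be tested, a plain-Python reference implementation can be handy for cross-checking the faster variants. The sketch below is not part of the original answer: the name support_counts_reference is hypothetical, and it assumes that a transaction supports an itemset when every item of the itemset occurs somewhere in that row, with the same >= minSupport filter as the code above.

```python
def support_counts_reference(transactions, candidates, min_support):
    """Count, for each candidate itemset, the rows that contain all of its items.

    Hypothetical helper for cross-checking; not the original poster's function.
    """
    # Convert each row to a set once, so membership tests are O(1) on average
    rows = [set(t) for t in transactions]
    counts = {}
    for items in candidates:
        wanted = set(items)
        counts[items] = sum(1 for row in rows if wanted <= row)
    # Same filter as the answer's code: keep counts >= min_support
    return {k: v for k, v in counts.items() if v >= min_support}


transactions = [[100, 160, 100, 160], [170, 180, 190, 200], [100, 160, 190, 200]]
candidates = {(100, 160), (170, 180), (190, 200)}
reference = support_counts_reference(transactions, candidates, 2)
print(reference)
```

Comparing this dictionary with the result of an optimized selectLk on the same input is a cheap way to confirm that the behaviour did not change.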
Algorithm 1 (see other algorithms below)
I improved Algorithm 0 (above) by sorting each row of your data, so that every item can be located with a binary search (np.searchsorted) instead of a linear scan. It will give a good speedup if Ck contains many values or if each tuple inside Ck is quite long.

    import numba, numpy as np, pandas as pd

    @numba.njit(cache = True)
    def selectLkNm(dataSet,Ck,minSupport):
        assert dataSet.ndim == 2
        dataSet2 = np.empty_like(dataSet)
        for i in range(dataSet.shape[0]):
            dataSet2[i] = np.sort(dataSet[i])
        dataSet = dataSet2
        dict_data = {}
        transactions = dataSet.shape[0]
        for items in Ck:
            count = 0
            while count < transactions:
                if items not in dict_data:
                    dict_data[items] = 0
                for item in items:
                    ix = np.searchsorted(dataSet[count, :], item)
                    if not (ix < dataSet.shape[1] and dataSet[count, ix] == item):
                        break
                else:
                    dict_data[items] += 1
                count += 1
        Lk = {}
        for k, v in dict_data.items():
            if v >= minSupport:
                Lk[k] = v
        return Lk

    def selectLk(dataSet, Ck, minSupport):
        tCk = numba.typed.List()
        for e in Ck:
            tCk.append(e)
        return selectLkNm(dataSet.values, tCk, minSupport)

    dataset = pd.DataFrame([[100,160,100,160],[170,180,190,200],[100,160,190,200]])
    C1 = set()
    C1.add((100, 160))
    C1.add((170, 180))
    C1.add((190, 200))
    Lk = selectLk(dataset, C1, 2)
    print(Lk)

Output:
{(100, 160): 2, (190, 200): 2}
Algorithm 2 (see other algorithms below)
If you're not allowed to use Numba, I suggest the following improvements to your algorithm. I pre-sort each row of your dataset so that searching for an item takes O(log N) time instead of O(N), where N is the number of values in a transaction row. This is much faster for long rows.

Your code uses a pandas DataFrame, which means pandas is installed, and since pandas depends on NumPy you definitely have NumPy as well, so I decided to use it.

    import numpy as np, pandas as pd, collections

    def selectLk(dataSet,Ck,minSupport):
        # Sort every row once so each item lookup can use binary search
        dataSet = np.sort(dataSet.values, axis = 1)
        dict_data = collections.defaultdict(int)
        transactions = dataSet.shape[0]
        for items in Ck:
            count = 0
            while count < transactions:
                for item in items:
                    ix = np.searchsorted(dataSet[count, :], item)
                    # Item is missing if the insertion point is past the end or holds another value
                    if not (ix < dataSet.shape[1] and dataSet[count, ix] == item):
                        break
                else:
                    dict_data[items] += 1
                count += 1
        Lk = {k : v for k, v in dict_data.items() if v >= minSupport}
        return Lk

    dataset = pd.DataFrame([[100,160,100,160],[170,180,190,200],[100,160,190,200]])
    C1 = set()
    C1.add((100, 160))
    C1.add((170, 180))
    C1.add((190, 200))
    Lk = selectLk(dataset, C1, 2)
    print(Lk)

Output:
{(100, 160): 2, (190, 200): 2}
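
The membership test at the heart of Algorithm 2 can be looked at in isolation. The following is a minimal sketch, assuming a single row that is already sorted; contains_sorted is a hypothetical helper name and does not appear in the answer's code.

```python
import numpy as np


def contains_sorted(sorted_row, item):
    """Binary-search membership test on a sorted 1D array (hypothetical helper)."""
    # searchsorted returns the leftmost position where item could be inserted
    ix = np.searchsorted(sorted_row, item)
    # item is present only if that position is inside the row and holds item
    return bool(ix < sorted_row.shape[0] and sorted_row[ix] == item)


row = np.sort(np.array([190, 100, 200, 160]))
print(contains_sorted(row, 160))  # True
print(contains_sorted(row, 170))  # False: position 2 holds 190
print(contains_sorted(row, 999))  # False: insertion point is past the end
```

The bounds check matters: for a value larger than everything in the row, searchsorted returns the row length, and indexing with it would raise an IndexError.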
Algorithm 3
It occurred to me that the sorting part of Algorithm 2 may not be the bottleneck; the pure-Python while loop over transactions is more likely to be one.

To improve on this, I implemented a faster algorithm built on a 2D version of searchsorted (NumPy has no built-in 2D version, so it had to be written separately). It has no long pure-Python loops, and most of the time is spent inside NumPy functions.

Please check whether Algorithm 3 is faster for you. It should be faster only if the inner while loop, rather than the sorting, was the bottleneck.

    import numpy as np, pandas as pd, collections

    def selectLk(dataSet, Ck, minSupport):
        def searchsorted2d(a, bs):
            # Shift every row by a growing offset, flatten, and search the flat array in one call
            s = np.r_[0, (np.maximum(a.max(1) - a.min(1) + 1, bs.ravel().max(0)) + 1).cumsum()[:-1]]
            a_scaled = (a + s[:, None]).ravel()
            def sub(b):
                b_scaled = b + s
                return np.searchsorted(a_scaled, b_scaled) - np.arange(len(s)) * a.shape[1]
            return sub

        assert dataSet.values.ndim == 2, dataSet.values.ndim
        dataSet = np.sort(dataSet.values, axis = 1)
        dict_data = collections.defaultdict(int)
        transactions = dataSet.shape[0]
        Ck = np.array(list(Ck))
        assert Ck.ndim == 2, Ck.ndim
        ss = searchsorted2d(dataSet, Ck)
        for items in Ck:
            # cnts[i] counts how many items of the current itemset occur in row i
            cnts = np.zeros((dataSet.shape[0],), dtype = np.int64)
            for item in items:
                bs = item.repeat(dataSet.shape[0])
                ixs = np.minimum(ss(bs), dataSet.shape[1] - 1)
                cnts[...] += (dataSet[(np.arange(dataSet.shape[0]), ixs)] == bs).astype(np.uint8)
            dict_data[tuple(items)] += int((cnts == len(items)).sum())
        return {k : v for k, v in dict_data.items() if v >= minSupport}

    dataset = pd.DataFrame([[100,160,100,160],[170,180,190,200],[100,160,190,200]])
    C1 = set()
    C1.add((100, 160))
    C1.add((170, 180))
    C1.add((190, 200))
    Lk = selectLk(dataset, C1, 2)
    print(Lk)

Output:
{(100, 160): 2, (190, 200): 2}
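
The idea behind a row-wise searchsorted may be easier to follow in a reduced form. The sketch below is a simplified illustration and not the searchsorted2d function above: it assumes integer data whose rows are already sorted, looks up one value in all rows at once, and uses a hypothetical name, row_membership. Each row is shifted into its own disjoint band of values, so the flattened array is globally sorted and a single np.searchsorted call serves every row.

```python
import numpy as np


def row_membership(sorted_rows, value):
    """Return a boolean array: does `value` occur in each row? (hypothetical helper)

    Assumes `sorted_rows` is a 2D integer array with every row sorted ascending.
    """
    n_rows = sorted_rows.shape[0]
    lo = min(int(sorted_rows.min()), int(value))
    hi = max(int(sorted_rows.max()), int(value))
    span = hi - lo + 1                       # width of one row's band
    offsets = np.arange(n_rows) * span       # row i maps into [i*span, (i+1)*span)
    flat = (sorted_rows - lo + offsets[:, None]).ravel()  # globally sorted
    targets = (value - lo) + offsets         # the same value, shifted into each band
    idx = np.searchsorted(flat, targets)
    idx = np.minimum(idx, flat.size - 1)     # clamp "past the end" results
    return flat[idx] == targets


rows = np.sort(np.array([[100, 160, 100, 160],
                         [170, 180, 190, 200],
                         [100, 160, 190, 200]]), axis=1)
print(row_membership(rows, 160))  # [ True False  True]
print(row_membership(rows, 200))  # [False  True  True]
```

Calling this once per item of an itemset and combining the boolean arrays with a logical AND gives the rows that support the whole itemset.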
I have changed the order of execution of your code: the outer loop now iterates over the rows of the dataset and the inner loop over the candidates in Ck. However, since I do not have access to your actual input data, it is difficult to check whether the optimized code produces the expected output and how much of a speedup you gain.

    import pandas as pd
    import numpy as np
    from collections import defaultdict

    def selectLk(dataSet,Ck,minSupport):
        dict_data = defaultdict(int)
        for _, row in dataSet.iterrows():
            for items in Ck:
                # Adds 1 (True) when every item of the itemset occurs in this row
                dict_data[items] += all(item in row.values for item in items)
        # Note: strict '>' here, so itemsets with count == minSupport are dropped
        Lk = { k : v for k,v in dict_data.items() if v > minSupport}
        return Lk

    if __name__ == '__main__':
        # Synthetic dataset: 16000 rows, 26 columns, values drawn from 0, 10, ..., 990
        data = list(range(0, 1000, 10))
        df_data = {}
        for i in range(26):
            sample = np.random.choice(data, size=16000, replace=True)
            df_data[f"d{i}"] = sample
        dataset = pd.DataFrame(df_data)
        C1 = set()
        C1.add((100, 160))
        C1.add((170, 180))
        C1.add((190, 200))
        Lk1 = selectLk(dataset, C1, 1)
        dataset = pd.DataFrame([[100,160,100,160],[170,180,190,200],[100,160,190,200]])
        Lk2 = selectLk(dataset, C1, 1)
        print(Lk1)
        print(Lk2)

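
Since the real input is not available, one way to estimate the speedup is to time competing implementations on synthetic data of a similar shape. The following is a rough sketch under stated assumptions: best_time, make_synthetic, and count_rows_vectorized are hypothetical names, and count_rows_vectorized is only a stand-in for whichever selectLk variant is being measured.

```python
import timeit

import numpy as np


def best_time(fn, *args, repeats=3):
    """Best wall-clock time in seconds of fn(*args) over a few runs."""
    return min(timeit.repeat(lambda: fn(*args), number=1, repeat=repeats))


def make_synthetic(n_rows=2000, n_cols=26, seed=0):
    """Random integer transactions drawn from 0, 10, ..., 990 (assumed shape)."""
    rng = np.random.default_rng(seed)
    return rng.choice(np.arange(0, 1000, 10), size=(n_rows, n_cols))


def count_rows_vectorized(data, candidates):
    """Stand-in counter: rows containing every item of each candidate."""
    out = {}
    for items in candidates:
        hit = np.ones(data.shape[0], dtype=bool)
        for item in items:
            hit &= (data == item).any(axis=1)
        out[items] = int(hit.sum())
    return out


data = make_synthetic()
candidates = [(100, 160), (170, 180), (190, 200)]
elapsed = best_time(count_rows_vectorized, data, candidates)
print(f"best of 3: {elapsed:.4f} s")
```

Taking the minimum over several runs reduces noise from other processes; the row count can be raised towards 16,000 once the quick runs look sensible.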
Algorithm 1 uses numpy.equal.outer, which builds a boolean mask marking every cell of the dataset that equals an element of a Ck tuple. Applying .any() over the columns and then .all() over the tuple elements selects the rows that contain the whole itemset.

    import numpy as np
    from collections import defaultdict

    def selectLk(dataSet, Ck, minSupport):
        dict_data = defaultdict(int)
        dataSet_np = dataSet.to_numpy(copy=False)
        for items in Ck:
            # Mask shape: (rows, columns, len(items)); count the rows where every item appears
            dict_data[items] = dataSet[np.equal.outer(dataSet_np, items).any(axis=1).all(axis=1)].shape[0]
        Lk = { k : v for k, v in dict_data.items() if v > minSupport}
        return Lk

Result:
{(190, 200): 811, (170, 180): 797, (100, 160): 798}
{(190, 200): 2, (100, 160): 2}
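
To make the shapes in the numpy.equal.outer approach concrete, here is a small step-by-step sketch on the three-row example. The intermediate variable names (mask, present, rows_with_all) are illustrative only and do not appear in the answer's code.

```python
import numpy as np

data = np.array([[100, 160, 100, 160],
                 [170, 180, 190, 200],
                 [100, 160, 190, 200]])
items = (190, 200)

# mask[r, c, j] is True where data[r, c] == items[j]
mask = np.equal.outer(data, items)
print(mask.shape)             # (3, 4, 2)

# present[r, j]: does items[j] occur in any column of row r?
present = mask.any(axis=1)

# rows_with_all[r]: does row r contain every item of the itemset?
rows_with_all = present.all(axis=1)
print(rows_with_all)          # [False  True  True]

support = int(rows_with_all.sum())
print(support)                # 2
```

The mask holds rows x columns x len(items) booleans, so for large datasets and long itemsets its memory footprint grows accordingly.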