Fastest way to count frequencies of ordered list entries
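I am counting the occurrences of non-overlapping grouped subsequences of length i in a binary list. For example, given the list [0, 1, 0, 1, 1, 0, 0, 0, 1, 1], I want to count the occurrences of [0,0] (one), [0,1] (two), [1,0] (one), and [1,1] (one).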
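I have written a function that does this (see below). However, I would like to know whether anything can be done to speed it up. It is already fairly fast (faster than my previous versions of the same function): it currently takes about 0.03 seconds for a list of length 100,000 with i = 2, and about 30 seconds for a list of length 100,000,000 with i = 2. (The time appears to grow linearly with the sequence length.) My ultimate goal, however, is to run this for multiple values of i on sequences of length close to 15 billion. Assuming linearity holds, i = 2 alone would take about 4.2 hours (higher values of i take longer because more unique subsequences have to be counted).
I am not sure whether much more speed can be gained here (at least while still working in Python), but I am open to suggestions on how to do this faster, with any method or language.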
from collections import Counter

def subseq_counter(i, l):
    """counts the frequency of unique, non-overlapping, grouped subsequences of length i in a binary list l"""
    grouped = [str(l[k:k + i]) for k in range(0, len(l), i)]
    # groups terms into i length subsequences
    if len(grouped[len(grouped) - 1]) != len(grouped[0]):
        grouped.pop(len(grouped) - 1)
    # removes any subsequences at the end that are not of length i
    grouped_sort = sorted(grouped)
    # necessary so as to make sure the output frequencies correlate to the ascending binary order of the subsequences
    grouped_sort_values = Counter(grouped_sort).values()
    # counts the elements' frequency
    freq_list = list(grouped_sort_values)
    return freq_list
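I know that removing the grouped_sort line gives a marginally faster execution time. However, I need the frequencies in the ascending binary order of the subsequences (so for i = 2 that is [0,0],[0,1],[1,0],[1,1]), and I have not found a better way to get that.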
I don't know whether it is faster, but try:
import numpy as np

# create data
bits = np.random.randint(0, 2, 10000)

def subseq_counter(i: int, l: np.ndarray):
    """
    Counts the number of subsequences of length i in the array l
    """
    # the list l is reshaped as a matrix of i columns, and
    # matrix-multiplied by the binary weights "power of 2"
    #           | [[2**2],
    #           |  [2**1],
    #           |  [2**0]]
    #           |___________
    # [[1,0,1], | 1*4 + 0*2 + 1*1 = 5
    #  [0,1,0], | 0*4 + 1*2 + 0*1 = 2
    #  ...,     | ....
    #  [1,1,1]] | 1*4 + 1*2 + 1*1 = 7
    # weights run from 2**(i-1) down to 2**0, so the first bit is the most significant
    iBits = l[:i*(l.size//i)].reshape(-1, i) @ (2**np.arange(i - 1, -1, -1))
    unique, counts = np.unique(iBits, return_counts=True)

    print(f"Counts for {i} bits:")
    for u, c in zip(unique, counts):
        print(f"{u:0{i}b}:{c}")

    return unique, counts

subseq_counter(2, bits)
subseq_counter(3, bits)
>>> Counts for 2 bits:
>>> 00:1264
>>> 01:1279
>>> 10:1237
>>> 11:1220
>>> Counts for 3 bits:
>>> 000:425
>>> 001:429
>>> 010:411
>>> 011:395
>>> 100:437
>>> 101:412
>>> 110:407
>>> 111:417
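It reshapes the list into an array of n rows by i columns and converts each row to an integer by multiplying its bits by the matching powers of two, so 00 maps to 0, 01 to 1, 10 to 2, and 11 to 3. It then counts the resulting values with np.unique().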
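To make the reshape-and-multiply step concrete, here is a small sketch that applies it to the example list from the question. The variable names (rows, weights, codes) are only illustrative and do not come from the answer.

```python
import numpy as np

bits = np.array([0, 1, 0, 1, 1, 0, 0, 0, 1, 1])
i = 2

# Drop any trailing partial group, then view the data as rows of i bits.
rows = bits[: i * (bits.size // i)].reshape(-1, i)
# Most significant bit first: the weights are [2, 1] for i = 2.
weights = 2 ** np.arange(i - 1, -1, -1)
# Each row becomes the integer whose binary digits are that row.
codes = rows @ weights

values, counts = np.unique(codes, return_counts=True)
print(rows.tolist())   # [[0, 1], [0, 1], [1, 0], [0, 0], [1, 1]]
print(codes.tolist())  # [1, 1, 2, 0, 3]
print(dict(zip(values.tolist(), counts.tolist())))  # {0: 1, 1: 2, 2: 1, 3: 1}
```

Because np.unique returns its values sorted, the counts come out in ascending binary order, but only for patterns that actually occur.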
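I'm not sure I understood the last part about the order. It seems unnecessary to build a giant list of subsequences. Use a generator to feed the subsequences to the counter; that way you also don't have to fiddle with indices: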
from collections import Counter

def count_subsequences(sequence, subseq_len=2):
    return Counter(subseq for subseq in zip(*[iter(sequence)] * subseq_len))

sequence = [0, 1, 0, 1, 1, 0, 0, 0, 1, 1]
counter = count_subsequences(sequence)

for subseq in (0, 0), (0, 1), (1, 0), (1, 1):
    print("{}: {}".format(subseq, counter[subseq]))
Output:
(0, 0): 1
(0, 1): 2
(1, 0): 1
(1, 1): 1
In this case, the function returns the Counter object itself, and the calling code displays the results in whatever order it chooses.
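The zip(*[iter(sequence)] * subseq_len) idiom can look opaque, so here is a small sketch of what it does. The helper name chunks is hypothetical and is not part of the answer.

```python
from collections import Counter

def chunks(sequence, size):
    """Hypothetical helper: yield non-overlapping tuples of `size` items."""
    it = iter(sequence)
    # The same iterator object is repeated `size` times, so zip pulls
    # `size` consecutive items for each tuple; a short tail is dropped.
    return zip(*[it] * size)

sequence = [0, 1, 0, 1, 1, 0, 0, 0, 1, 1]
print(list(chunks(sequence, 3)))  # [(0, 1, 0), (1, 1, 0), (0, 0, 1)]

counter = Counter(chunks(sequence, 3))
print(counter[(0, 1, 0)])  # 1
print(counter[(0, 0, 0)])  # 0 (a Counter reports missing keys as zero)
```

Since a Counter returns 0 for keys it has never seen, the caller can look up every possible pattern in any order without checking for membership first.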
Here is one way to do it:
from collections import Counter
from itertools import product

def subseq_counter(i, l):
    freq_list = [0] * 2 ** i
    binaryTupToInt = {binTup: j for j, binTup in enumerate(product((0, 1), repeat=i))}
    c = Counter(binaryTupToInt[tuple(l[k:k+i])] for k in range(0, len(l) // i * i, i))
    for k, v in c.items():
        freq_list[k] = v
    return freq_list

l = [0, 1, 0, 1, 1, 0, 0, 0, 1, 1]
i = 2
print(subseq_counter(i, l))
Output:
[1, 2, 1, 1]
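Position k in the returned list holds the count for the subsequence whose bits spell the integer k. As a sketch, a small helper (label_frequencies is a hypothetical name, not part of the answer) can attach the bit patterns to the counts:

```python
def label_frequencies(freq_list, i):
    """Hypothetical helper: pair each count with its i-bit pattern."""
    return {format(k, f"0{i}b"): count for k, count in enumerate(freq_list)}

print(label_frequencies([1, 2, 1, 1], 2))
# {'00': 1, '01': 2, '10': 1, '11': 1}
```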
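Notes:
With the code in this answer and i changed to 3, the output is: [0, 1, 1, 0, 0, 0, 1, 0]
This shows the frequencies of all possible binary values of length 3 in ascending order, from 0 (binary 0,0,0) to 7 (binary 1,1,1). In other words, 0,0,0 occurs 0 times, 0,0,1 occurs 1 time, 0,1,0 occurs 1 time, 0,1,1 occurs 0 times, and so on through 1,1,1, which occurs 0 times.
With the code in the question and i changed to 3, the output is: [1, 1, 1]
This output is hard to decipher because it is unlabeled: nothing shows that the non-zero results correspond to the 3-bit binary values 0,0,1, 0,1,0 and 1,1,0.
Update:
Here is a benchmark of several approaches on an input list of length 55 million (with i set to 2): the OP's code, the counting sort (this answer), numpy including the list-to-ndarray conversion overhead, and numpy without that overhead: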
foo_1 output:
[10000000, 15000000, 15000000, 15000000]
foo_2 output:
[10000000, 15000000, 15000000, 15000000]
foo_3 output:
[10000000 15000000 15000000 15000000]
foo_4 output:
[10000000 15000000 15000000 15000000]
Timeit results:
foo_1 (OP) ran in 32.20719700001064 seconds using 1 iterations
foo_2 (counting sort) ran in 17.91718759998912 seconds using 1 iterations
foo_3 (numpy with list-to-array conversion) ran in 9.713831000000937 seconds using 1 iterations
foo_4 (numpy) ran in 1.695262699999148 seconds using 1 iterations
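The clear winner is numpy, although unless the calling program can easily be changed to use ndarrays, the conversion required in this example slows things down by a factor of about 5.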
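One way to reduce the list-to-ndarray cost, assuming every element is a small non-negative int (0 to 255), is to pack the list into bytes and wrap that buffer. The sketch below only checks that both routes produce the same values; it makes no timing claim.

```python
import numpy as np

l = [0, 1, 0, 1, 1, 0, 0, 0, 1, 1]

# Generic conversion: NumPy inspects every Python int object.
a_generic = np.array(l)

# bytes(l) packs each value (must be in 0..255) into one byte,
# and np.frombuffer then wraps that buffer without copying it.
a_buffer = np.frombuffer(bytes(l), dtype=np.uint8)

print(np.array_equal(a_generic, a_buffer))  # True
print(a_buffer.flags.writeable)             # False: the view is read-only
```

The read-only flag matters only if later code writes into the array in place; slicing, reshaping, and counting are unaffected.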
Benchmarks, including some new solutions of mine:
For i=2:
2.9 s ± 0.0 s Kelly_NumPy
3.7 s ± 0.0 s Kelly_bytes_count
6.6 s ± 0.0 s Kelly_zip
7.8 s ± 0.1 s Colim_numpy
8.4 s ± 0.0 s Paul_genzip
8.6 s ± 0.0 s Kelly_bytes_split2
10.5 s ± 0.0 s Kelly_bytes_slices2
10.6 s ± 0.1 s Kelly_bytes_split1
16.1 s ± 0.0 s Kelly_bytes_slices1
20.9 s ± 0.1 s constantstranger
45.1 s ± 0.3 s original
For i=5:
2.3 s ± 0.0 s Kelly_NumPy
3.8 s ± 0.0 s Kelly_zip
4.5 s ± 0.0 s Paul_genzip
4.5 s ± 0.0 s Kelly_bytes_split2
5.2 s ± 0.0 s Kelly_bytes_split1
5.4 s ± 0.0 s Kelly_bytes_slices2
7.1 s ± 0.0 s Colim_numpy
7.2 s ± 0.0 s Kelly_bytes_slices1
9.3 s ± 0.0 s constantstranger
20.6 s ± 0.0 s Kelly_bytes_count
25.3 s ± 0.1 s original
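This was measured on a list of length n=1e6, with the times multiplied by 100 so that they roughly reflect the times for length 1e8. I minimally modified the other solutions so that they behave like your original, i.e., take a list of ints and return a list of ints in the right order. One or two of my slower solutions only work if the length is a multiple of their block size; I didn't bother making them work for all lengths, since they are slower anyway.
Full code (Try it online!):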
from time import time
import os
from collections import Counter
from itertools import repeat, chain, product
import numpy as np
from operator import itemgetter
from statistics import mean, stdev

def Kelly_NumPy(i, l):
    a = np.frombuffer(bytes(l), np.int8)
    stop = a.size // i * i
    s = a[:stop:i]
    for j in range(1, i):
        # slice up to stop so every column has the same length,
        # even when len(l) is not a multiple of i
        s = (s << 1) | a[j:stop:i]
    return np.unique(s, return_counts=True)[1].tolist()

def Kelly_zip(i, l):
    ctr = Counter(zip(*[iter(l)]*i))
    return [v for k, v in sorted(ctr.items())]

def Kelly_bytes_slices1(i, l):
    a = bytes(l)
    slices = [a[j:j+i] for j in range(0, len(a)//i*i, i)]
    ctr = Counter(slices)
    return [v for k, v in sorted(ctr.items())]

def Kelly_bytes_slices2(i, l):
    a = bytes(l)
    ig = itemgetter(*(slice(j, j+i) for j in range(0, 1000*i, i)))
    ctr = Counter(chain.from_iterable(
        ig(a[k:k+1000*i])
        for k in range(0, len(l), 1000*i)
    ))
    return [v for k, v in sorted(ctr.items())]

def Kelly_bytes_count(i, l):
    n = len(l)
    a = bytes(l)
    b = bytearray([2]) * (n + n//i)
    for j in range(i):
        b[j+1::i+1] = a[j::i]
    a = b
    ss = [bytes([2])]
    for _ in range(i):
        ss = [s+b for s in ss for b in [bytes([0]), bytes([1])]]
    return [a.count(s) for s in ss]

def Kelly_bytes_split1(i, l):
    n = len(l) // i
    stop = n * i
    a = bytes(l)
    sep = bytearray([2])
    b = sep * (stop + n - 1)
    for j in range(i):
        b[j::i+1] = a[j::i]
    ctr = Counter(bytes(b).split(sep))
    return [v for k, v in sorted(ctr.items())]

def Kelly_bytes_split2(i, l):
    n = len(l) // i
    stop = n * i
    a = bytes(l)
    sep = bytearray([2])
    b = sep * (5000*i + 4999)
    ctr = Counter()
    for k in range(0, stop, 5000*i):
        for j in range(i):
            b[j::i+1] = a[k+j:k+5000*i+j:i]
        ctr.update(bytes(b).split(sep))
    return [v for k, v in sorted(ctr.items())]

def original(i, l):
    grouped = [str(l[k:k + i]) for k in range(0, len(l), i)]
    if len(grouped[len(grouped) - 1]) != len(grouped[0]):
        grouped.pop(len(grouped) - 1)
    grouped_sort = sorted(grouped)
    grouped_sort_values = Counter(grouped_sort).values()
    freq_list = list(grouped_sort_values)
    return freq_list

def Paul_genzip(subseq_len, sequence):
    ctr = Counter(subseq for subseq in zip(*[iter(sequence)] * subseq_len))
    return [v for k, v in sorted(ctr.items())]

def constantstranger(i, l):
    freq_list = [0] * 2 ** i
    binaryTupToInt = {binTup: j for j, binTup in enumerate(product((0, 1), repeat=i))}
    c = Counter(binaryTupToInt[tuple(l[k:k+i])] for k in range(0, len(l) // i * i, i))
    for k, v in c.items():
        freq_list[k] = v
    return freq_list

def Colim_numpy(i: int, l):
    l = np.array(l)
    iBits = l[:i*(l.size//i)].reshape(-1, i) @ (2**np.arange(i-1, -1, -1).T)
    unique, counts = np.unique(iBits, return_counts=True)
    return counts.tolist()

funcs = [
    original,
    Colim_numpy,
    Paul_genzip,
    constantstranger,
    Kelly_NumPy,
    Kelly_bytes_count,
    Kelly_zip,
    Kelly_bytes_slices1,
    Kelly_bytes_slices2,
    Kelly_bytes_split1,
    Kelly_bytes_split2,
]

n = 10**6
i = 2

times = {f: [] for f in funcs}

def stats(f):
    # scale the three fastest runs from length n to length 1e8
    ts = [t/n*1e8 for t in sorted(times[f])[:3]]
    return f'{mean(ts):4.1f} s ± {stdev(ts):3.1f} s '

for _ in range(10):
    l = [b % 2 for b in os.urandom(n)]
    expect = None
    for f in funcs:
        t = time()
        result = f(i, l)
        t = time() - t
        times[f].append(t)
        # every solution must agree with the first one
        if expect is None:
            expect = result
        else:
            assert result == expect

for f in sorted(funcs, key=stats):
    print(stats(f), f.__name__,)
This is much faster. It uses Kelly's idea of using numpy.frombuffer instead of converting the list to a numpy array, and uses Pandas to count unique values, which is faster than numpy.unique for more than 100,000 results.
import numpy as np
import pandas as pd

def subseq_counter(i: int, l):
    l = np.frombuffer(bytes(l), np.int8)
    # int8 weights keep the product small, but they overflow for i >= 8
    iBits = l[:i*(l.size//i)].reshape(-1, i) @ (2**np.arange(i-1, -1, -1)).astype(np.int8)
    # bug fix: when there is not enough data (highly probable for large i),
    # iBits does not contain every possible value, so returning only the
    # counts of the unique values as a list may lose information
    answer = [0]*2**i  # empty counter including all possible values
    if len(iBits) > 100000:
        # loop variables are named value/count so they do not shadow i
        for value, count in pd.Series(iBits).value_counts().items():
            answer[value] = count
    else:
        unique, counts = np.unique(iBits, return_counts=True)
        for value, count in zip(unique, counts):
            answer[value] = count
    return answer
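Another way to get a zero-filled result for every possible pattern, offered only as a sketch and not benchmarked here, is np.bincount with its minlength argument. The function name subseq_counter_bincount and the int64 weights are assumptions of this example, not part of the answer above.

```python
import numpy as np

def subseq_counter_bincount(i, l):
    """Sketch: counts for all 2**i patterns, in ascending binary order."""
    a = np.frombuffer(bytes(l), dtype=np.uint8)
    rows = a[: i * (a.size // i)].reshape(-1, i)
    # int64 weights avoid overflow for larger i (at the cost of memory).
    weights = 2 ** np.arange(i - 1, -1, -1, dtype=np.int64)
    codes = rows @ weights
    # minlength pads patterns that never occur with a zero count.
    return np.bincount(codes, minlength=2 ** i).tolist()

l = [0, 1, 0, 1, 1, 0, 0, 0, 1, 1]
print(subseq_counter_bincount(2, l))  # [1, 2, 1, 1]
print(subseq_counter_bincount(3, l))  # [0, 1, 1, 0, 0, 0, 1, 0]
```

np.bincount allocates one slot per possible code, so this sketch is only practical while 2**i stays small enough to fit in memory.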