Fastest way to count frequencies of ordered list entries

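I'm counting the occurrences of non-overlapping grouped subsequences of length i in a binary list. For example, given the list
[0, 1, 0, 1, 1, 0, 0, 0, 1, 1], I want to count [0,0] (one), [0,1] (two), [1,0] (one), [1,1] (one).

I have written a function that does this (see below), but I would like to know whether anything can be done to speed it up. It is already fairly quick (faster than previous versions of the same function): it currently takes about 0.03 seconds for a list of length 100,000 with i = 2, and about 30 seconds for a list of length 100,000,000 with i = 2, so the time appears to grow linearly with the sequence length. However, my end goal is to do this for multiple values of i, with sequences of length close to 15 billion. Assuming linearity holds, i = 2 alone would take about 4.2 hours (higher values of i take longer, because more unique subsequences have to be counted).

I'm not sure whether much more speed can be gained here (at least while still working in Python), but I'm open to suggestions on how to do this faster, with any method or language.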

from collections import Counter

def subseq_counter(i, l):
    """Count the frequency of unique, non-overlapping, grouped subsequences of length i in a binary list l."""
    # Group the terms into subsequences of length i (stored as strings).
    grouped = [str(l[k:k + i]) for k in range(0, len(l), i)]
    # Drop the trailing subsequence if it is shorter than i.
    if len(grouped[-1]) != len(grouped[0]):
        grouped.pop()
    # Sorting is necessary so that the output frequencies follow the
    # ascending binary order of the subsequences.
    grouped_sort = sorted(grouped)
    # Count each element's frequency; Counter keeps first-seen (here sorted) order.
    grouped_sort_values = Counter(grouped_sort).values()
    freq_list = list(grouped_sort_values)
    return freq_list

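I know that removing the grouped_sort line would give a slightly faster execution time. However, I need to be able to read the frequencies in the ascending binary order of the subsequences (so for i = 2 that would be [0,0], [0,1], [1,0], [1,1]), and I haven't found a better way around this.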

I don't know whether it is faster, but try:


import numpy as np

# create data
bits = np.random.randint(0, 2, 10000)


def subseq_counter(i: int, l: np.ndarray):
    """
    Counts the number of subsequences of length i in the array l
    """
    # the array l is reshaped as a matrix of i columns, and
    # matrix-multiplied by the binary weights "power of 2"
    #          |  [[2**2],
    #          |   [2**1],
    #          |   [2**0]]
    #          |___________
    # [[1,0,1],| 1*4 + 0*2 + 1*1 = 5
    #  [0,1,0],| 0*4 + 1*2 + 0*1 = 2
    #  ...,    | ....
    #  [1,1,1]]| 1*4 + 1*2 + 1*1 = 7
    # the weights run from 2**(i-1) down to 2**0, so the first bit of
    # each group is the most significant one, as in the diagram above
    iBits = l[:i*(l.size//i)].reshape(-1, i) @ (2**np.arange(i-1, -1, -1))

    unique, counts = np.unique(iBits, return_counts=True)

    print(f"Counts for {i} bits:")
    for u, c in zip(unique, counts):
        print(f"{u:0{i}b}:{c}")

    return unique, counts

subseq_counter(2,bits)
subseq_counter(3,bits)


>>> Counts for 2 bits:
>>> 00:1264
>>> 01:1279
>>> 10:1237
>>> 11:1220
>>> Counts for 3 bits:
>>> 000:425
>>> 001:429
>>> 010:411
>>> 011:395
>>> 100:437
>>> 101:412
>>> 110:407
>>> 111:417

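What it does is reshape the list into an array of n rows by i columns and convert each row to an integer by multiplying its bits by powers of 2, mapping 00 to 0, 01 to 1, 10 to 2 and 11 to 3. The counting is then done with np.unique().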
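
As a small worked example of that encoding, the sketch below spells out the same arithmetic in plain Python, without NumPy, for the list from the question. The helper name `encode_groups` is made up for illustration and is not part of the answer's code.

```python
def encode_groups(bits, i):
    """Turn each complete, non-overlapping group of i bits into one integer."""
    # Weights 2**(i-1), ..., 2**1, 2**0: the first bit of a group is the most significant.
    weights = [2 ** (i - 1 - j) for j in range(i)]
    usable = len(bits) // i * i  # ignore a trailing incomplete group
    return [
        sum(b * w for b, w in zip(bits[k:k + i], weights))
        for k in range(0, usable, i)
    ]


bits = [0, 1, 0, 1, 1, 0, 0, 0, 1, 1]
codes = encode_groups(bits, 2)
print(codes)  # [1, 1, 2, 0, 3]
```

Counting how often each integer occurs then gives the frequency of each bit pattern.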

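I'm not really sure I understood the last part about the order. It seems unnecessary to build a giant list of subsequences. Use a generator to feed the subsequences to the counter; that way you don't have to fiddle with indices either: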

from collections import Counter


def count_subsequences(sequence, subseq_len=2):
    return Counter(subseq for subseq in zip(*[iter(sequence)] * subseq_len))

sequence = [0, 1, 0, 1, 1, 0, 0, 0, 1, 1]
counter = count_subsequences(sequence)

for subseq in (0, 0), (0, 1), (1, 0), (1, 1):
    print("{}: {}".format(subseq, counter[subseq]))

Output:

(0, 0): 1
(0, 1): 2
(1, 0): 1
(1, 1): 1

In this case the function returns the Counter object itself, and the calling code displays the results in whatever order it chooses.
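
The grouping idiom in count_subsequences may look cryptic at first: `[iter(sequence)] * subseq_len` repeats one and the same iterator, so `zip` pulls consecutive items from it to fill each tuple, and a trailing incomplete group is silently dropped. A minimal sketch of that behaviour (the variable names are illustrative only):

```python
sequence = [0, 1, 0, 1, 1, 0, 0, 0, 1, 1]

# Three references to one shared iterator: zip draws three consecutive items per tuple.
shared = iter(sequence)
groups = list(zip(shared, shared, shared))

print(groups)  # [(0, 1, 0), (1, 1, 0), (0, 0, 1)]; the final lone 1 is discarded
```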

Here is one way to do it:

from collections import Counter
from itertools import product

def subseq_counter(i,l):
    freq_list = [0] * 2 ** i
    binaryTupToInt = {binTup:j for j, binTup in enumerate(product((0,1),repeat=i))}
    c = Counter(binaryTupToInt[tuple(l[k:k+i])] for k in range(0, len(l) // i * i, i))
    for k, v in c.items():
        freq_list[k] = v
    return freq_list

l = [0, 1, 0, 1, 1, 0, 0, 0, 1, 1]
i = 2
print(subseq_counter(i, l))

Output:

[1, 2, 1, 1]

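Notes:

  • Using the code above and changing i to 3 gives:
     [0, 1, 1, 0, 0, 0, 1, 0]
    This shows the frequencies of all possible binary values of length 3, in ascending order from 0 (binary 0,0,0) to 7 (binary 1,1,1). In other words, 0,0,0 occurs 0 times, 0,0,1 occurs 1 time, 0,1,0 occurs 1 time, 0,1,1 occurs 0 times, and so on through 1,1,1, which occurs 0 times.
  • Using the code in the question and changing i to 3 gives:
     [1, 1, 1]
    This output is hard to decipher because it is not labelled: nothing in it tells us that the non-zero results correspond to the 3-bit binary values 0,0,1, 0,1,0 and 1,1,0.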
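
One way to make such a frequency list easier to read is to pair each entry with its bit pattern. The sketch below assumes the list is in ascending binary order and has one entry per possible value, as returned by the function in this answer; `label_frequencies` is a hypothetical helper name.

```python
def label_frequencies(freq_list, i):
    """Map each i-bit pattern (as a string) to its count, in ascending binary order."""
    return {format(k, f"0{i}b"): v for k, v in enumerate(freq_list)}


labelled = label_frequencies([0, 1, 1, 0, 0, 0, 1, 0], 3)
for pattern, count in labelled.items():
    print(pattern, count)
```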

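UPDATE:

Here are benchmarks of several approaches on an input list of length 55 million (with i set to 2): the OP's code (foo_1), counting sort (this answer, foo_2), numpy including the list-to-ndarray conversion overhead (foo_3), and numpy without that overhead (foo_4):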

foo_1 output:
[10000000, 15000000, 15000000, 15000000]
foo_2 output:
[10000000, 15000000, 15000000, 15000000]
foo_3 output:
[10000000 15000000 15000000 15000000]
foo_4 output:
[10000000 15000000 15000000 15000000]
Timeit results:
foo_1 (OP) ran in 32.20719700001064 seconds using 1 iterations
foo_2 (counting sort) ran in 17.91718759998912 seconds using 1 iterations
foo_3 (numpy with list-to-array conversion) ran in 9.713831000000937 seconds using 1 iterations
foo_4 (numpy) ran in 1.695262699999148 seconds using 1 iterations

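The clear winner is numpy, although unless the calling program can easily be changed to use ndarrays, the conversion required in this example slows things down roughly 5 to 6 times (9.7 s versus 1.7 s).

Benchmarks including some new solutions of mine: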

For i=2:
 2.9 s ± 0.0 s  Kelly_NumPy
 3.7 s ± 0.0 s  Kelly_bytes_count
 6.6 s ± 0.0 s  Kelly_zip
 7.8 s ± 0.1 s  Colim_numpy
 8.4 s ± 0.0 s  Paul_genzip
 8.6 s ± 0.0 s  Kelly_bytes_split2
10.5 s ± 0.0 s  Kelly_bytes_slices2
10.6 s ± 0.1 s  Kelly_bytes_split1
16.1 s ± 0.0 s  Kelly_bytes_slices1
20.9 s ± 0.1 s  constantstranger
45.1 s ± 0.3 s  original

For i=5:
 2.3 s ± 0.0 s  Kelly_NumPy
 3.8 s ± 0.0 s  Kelly_zip
 4.5 s ± 0.0 s  Paul_genzip
 4.5 s ± 0.0 s  Kelly_bytes_split2
 5.2 s ± 0.0 s  Kelly_bytes_split1
 5.4 s ± 0.0 s  Kelly_bytes_slices2
 7.1 s ± 0.0 s  Colim_numpy
 7.2 s ± 0.0 s  Kelly_bytes_slices1
 9.3 s ± 0.0 s  constantstranger
20.6 s ± 0.0 s  Kelly_bytes_count
25.3 s ± 0.1 s  original

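These timings are for a list of length n=1e6, multiplied by 100 so that they roughly reflect the times for length 1e8. I minimally modified the other solutions so that they do what your original does, i.e., take a list of ints and return a list of ints in the correct order. One or two of my slower solutions only work when the length is a multiple of their chunk size; I didn't bother making them work for all lengths since they are slower anyway.

Full code: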
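
The scaling described above can be sketched as follows: take the three fastest runs, rescale each from length n to length 1e8 assuming linear time, and report their mean and standard deviation. The function name `scaled_stats` and the sample timings are made up for illustration and are not measurements.

```python
from statistics import mean, stdev


def scaled_stats(run_times, n, target=10**8):
    """Mean and standard deviation of the 3 fastest runs, rescaled from length n to target."""
    best = [t / n * target for t in sorted(run_times)[:3]]
    return mean(best), stdev(best)


# Hypothetical timings (seconds) for a list of length 10**6.
m, s = scaled_stats([0.031, 0.029, 0.030, 0.045], n=10**6)
print(f"{m:4.1f} s ± {s:3.1f} s")
```

Keeping only the fastest runs reduces the influence of outliers such as the 0.045 s run in this made-up sample.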

def Kelly_NumPy(i, l):
    a = np.frombuffer(bytes(l), np.int8)
    stop = a.size // i * i
    s = a[:stop:i]
    for j in range(1, i):
        s = (s << 1) | a[j:stop:i]  # stop at `stop` so every slice has the same length
    return np.unique(s, return_counts=True)[1].tolist()


def Kelly_zip(i, l):
    ctr = Counter(zip(*[iter(l)]*i))
    return [v for k, v in sorted(ctr.items())]


def Kelly_bytes_slices1(i, l):
    a = bytes(l)
    slices = [a[j:j+i] for j in range(0, len(a)//i*i, i)]
    ctr = Counter(slices)
    return [v for k, v in sorted(ctr.items())]


def Kelly_bytes_slices2(i, l):
    a = bytes(l)
    ig = itemgetter(*(slice(j, j+i) for j in range(0, 1000*i, i)))
    ctr = Counter(chain.from_iterable(
        ig(a[k:k+1000*i])
        for k in range(0, len(l), 1000*i)
    ))
    return [v for k, v in sorted(ctr.items())]


def Kelly_bytes_count(i, l):
    n = len(l)
    a = bytes(l)
    b = bytearray([2]) * (n + n//i)
    for j in range(i):
        b[j+1::i+1] = a[j::i]
    a = b
    ss = [bytes([2])]
    for _ in range(i):
        ss = [s+b for s in ss for b in [bytes([0]), bytes([1])]]
    return [a.count(s) for s in ss]


def Kelly_bytes_split1(i, l):
    n = len(l) // i
    stop = n * i
    a = bytes(l)
    sep = bytearray([2])
    b = sep * (stop + n - 1)
    for j in range(i):
        b[j::i+1] = a[j::i]
    ctr = Counter(bytes(b).split(sep))
    return [v for k, v in sorted(ctr.items())]


def Kelly_bytes_split2(i, l):
    n = len(l) // i
    stop = n * i
    a = bytes(l)
    sep = bytearray([2])
    b = sep * (5000*i + 4999)
    ctr = Counter()
    for k in range(0, stop, 5000*i):
        for j in range(i):
            b[j::i+1] = a[k+j:k+5000*i+j:i]
        ctr.update(bytes(b).split(sep))
    return [v for k, v in sorted(ctr.items())]


def original(i,l):
    grouped = [str(l[k:k + i]) for k in range(0, len(l), i)] 
    if len(grouped[len(grouped) - 1]) != len(grouped[0]):
        grouped.pop(len(grouped) - 1)
    grouped_sort = sorted(grouped) 
    grouped_sort_values = Counter(grouped_sort).values() 
    freq_list = list(grouped_sort_values)
    return freq_list


def Paul_genzip(subseq_len, sequence):
    ctr = Counter(subseq for subseq in zip(*[iter(sequence)] * subseq_len))
    return [v for k, v in sorted(ctr.items())]


def constantstranger(i,l):
    freq_list = [0] * 2 ** i
    binaryTupToInt = {binTup:j for j, binTup in enumerate(product((0,1),repeat=i))}
    c = Counter(binaryTupToInt[tuple(l[k:k+i])] for k in range(0, len(l) // i * i, i))
    for k, v in c.items():
        freq_list[k] = v
    return freq_list


def Colim_numpy(i: int, l):
    l = np.array(l)
    iBits = l[:i*(l.size//i)].reshape(-1, i)@(2**np.arange(i-1,-1,-1).T)
    unique, counts = np.unique(iBits, return_counts=True)
    return counts.tolist()


funcs = [
    original,
    Colim_numpy,
    Paul_genzip,
    constantstranger,
    Kelly_NumPy,
    Kelly_bytes_count,
    Kelly_zip,
    Kelly_bytes_slices1,
    Kelly_bytes_slices2,
    Kelly_bytes_split1,
    Kelly_bytes_split2,
]

from time import time
import os
from collections import Counter
from itertools import repeat, chain, product
import numpy as np
from operator import itemgetter 
from statistics import mean, stdev

n = 10**6
i = 2

times = {f: [] for f in funcs}
def stats(f):
    ts = [t/n*1e8 for t in sorted(times[f])[:3]]
    return f'{mean(ts):4.1f} s ± {stdev(ts):3.1f} s '

for _ in range(10):
    l = [b % 2 for b in os.urandom(n)]
    expect = None
    for f in funcs:
        t = time()
        result = f(i, l)
        t = time() - t
        times[f].append(t)
        if expect is None:
             expect = result
        else:
            assert result == expect

for f in sorted(funcs, key=stats):
    print(stats(f), f.__name__,)

This is much faster. It uses Kelly's idea of using numpy.frombuffer instead of converting the list to a NumPy array, and uses pandas to count unique values, which is faster than numpy.unique for more than 100,000 results.

import numpy as np
import pandas as pd

def subseq_counter(i: int, l):
    l = np.frombuffer(bytes(l), np.int8)
    # int8 arithmetic keeps this fast but limits it to i <= 7 (2**7 overflows int8)
    weights = (2 ** np.arange(i - 1, -1, -1)).astype(np.int8)
    iBits = l[:i*(l.size//i)].reshape(-1, i) @ weights
    # bug fix: when there is not enough data (highly probable for large i),
    # iBits does not contain every possible value, so returning only the
    # unique values as a list may lose information
    answer = [0]*2**i  # empty counter including all possible values
    if len(iBits) > 100000:
        # Series.value_counts replaces the deprecated top-level pd.value_counts
        for value, count in pd.Series(iBits).value_counts().items():
            answer[value] = count
    else:
        unique, counts = np.unique(iBits, return_counts=True)
        for value, count in zip(unique, counts):
            answer[value] = count
    return answer
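
As a possible alternative for the zero-filled counter, and not something the answer above uses, `np.bincount` with its `minlength` argument returns a count for every value from 0 to 2**i - 1, including values that never occur. A minimal sketch, assuming the groups have already been encoded as non-negative integers; `counts_with_zeros` is a hypothetical name.

```python
import numpy as np


def counts_with_zeros(codes, i):
    """Frequency of every i-bit value 0 .. 2**i - 1, including values that never occur."""
    # np.bincount counts non-negative integers; minlength pads the result with zeros.
    return np.bincount(codes, minlength=2 ** i).tolist()


codes = np.array([2, 6, 1])  # encoded 3-bit groups of the question's example list
print(counts_with_zeros(codes, 3))  # [0, 1, 1, 0, 0, 0, 1, 0]
```

Whether this is faster than pandas for very large inputs would need to be measured.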

