Algorithm to convert a 2d binary list to decimal numbers in Python

I ran into a performance issue when I tried to convert a huge 2D list of binary data to decimal numbers.

Given a list:

biglist = [
  [[1,0],[0,0],[1,1]],
  [[0,0],[1,1],[1,0]],
  #...
  #easily go to thousands of rows
]

In each row, I want to combine the first element of each column and convert the result into a decimal number:

E.g.

In row 0, I want int('101',2), i.e. 5

In row 1, I want int('011',2), i.e. 3

My final goal is to create a dictionary that counts how many times each integer appears. For the example data above, the final result should be a dictionary of {a_int: appearance_count} pairs, like this:

{5: 1, 3: 1}

My current solution is this:

result = {}
for k in biglist:
    num = int("".join(str(row[0]) for row in k), 2)
    #count
    if num not in result:
        result[num] = 1
    else:
        result[num] += 1

This loop is slow for a list of thousands of rows. Is there a better solution?

First, converting a string to an int with join is convenient, but slow. Compute the value from the powers of 2 the classical way, using sum, enumerate and a bit shift on the ones (skipping the zeroes).

Second, you should use collections.Counter for the counting.

In one line:

import collections

result = collections.Counter(sum(v[0]<<(len(k)-i-1) for i,v in enumerate(k) if v[0]) for k in biglist)

This code runs 30% faster than your original code on my machine.
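
To make the one-liner easier to check, here is a sketch that unrolls it into a small helper and runs it on the two sample rows from the question. The helper name row_to_int is made up for illustration and is not part of the original answer.

```python
import collections

biglist = [
    [[1, 0], [0, 0], [1, 1]],
    [[0, 0], [1, 1], [1, 0]],
]

def row_to_int(row):
    """Hypothetical helper: weight the first element of each column by its power of two."""
    n = len(row)
    # Column i (0 = leftmost) contributes bit << (n - i - 1); zero bits are skipped.
    return sum(col[0] << (n - i - 1) for i, col in enumerate(row) if col[0])

result = collections.Counter(row_to_int(row) for row in biglist)
print(result)  # Counter({5: 1, 3: 1})
```

A Counter is a dict subclass, so result[5] and dict(result) behave as the question expects.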

Just collect the bits into an integer value instead of going through a string-to-int conversion:

for row in biglist:
    value = 0
    for col in row:
        # shift the accumulated bits left by one, then OR in the next bit
        value = (value << 1) | col[0]
        # equivalent arithmetic form: value = value * 2 + col[0]
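
The bit-collecting loop above can be wrapped into a function that also does the counting. The following is a minimal sketch: the name count_rows and the three-row sample are assumptions for illustration, and the last lines cross-check the result against the string-based approach from the question.

```python
from collections import Counter

def count_rows(biglist):
    """Hypothetical helper: count how often each row's integer value appears."""
    counts = Counter()
    for row in biglist:
        value = 0
        for col in row:
            # Shift the accumulated bits left and append the next bit.
            value = (value << 1) | col[0]
        counts[value] += 1
    return counts

sample = [
    [[1, 0], [0, 0], [1, 1]],  # bits 1,0,1 -> 5
    [[0, 0], [1, 1], [1, 0]],  # bits 0,1,1 -> 3
    [[1, 1], [0, 1], [1, 0]],  # bits 1,0,1 -> 5
]

counts = count_rows(sample)

# Cross-check against the join/int approach from the question.
expected = Counter(int("".join(str(col[0]) for col in row), 2) for row in sample)
assert counts == expected
```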

If you need performance, you should use numpy or numba, where all the low-level routines run at nearly C speed:

import numpy as np

# test data: 10**4 rows, 3 columns, 2 bits per column
bigarray=np.random.randint(0,2,10**4*3*2).reshape(10**4,3,2)
biglist=[[[e for e in B] for B in A] for A in bigarray]
# [[[1, 0], [0, 0], [1, 0]],
#  [[0, 0], [0, 1], [0, 1]], 
#  [[1, 0], [1, 0], [0, 0]], ...

def your_count(biglist):
    # the string-based conversion from the question (without the counting step)
    integers=[]
    for k in biglist:
        num = int("".join(str(row[0]) for row in k), 2)
        integers.append(num)
    return integers

def count_python(big):
    m=len(big)
    integers=np.empty(m,np.int32)
    for i in range(m):
        n=len(big[i])
        b=1   # weight of the current bit, starting at the last column
        s=0
        for j in range(n-1,-1,-1):
            s = s+big[i][j][0]*b
            b=b*2
        integers[i]=s
    return integers

def count_numpy(bigarray):
    # weights [4,2,1] assume exactly 3 columns per row
    integers=(bigarray[:,:,0]*[4,2,1]).sum(axis=1)
    return integers

from numba import njit
count_numba =njit(count_python)

And some tests:

In [125]: %timeit your_count(biglist)
145 ms ± 22.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [126]: %timeit count_python(biglist)
29.6 ms ± 1.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [127]: %timeit count_numpy(bigarray)
354 µs ± 10.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [128]: %timeit count_numba(bigarray)
73 µs ± 938 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Numba lets you compile a low-level version of some Python code (not yours, because Numba doesn't handle strings and lists, only numpy arrays). Numpy gives you special syntax to do fantastic things in one instruction, with good performance.

Here the Numba solution is about 2000x faster than yours.
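
The [4,2,1] weights in count_numpy only fit rows with exactly three columns. As a sketch of how the same idea might be generalized, the weights can be derived from the array shape. The function name count_numpy_general is an assumption for illustration, and the code assumes fewer than 63 columns so the values fit in a 64-bit integer.

```python
import numpy as np

def count_numpy_general(bigarray):
    """Hypothetical variant of count_numpy for any number of columns."""
    bits = bigarray[:, :, 0]                  # shape (rows, columns)
    n = bits.shape[1]
    # Powers of two from most to least significant, e.g. [4, 2, 1] for n == 3.
    weights = 1 << np.arange(n - 1, -1, -1)
    # Matrix-vector product: one weighted sum per row.
    return bits @ weights

bigarray = np.array([
    [[1, 0], [0, 0], [1, 1]],
    [[0, 0], [1, 1], [1, 0]],
])
print(count_numpy_general(bigarray))  # [5 3]
```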

The counts are efficiently computed by collections.Counter or np.unique:

In [150]: %timeit {k:v for k,v in zip(*np.unique(integers,return_counts=True))} 
46.4 µs ± 1.55 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [151]: %timeit Counter(integers)
218 µs ± 11.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
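
For reference, here is a small sketch of the np.unique approach on made-up data (the array contents are an assumption, not the benchmark data). With return_counts=True, np.unique returns the sorted unique values and their counts; casting both to int is an optional choice so that the resulting dictionary holds plain Python integers rather than NumPy scalars.

```python
from collections import Counter
import numpy as np

# Made-up sample of already converted row values.
integers = np.array([5, 3, 5, 7, 3, 5])

values, counts = np.unique(integers, return_counts=True)
result = {int(v): int(c) for v, c in zip(values, counts)}
print(result)  # {3: 2, 5: 3, 7: 1}

# Counter gives the same mapping when fed plain Python ints.
assert result == Counter(integers.tolist())
```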
