
How to find run-length encoding in Python

I have an array ar = [2,2,2,1,1,2,2,3,3,3,3]. For this array, I want to find the lengths of the runs of consecutive identical numbers, for example:

 values: 2, 1, 2, 3
lengths: 3, 2, 2, 4

In R, this is obtained with the rle() function. Is there an existing function in Python that provides the required output?

You can do this with groupby:

In [60]: from itertools import groupby
In [61]: ar = [2,2,2,1,1,2,2,3,3,3,3]
In [62]: print([(k, sum(1 for i in g)) for k, g in groupby(ar)])
[(2, 3), (1, 2), (2, 2), (3, 4)]
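
The question asks for values and lengths as two separate sequences, as R's rle() returns them. A minimal sketch of that shape, built on the same groupby call; the function name `rle` is only an illustrative choice, not an existing Python built-in:

```python
from itertools import groupby


def rle(seq):
    """Return (values, lengths) for runs of consecutive equal items."""
    values, lengths = [], []
    for key, group in groupby(seq):
        values.append(key)
        # groupby yields a lazy iterator per run, so count its items
        lengths.append(sum(1 for _ in group))
    return values, lengths


values, lengths = rle([2, 2, 2, 1, 1, 2, 2, 3, 3, 3, 3])
print("values: ", values)   # [2, 1, 2, 3]
print("lengths:", lengths)  # [3, 2, 2, 4]
```

Because groupby only groups adjacent equal items, the two separate runs of 2 stay separate, which is what run-length encoding needs.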

Here is an answer using the high-performance pyrle library for run-length arithmetic:

# pip install pyrle
# (pyrle >= 0.0.25)

from pyrle import Rle

v = [2,2,2,1,1,2,2,3,3,3,3]

r = Rle(v)
print(r)
# +--------+-----+-----+-----+-----+
# | Runs   | 3   | 2   | 2   | 4   |
# |--------+-----+-----+-----+-----|
# | Values | 2   | 1   | 2   | 3   |
# +--------+-----+-----+-----+-----+
# Rle of length 11 containing 4 elements

print(r[4])
# 1.0

print(r[4:7])
# +--------+-----+-----+
# | Runs   | 1   | 2   |
# |--------+-----+-----|
# | Values | 1.0 | 2.0 |
# +--------+-----+-----+
# Rle of length 3 containing 2 elements

r + r + 0.5
# +--------+-----+-----+-----+-----+
# | Runs   | 3   | 2   | 2   | 4   |
# |--------+-----+-----+-----+-----|
# | Values | 4.5 | 2.5 | 4.5 | 6.5 |
# +--------+-----+-----+-----+-----+
# Rle of length 11 containing 4 elements

Here is a pure numpy answer:

import numpy as np


def find_runs(x):
    """Find runs of consecutive items in an array."""

    # ensure array
    x = np.asanyarray(x)
    if x.ndim != 1:
        raise ValueError('only 1D array supported')
    n = x.shape[0]

    # handle empty array
    if n == 0:
        return np.array([]), np.array([]), np.array([])

    else:
        # find run starts
        loc_run_start = np.empty(n, dtype=bool)
        loc_run_start[0] = True
        np.not_equal(x[:-1], x[1:], out=loc_run_start[1:])
        run_starts = np.nonzero(loc_run_start)[0]

        # find run values
        run_values = x[loc_run_start]

        # find run lengths
        run_lengths = np.diff(np.append(run_starts, n))

        return run_values, run_starts, run_lengths

Credit to https://github.com/alimanfoo
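
Once run values and run lengths are available, the original array can be rebuilt with np.repeat. The following is a compact sketch of that round trip; `rle_encode` and `rle_decode` are hypothetical helper names, and the encoder is a shortened variant of the approach above that returns only values and lengths, not the function from the answer itself:

```python
import numpy as np


def rle_encode(x):
    """Hypothetical helper: return (values, lengths) for a 1D array."""
    x = np.asarray(x)
    if x.size == 0:
        return x, np.array([], dtype=int)
    # Index 0 always starts a run; other runs start where neighbours differ.
    starts = np.concatenate(([0], np.flatnonzero(x[1:] != x[:-1]) + 1))
    # Each run ends where the next one starts (or at the end of the array).
    lengths = np.diff(np.concatenate((starts, [x.size])))
    return x[starts], lengths


def rle_decode(values, lengths):
    """Expand each value by its run length to recover the array."""
    return np.repeat(values, lengths)


ar = np.array([2, 2, 2, 1, 1, 2, 2, 3, 3, 3, 3])
values, lengths = rle_encode(ar)
print(values)   # [2 1 2 3]
print(lengths)  # [3 2 2 4]
assert np.array_equal(rle_decode(values, lengths), ar)
```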

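Here is an optimized answer using numpy arrays. It runs quickly when the run lengths are long.

In this case I wanted to encode a uint16 array, which can be much larger than 2**16 elements, using 16-bit unsigned integer run-length encoding. To allow this, the array is "chunked" so that the indices never exceed 2**16.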

import numpy as np

def run_length_encode(array, chunksize=((1 << 16) - 1), dtype=np.uint16):
    "Chunked run length encoding for very large arrays containing smallish values."
    shape = array.shape
    ravelled = array.ravel()
    length = len(ravelled)
    chunk_cursor = 0
    runlength_chunks = []
    while chunk_cursor < length:
        chunk_end = chunk_cursor + chunksize
        chunk = ravelled[chunk_cursor : chunk_end]
        chunk_length = len(chunk)
        change = (chunk[:-1] != chunk[1:])
        change_indices = np.nonzero(change)[0]
        nchanges = len(change_indices)
        cursor = 0
        runlengths = np.zeros((nchanges + 1, 2), dtype=dtype)
        for (count, index) in enumerate(change_indices):
            next_cursor = index + 1
            runlengths[count, 0] = chunk[cursor] # value
            runlengths[count, 1] = next_cursor - cursor # run length
            cursor = next_cursor
        # last run
        runlengths[nchanges, 0] = chunk[cursor]
        runlengths[nchanges, 1] = chunk_length - cursor
        runlength_chunks.append(runlengths)
        chunk_cursor = chunk_end
    all_runlengths = np.vstack(runlength_chunks).astype(dtype)
    description = dict(
        shape=shape,
        runlengths=all_runlengths,
        dtype=dtype,
        )
    return description

def run_length_decode(description):
    dtype = description["dtype"]
    runlengths = description["runlengths"]
    shape = description["shape"]
    array = np.zeros(shape, dtype=dtype)
    ravelled = array.ravel()
    cursor = 0
    for (value, size) in runlengths:
        run_end = cursor + size
        ravelled[cursor : run_end] = value
        cursor = run_end
    array = ravelled.reshape(shape)  # redundant?
    return array

def testing():
    A = np.zeros((50,), dtype=np.uint16)
    A[20:30] = 10
    A[30:35] = 6
    A[40:] = 3
    test = run_length_encode(A, chunksize=17)
    B = run_length_decode(test)
    assert np.array_equal(A, B)  # np.alltrue was removed in NumPy 2.0
    print("ok!")

if __name__=="__main__":
    testing()

I built this for a project related to classifying microscopy images of mouse embryos.

https://github.com/flatironinstitute/mouse_embryo_labeller

Note: I edited this entry after finding that I had to cast the type in this line for it to work with large arrays:

all_runlengths = np.vstack(runlength_chunks).astype(dtype)
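
To illustrate why the 16-bit limit matters, here is a small sketch of what happens to a run longer than a uint16 can describe: it has to be stored as several shorter runs of the same value. The helper `split_long_run` is hypothetical and is not part of the code above, which reaches a similar result by chunking the input before encoding rather than splitting runs afterwards.

```python
import numpy as np

# Largest run length that fits in an unsigned 16-bit integer.
MAX_RUN = np.iinfo(np.uint16).max  # 65535


def split_long_run(value, length, max_run=MAX_RUN):
    """Hypothetical helper: split one run into pieces of at most max_run."""
    pieces = []
    while length > 0:
        piece = min(length, max_run)
        pieces.append((value, piece))
        length -= piece
    return pieces


# A run of 150000 zeros does not fit in a single uint16 length,
# so it is stored as three shorter runs of the same value.
pieces = split_long_run(0, 150_000)
print(pieces)  # [(0, 65535), (0, 65535), (0, 18930)]

# Every piece now fits the storage dtype without overflow.
encoded = np.array(pieces, dtype=np.uint16)
assert int(encoded[:, 1].sum(dtype=np.int64)) == 150_000
```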

