How to find run-length encoding in Python
I have an array ar = [2,2,2,1,1,2,2,3,3,3,3]. For this array, I want to find the lengths of the runs of consecutive equal numbers, for example:
values: 2, 1, 2, 3
lengths: 3, 2, 2, 4
In R, this is obtained with the rle() function. Is there an existing function in Python that gives the required output?
You can do this with groupby:
In [60]: from itertools import groupby
In [61]: ar = [2,2,2,1,1,2,2,3,3,3,3]
In [62]: print([(k, sum(1 for _ in g)) for k, g in groupby(ar)])
[(2, 3), (1, 2), (2, 2), (3, 4)]
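If the values and lengths are wanted as two separate sequences, as R's rle() returns them, the same groupby idea can be wrapped in a small helper. This is a minimal sketch; the function name `rle` is chosen here for illustration and is not a standard-library function.

```python
from itertools import groupby


def rle(seq):
    """Return (values, lengths) for runs of consecutive equal items.

    Hypothetical helper name, modelled on the output layout of R's rle().
    """
    values, lengths = [], []
    for key, group in groupby(seq):
        values.append(key)
        # groupby yields an iterator per run; count its items
        lengths.append(sum(1 for _ in group))
    return values, lengths


ar = [2, 2, 2, 1, 1, 2, 2, 3, 3, 3, 3]
values, lengths = rle(ar)
print(values)   # [2, 1, 2, 3]
print(lengths)  # [3, 2, 2, 4]
```

Because groupby only groups adjacent equal items, the second run of 2s is reported separately from the first, which is exactly what run-length encoding needs.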
Here is an answer that uses the high-performance pyrle library for run-length arithmetic:
# pip install pyrle
# (pyrle >= 0.0.25)
from pyrle import Rle
v = [2,2,2,1,1,2,2,3,3,3,3]
r = Rle(v)
print(r)
# +--------+-----+-----+-----+-----+
# | Runs | 3 | 2 | 2 | 4 |
# |--------+-----+-----+-----+-----|
# | Values | 2 | 1 | 2 | 3 |
# +--------+-----+-----+-----+-----+
# Rle of length 11 containing 4 elements
print(r[4])
# 1.0
print(r[4:7])
# +--------+-----+-----+
# | Runs | 1 | 2 |
# |--------+-----+-----|
# | Values | 1.0 | 2.0 |
# +--------+-----+-----+
# Rle of length 3 containing 2 elements
r + r + 0.5
# +--------+-----+-----+-----+-----+
# | Runs | 3 | 2 | 2 | 4 |
# |--------+-----+-----+-----+-----|
# | Values | 4.5 | 2.5 | 4.5 | 6.5 |
# +--------+-----+-----+-----+-----+
# Rle of length 11 containing 4 elements
Here is a pure numpy answer:
import numpy as np

def find_runs(x):
    """Find runs of consecutive items in an array."""
    # ensure array
    x = np.asanyarray(x)
    if x.ndim != 1:
        raise ValueError('only 1D array supported')
    n = x.shape[0]
    # handle empty array
    if n == 0:
        return np.array([]), np.array([]), np.array([])
    else:
        # find run starts
        loc_run_start = np.empty(n, dtype=bool)
        loc_run_start[0] = True
        np.not_equal(x[:-1], x[1:], out=loc_run_start[1:])
        run_starts = np.nonzero(loc_run_start)[0]
        # find run values
        run_values = x[loc_run_start]
        # find run lengths
        run_lengths = np.diff(np.append(run_starts, n))
        return run_values, run_starts, run_lengths
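Going the other way, a run-length encoding can be expanded back into the original array with np.repeat. The sketch below is self-contained, so it includes a compact encoder of its own; the names `encode_runs` and `decode_runs` are illustrative and are not part of numpy.

```python
import numpy as np


def encode_runs(x):
    """Compact run-length encoder: return (values, lengths) for a 1D array."""
    x = np.asarray(x)
    if x.size == 0:
        return x, np.array([], dtype=np.intp)
    # a run starts at index 0 and wherever an item differs from its predecessor
    is_start = np.concatenate(([True], x[1:] != x[:-1]))
    starts = np.flatnonzero(is_start)
    lengths = np.diff(np.append(starts, x.size))
    return x[starts], lengths


def decode_runs(values, lengths):
    """Inverse of encode_runs: repeat each value by its run length."""
    return np.repeat(values, lengths)


ar = np.array([2, 2, 2, 1, 1, 2, 2, 3, 3, 3, 3])
values, lengths = encode_runs(ar)
restored = decode_runs(values, lengths)
print(values)   # [2 1 2 3]
print(lengths)  # [3 2 2 4]
print(np.array_equal(restored, ar))  # True
```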
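Here is an optimized answer using numpy arrays; it runs quickly when the run lengths are long.
In this case I wanted to encode a uint16 array, which can be much larger than 2**16 elements, using 16-bit unsigned integer run lengths. To allow this, the array is "chunked" so that the indices never exceed 2**16: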
import numpy as np

def run_length_encode(array, chunksize=((1 << 16) - 1), dtype=np.uint16):
    "Chunked run length encoding for very large arrays containing smallish values."
    # dtype is unsigned 16-bit so run lengths up to chunksize (65535) fit
    shape = array.shape
    ravelled = array.ravel()
    length = len(ravelled)
    chunk_cursor = 0
    runlength_chunks = []
    while chunk_cursor < length:
        chunk_end = chunk_cursor + chunksize
        chunk = ravelled[chunk_cursor : chunk_end]
        chunk_length = len(chunk)
        # positions where a value differs from its successor
        change = (chunk[:-1] != chunk[1:])
        change_indices = np.nonzero(change)[0]
        nchanges = len(change_indices)
        cursor = 0
        # one (value, run length) row per run in this chunk
        runlengths = np.zeros((nchanges + 1, 2), dtype=dtype)
        for (count, index) in enumerate(change_indices):
            next_cursor = index + 1
            runlengths[count, 0] = chunk[cursor] # value
            runlengths[count, 1] = next_cursor - cursor # run length
            cursor = next_cursor
        # last run
        runlengths[nchanges, 0] = chunk[cursor]
        runlengths[nchanges, 1] = chunk_length - cursor
        runlength_chunks.append(runlengths)
        chunk_cursor = chunk_end
    all_runlengths = np.vstack(runlength_chunks).astype(dtype)
    description = dict(
        shape=shape,
        runlengths=all_runlengths,
        dtype=dtype,
    )
    return description

def run_length_decode(description):
    dtype = description["dtype"]
    runlengths = description["runlengths"]
    shape = description["shape"]
    array = np.zeros(shape, dtype=dtype)
    ravelled = array.ravel()
    cursor = 0
    for (value, size) in runlengths:
        # convert to a Python int so the cursor cannot overflow 16 bits
        run_end = cursor + int(size)
        ravelled[cursor : run_end] = value
        cursor = run_end
    array = ravelled.reshape(shape) # redundant?
    return array

def testing():
    A = np.zeros((50,), dtype=np.uint16)
    A[20:30] = 10
    A[30:35] = 6
    A[40:] = 3
    test = run_length_encode(A, chunksize=17)
    B = run_length_decode(test)
    assert np.array_equal(A, B)
    print("ok!")

if __name__ == "__main__":
    testing()
I built this for a project related to classifying microscopy images of mouse embryos.
https://github.com/flatironinstitute/mouse_embryo_labeller
Note: I edited this entry after finding that I had to cast the type on this line for it to work with large arrays:
all_runlengths = np.vstack(runlength_chunks).astype(dtype)
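The chunking above keeps every run length within 16 bits by cutting the input before encoding. A different way to get the same guarantee, shown here only as a sketch and not as the author's method, is to encode first and then split any oversized run into several runs of the same value. The helper name `split_long_runs` is hypothetical, and the sketch assumes every input run length is at least 1.

```python
import numpy as np

UINT16_MAX = np.iinfo(np.uint16).max  # 65535


def split_long_runs(values, lengths, max_len=UINT16_MAX):
    """Split runs longer than max_len into pieces that each fit in uint16.

    Hypothetical helper; assumes max_len <= 65535 and every length is >= 1.
    """
    values = np.asarray(values)
    lengths = np.asarray(lengths, dtype=np.int64)
    # number of pieces per run (ceiling division)
    pieces = -(-lengths // max_len)
    out_values = np.repeat(values, pieces)
    # start with every piece full, then fix up the last piece of each run
    out_lengths = np.full(pieces.sum(), max_len, dtype=np.int64)
    last = np.cumsum(pieces) - 1
    out_lengths[last] = lengths - (pieces - 1) * max_len
    return out_values, out_lengths.astype(np.uint16)


# one run of 200000 sevens followed by one run of 5 threes
vals, lens = split_long_runs([7, 3], [200000, 5])
print(vals)  # [7 7 7 7 3]
print(lens)  # [65535 65535 65535  3395     5]
```

The total length is unchanged (200005 elements), so np.repeat(vals, lens) still reproduces the original array.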