Count number of identical values in a column within a numpy array
I'm looking for a solution to the following problem:
Suppose I have an array with shape (4,4):
[5. 4. 5. 4.]
[2. 3. 5. 5.]
[2. 1. 5. 1.]
[1. 3. 1. 3.]
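In this array there is a column in which the value "5" appears 3 times in a row. That is, the values are contiguous, not scattered across the column as shown below.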
[5.] # This
[1.] # Should
[5.] # Not
[5.] # Count
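Now suppose I have a larger array of shape (M,N) with various integer values in the same range of 1-5. How would I count the maximum number of identical values occurring in a row in each column? Furthermore, is it possible to get the indices where these values occur? The expected output for the example above would be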
Found 3 in a row of number 5 in column 2
(0,2), (1,2), (2,2)
I assume the implementation would be similar if the search were along rows instead. If not, I'd love to know how to do that as well.
Approach #1
Here's one approach -
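import numpy as np

def find_longest_island_indices(a, values):
    # Pad a row of zeros above and below so every island has a start and an end.
    # This assumes 0 is not among the searched values.
    b = np.pad(a, ((1,1),(0,0)), 'constant')
    shp = np.array(b.shape)[::-1] - [0,1]
    maxlens = []
    final_out = []
    for v in values:
        m = b==v
        # Flat positions, taken column by column, where the mask flips
        idx = np.flatnonzero((m[:-1] != m[1:]).T)
        s0,s1 = idx[::2], idx[1::2]
        l = s1-s0
        # argmax raises ValueError if v does not occur in a at all
        maxidx = l.argmax()
        longest_island_flatidx = np.r_[s0[maxidx]:s1[maxidx]]
        r,c = np.unravel_index(longest_island_flatidx, shp)
        final_out.append(np.c_[c,r])
        maxlens.append(l[maxidx])
    return maxlens, final_out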
Sample run -
In [169]: a
Out[169]:
array([[5, 4, 5, 4],
[2, 3, 5, 5],
[2, 1, 5, 1],
[1, 3, 1, 3]])
In [173]: maxlens
Out[173]: [1, 2, 1, 1, 3]
In [174]: out
Out[174]:
[array([[3, 0]]), array([[1, 0],
[2, 0]]), array([[1, 1]]), array([[0, 1]]), array([[0, 2],
[1, 2],
[2, 2]])]
# With "pretty" printing
In [171]: maxlens, out = find_longest_island_indices(a, [1,2,3,4,5])
   ...: for l,o,i in zip(maxlens,out,[1,2,3,4,5]):
   ...:     print("For "+str(i)+" : L= "+str(l)+", Idx = "+str(o.tolist()))
For 1 : L= 1, Idx = [[3, 0]]
For 2 : L= 2, Idx = [[1, 0], [2, 0]]
For 3 : L= 1, Idx = [[1, 1]]
For 4 : L= 1, Idx = [[0, 1]]
For 5 : L= 3, Idx = [[0, 2], [1, 2], [2, 2]]
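The boundary-detection trick inside find_longest_island_indices may be easier to follow on a single column. The following is a minimal sketch, not part of the original answer; the helper name island_bounds and the sample column are made up for illustration.

```python
import numpy as np

def island_bounds(mask):
    """Start (inclusive) and stop (exclusive) indices of each run of True in a 1D mask."""
    # Pad with False on both sides so every run has both a start and a stop
    padded = np.r_[False, mask, False]
    # Positions where the mask flips; thanks to the padding they come in (start, stop) pairs
    idx = np.flatnonzero(padded[:-1] != padded[1:])
    return idx[::2], idx[1::2]

col = np.array([5, 5, 1, 5, 5, 5])      # hypothetical column
starts, stops = island_bounds(col == 5)
lengths = stops - starts                # run lengths of the value 5
longest = lengths.argmax()              # position of the longest run
print(starts[longest], stops[longest], lengths[longest])
```

The function above applies the same idea to all columns at once by padding the 2D array and flattening the transposed flip mask.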
Approach #2
With a few modifications, and outputting the start and end indices of the max-length island, here's another -
def find_longest_island_indices_v2(a, values):
    # Work on the transpose so each column of a becomes a zero-padded row of b
    b = np.pad(a.T, ((0,0),(1,1)), 'constant')
    shp = b.shape
    out = []
    for v in values:
        m = b==v
        # Flat positions where the mask flips
        idx = np.flatnonzero(m.flat[:-1] != m.flat[1:])
        s0,s1 = idx[::2], idx[1::2]
        l = s1-s0
        maxidx = l.argmax()
        # Reverse the unraveled (col, row) pairs to get (row, col) in a
        start_index = np.unravel_index(s0[maxidx], shp)[::-1]
        end_index = np.unravel_index(s1[maxidx]-1, shp)[::-1]
        maxlen = l[maxidx]
        out.append([v,maxlen, start_index, end_index])
    return out
Sample run -
In [251]: a
Out[251]:
array([[5, 4, 5, 4],
[2, 3, 5, 5],
[2, 1, 5, 1],
[1, 3, 1, 3]])
In [252]: out = find_longest_island_indices_v2(a, [1,2,3,4,5])
In [255]: out
Out[255]:
[[1, 1, (3, 0), (3, 0)],
[2, 2, (1, 0), (2, 0)],
[3, 1, (1, 1), (1, 1)],
[4, 1, (0, 1), (0, 1)],
[5, 3, (0, 2), (2, 2)]]
# With some pandas styled printing
In [253]: import pandas as pd
In [254]: pd.DataFrame(out, columns=['Val','MaxLen','StartIdx','EndIdx'])
Out[254]:
Val MaxLen StartIdx EndIdx
0 1 1 (3, 0) (3, 0)
1 2 2 (1, 0) (2, 0)
2 3 1 (1, 1) (1, 1)
3 4 1 (0, 1) (0, 1)
4 5 3 (0, 2) (2, 2)
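To get from these records to the message format asked for in the question, something along these lines might work. This is only a sketch: the helper name describe is hypothetical, and the records are copied from the sample run above rather than recomputed.

```python
def describe(record):
    # record is [value, max_length, (start_row, start_col), (end_row, end_col)]
    val, maxlen, (r0, c0), (r1, _) = record
    cells = ", ".join("({},{})".format(r, c0) for r in range(r0, r1 + 1))
    return "Found {} in a row of number {} in column {}\n{}".format(maxlen, val, c0, cells)

# Records copied from the sample run above
out = [[1, 1, (3, 0), (3, 0)],
       [2, 2, (1, 0), (2, 0)],
       [3, 1, (1, 1), (1, 1)],
       [4, 1, (0, 1), (0, 1)],
       [5, 3, (0, 2), (2, 2)]]

best = max(out, key=lambda rec: rec[1])  # record with the longest run over all values
message = describe(best)
print(message)
```

If several values tie for the longest run, max returns the first of them.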
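If we store the maximum length of a run of identical values in a column in a variable, then we can iterate through the array looking for longer runs.
If the following needs more explanation, just say!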
import numpy as np

a = np.array([[5,4,5,4],[2,3,5,5],[2,1,5,1],[1,3,1,3]])
rows, cols = a.shape
max_length = 0
for ci in range(cols):
    for ri in range(rows):
        if ri > 0 and a[ri,ci] == a[ri-1,ci]: #during run
            length += 1
        else: #start of a new run (top of column or value changed)
            start_pos = (ri, ci)
            length = 1
        if length > max_length: #checked every step, so runs reaching the column end count too
            max_length = length
            max_pos = start_pos

max_row, max_col = max_pos
print('Found {} in a row of number {} in column {}'.format(max_length, a[max_pos], max_col))
for i in range(max_length):
    print((max_row+i, max_col))
Output:
Found 3 in a row of number 5 in column 2
(0, 2)
(1, 2)
(2, 2)
Note that if you want the tuples printed on one line, as in the format you stated, you can use a generator expression with str.join:
print(', '.join(str((max_row+i, max_col)) for i in range(max_length)))
Another approach is to use itertools.groupby, as suggested by @user. A possible implementation is as follows:
import numpy as np
from itertools import groupby

def runs(column):
    max_run_length, start, indices, max_value = -1, 0, 0, 0
    for val, run in groupby(column):
        run_length = sum(1 for _ in run)
        if run_length > max_run_length:
            max_run_length, start, max_value = run_length, indices, val
        indices += run_length
    return max_value, max_run_length, start
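The function above computes, for a given column (or row), the value of the longest run, its length, and its start index. With these values you can build the expected output. groupby does all the heavy lifting. For the array [5., 5., 5., 1.], the expression
[(val, sum(1 for _ in run)) for val, run in groupby([5., 5., 5., 1.])]
evaluates to [(5.0, 3), (1.0, 1)]. The loop keeps track of the start index, length, and value of the longest run. To apply the function to the columns you can use numpy.apply_along_axis: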
data = np.array([[5., 4., 5., 4.],
                 [2., 3., 5., 5.],
                 [2., 1., 5., 1.],
                 [1., 3., 1., 3.]])
result = [tuple(row) for row in np.apply_along_axis(runs, 0, data).T]
print(result)
Output:
[(2.0, 2.0, 1.0), (4.0, 1.0, 0.0), (5.0, 3.0, 0.0), (4.0, 1.0, 0.0)]
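In the output above, the third tuple corresponds to the third column: the value of the longest consecutive run is 5, its length is 3, and it starts at index 0. To work on rows instead of columns, change the axis argument to 1 and drop the .T, as follows: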
result = [tuple(row) for row in np.apply_along_axis(runs, 1, data)]
Output:
[(5.0, 1.0, 0.0), (5.0, 2.0, 2.0), (2.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
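As a rough sketch of how the per-column results could be turned into the output requested in the question: the helper longest_column_run below is hypothetical, not part of the original answer, and runs is repeated (with the counter renamed to pos) only so the snippet is self-contained.

```python
import numpy as np
from itertools import groupby

def runs(column):
    # Same logic as the function above: (value, length, start) of the longest run
    max_run_length, start, pos, max_value = -1, 0, 0, 0
    for val, run in groupby(column):
        run_length = sum(1 for _ in run)
        if run_length > max_run_length:
            max_run_length, start, max_value = run_length, pos, val
        pos += run_length
    return max_value, max_run_length, start

def longest_column_run(data):
    # Hypothetical helper: longest run over all columns, plus the cells it covers
    per_column = [runs(data[:, c]) for c in range(data.shape[1])]
    col = max(range(len(per_column)), key=lambda c: per_column[c][1])
    value, length, start = per_column[col]
    cells = [(start + i, col) for i in range(length)]
    return value, length, col, cells

data = np.array([[5., 4., 5., 4.],
                 [2., 3., 5., 5.],
                 [2., 1., 5., 1.],
                 [1., 3., 1., 3.]])
value, length, col, cells = longest_column_run(data)
print('Found {} in a row of number {} in column {}'.format(length, int(value), col))
print(', '.join('({},{})'.format(r, c) for r, c in cells))
```

For a row-wise search, the same helper could be called on data.T, with the two entries of each index tuple swapped afterwards.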