Element-wise operation on nested numpy array

Background

I have a nested numpy array and I want to:

  1. First, add a different random value to each scalar element of the nested numpy array.
  2. Then, delete the values larger than 10.

... ...

[[1, 2, 3], [4, 5], [6, 7, 8]]
# (add a random value to each scalar element)
[[5.5, 6.7, 8.2], [4.1, -3.0], [16, -2, 7]]
# (remove elements larger than 10)
[[5.5, 6.7, 8.2], [4.1, -3.0], [-2, 7]]

Code:

import numpy as np

original_nested_array = np.array([np.array([1, 2, 3]), np.array([1, 2]), np.array([3, 2, 1])], dtype=object)

# add a random value on each minimum element of original_nested_array
...
# Delete elements larger than fixed value, e.g. 10
...

The point is that the elements of my nested array have different lengths.

In the example above, the first element has length 3, the second has length 2, and the third has length 3. Thus, original_nested_array.shape equals (3,) instead of (3, 3), which makes element-wise or broadcasting operations harder.
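
A quick way to see the problem (my own sketch, not part of the original question): the outer array is one-dimensional with dtype=object, so there is no (3, 3) numeric block to broadcast against.

import numpy as np

a = np.array([np.array([1, 2, 3]), np.array([1, 2]), np.array([3, 2, 1])], dtype=object)
print(a.shape)  # (3,) -- one Python object per row, not a (3, 3) numeric block
print(a.dtype)  # object
# np.random.randn(*a.shape) therefore yields only 3 random values (one per row),
# not one per scalar element, so you cannot directly add a different random value to every scalar.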

You can compute the length of each part, compute the offset of each section (the number of items preceding the current item in a flattened representation), merge the parts with np.concatenate, add random numbers with a simple sum using np.random.randn, find the location of the maximum with np.argmax, delete that element from the flattened array, and update the section offsets before splitting the array back with np.split:

len_of_parts = np.fromiter(map(len, original_nested_array), dtype=int)
part_sections = len_of_parts.cumsum()  # exclusive end offset of each section
all_values = np.concatenate(original_nested_array).astype(np.float64)
all_values += np.random.randn(all_values.size)  # one random value per scalar element
max_index = all_values.argmax()
all_values = np.delete(all_values, max_index)
part_sections[np.searchsorted(part_sections, max_index, 'right'):] -= 1  # decrement the end offsets at and after the affected section
output = np.split(all_values, part_sections[:-1])
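
The block above removes only the maximum value. A small variation of the same flatten-and-split idea (my own sketch, not from the answer) removes every value above the threshold from the question and rebuilds the section offsets with a boolean mask, using the question's example values:

import numpy as np

parts = np.array([np.array([5.5, 6.7, 8.2]), np.array([4.1, -3.0]), np.array([16.0, -2.0, 7.0])], dtype=object)
threshold = 10

part_sections = np.fromiter(map(len, parts), dtype=int).cumsum()  # [3, 5, 8]
all_values = np.concatenate(parts).astype(np.float64)

keep = all_values <= threshold        # mask of the values to keep
removed_before = np.cumsum(~keep)     # number of removed items up to each position
new_sections = part_sections - removed_before[part_sections - 1]
output = np.split(all_values[keep], new_sections[:-1])
# -> [array([5.5, 6.7, 8.2]), array([ 4.1, -3. ]), array([-2.,  7.])]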

However, please do not use jagged arrays. They are clearly not efficient: Numpy is simply not designed to manipulate them efficiently or easily. In fact, the overhead of a Numpy function is roughly multiplied by the number of items in the jagged array. Thus, a jagged array of 1000 items whose sub-items have an average size of 10 can be up to 1000 times slower to compute with than one big flattened array (it is about 200 times slower on my machine in this case). Using a Python list is likely much faster in such a case (but still inefficient compared to one big array).

The efficient solution is to flatten the jagged array and keep an array of start/end sections defining the sub-arrays. This is especially much faster if you use Cython or Numba to compute operations that can hardly be done with Numpy alone.
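
As a rough illustration of that layout (my own sketch, assuming Numba is available; count_above_per_section is a hypothetical helper, not part of either answer), a compiled kernel can walk the flat value array section by section using the cumulative end offsets:

import numpy as np
import numba as nb

@nb.njit(cache=True)
def count_above_per_section(values, section_ends, threshold):
    # For each section [start, end), count how many values exceed the threshold.
    counts = np.zeros(section_ends.size, dtype=np.int64)
    start = 0
    for i in range(section_ends.size):
        end = section_ends[i]
        for j in range(start, end):
            if values[j] > threshold:
                counts[i] += 1
        start = end
    return counts

# e.g. count_above_per_section(all_values, part_sections, 10.0) with the arrays built above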

Also note that delete operations are slow, since a new array needs to be created (and almost entirely copied). It is fine to use np.delete as long as it is not called in a loop (at least not a critical one); otherwise, the complexity can become much worse.
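
A tiny contrast of the two patterns (my own sketch, with hypothetical data): calling np.delete once per element copies the array every time, while a single boolean mask copies it once.

import numpy as np

values = np.random.randn(1_000) * 10

# Slow: one np.delete call per removed element -> one full copy per call.
slow = values.copy()
for idx in np.flatnonzero(slow > 10)[::-1]:   # delete from the end so earlier indices stay valid
    slow = np.delete(slow, idx)

# Fast: build the mask once, copy once.
fast = values[values <= 10]
assert np.array_equal(slow, fast)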

from functools import partial
import numpy as np
import pandas as pd
from typing import Union

class NPTools:
    def __init__(self, nparray:Union[np.array, np.ndarray, list,tuple]):
        if isinstance(nparray, (list,tuple)):
            nparray = np.array(nparray)
        self.nparray = nparray

    def bb_get_size_of_biggest_element_in_list(self, lst: iter):
        return len(max(lst, key=len))

    def aa_adjust_2d_numpy_array(self, fillvalue=np.nan, dtype=np.object_):
        maxlen = self.bb_get_size_of_biggest_element_in_list(self.nparray)
        ajustedlists = np.array(
            [
                *(
                    np.fromiter(x + (maxlen - len(x)) * [fillvalue], dtype=dtype)
                    for x in self.nparray
                )
            ]
        )
        self.nparray = ajustedlists
        return self

    def aa_sort_2d_numpy_array(self):
        self.nparray = np.array([np.sort(lst) for lst in self.nparray])
        return self

    def bb_create_random_value_array_of_same_shape_with_int(
        self, startvalue: int, stopvalue: int
    ) -> np.array:
        return np.random.randint(startvalue, stopvalue, self.nparray.shape)

    def bb_create_zero_array_of_same_shape(self) -> np.array:
        return np.zeros(self.nparray.shape)

    def bb_slice_vertical(
        self, start_index: int = 0, stop_index: Union[int, None] = None
    ) ->np.array:
        """[[2 2 3]
             [4 5 nan]
             [6 7 8]] -> [2 4 6]
        bb_slice_vertical(1,2)
        second_part = sorted_array[0:,1:2]
        rest : bb_slice_vertical = bb_slice_vertical[0:,1:]

        """
        if stop_index is None:
            return self.nparray[0:, start_index:]
        return self.nparray[0:, start_index:stop_index]

    def bb_delete_all_na_in_2d_array(self):
        return np.fromiter(
            map(
                lambda arraya: np.array([x for x in arraya if pd.notna(x)]),
                self.nparray,
            ),
            dtype=self.nparray.dtype,
        )

    def aa_apply_vetor_function(self, function, arguments=None):
        if arguments is not None:
            applyfunction = partial(function, *arguments)
        else:
            applyfunction = partial(function)

        oct_array = np.frompyfunc(applyfunction, 1, 1)
        self.nparray = oct_array(self.nparray)
        return self

    def aa_sort_array(self):
        self.nparray = np.sort(self.nparray)
        return self

    @staticmethod
    def cc_delete_elements_if_smaller(comparevalue:Union[int,float], value:Union[int,float]) ->Union[int,float]:
        if pd.isna(value):
            return np.nan
        if value < comparevalue:
            return np.nan
        return value

    @staticmethod
    def cc_delete_elements_if_bigger(comparevalue:Union[int,float], value:Union[int,float])->Union[int,float]:
        if pd.isna(value):
            return np.nan
        if value > comparevalue:
            return np.nan
        return value

randomlist = [[3, 2, 2], [4, 5], [6, 8, 7]]

# dtype=object is needed for ragged input on recent numpy versions
nptools_ = NPTools(np.array(randomlist, dtype=object))

vertical_slice = (nptools_.aa_adjust_2d_numpy_array(fillvalue=np.nan, dtype=np.object_)  # adjust the array so that all lists have the same length
                  .aa_sort_2d_numpy_array()  # sort all arrays so that the smallest element is at index 0 in each list
                  .bb_slice_vertical(start_index=0, stop_index=1))  # get all the minimum values

print(f'{nptools_.nparray=}\n')

print(f'{vertical_slice=}\n')

vertical_slice += np.random.randint(0,10,vertical_slice.shape) # add some random numbers to the smallest number in each list

print(f'{nptools_.nparray=}\n')

nptools_.aa_apply_vetor_function(function=NPTools.cc_delete_elements_if_bigger, arguments=[10]) #return np.nan if the number is bigger than 10

print(f'{nptools_.nparray=}\n')

finalarray = nptools_.bb_delete_all_na_in_2d_array()  # get rid of all np.nan values, so that only the values not bigger than 10 are left over

print(f'{finalarray=}\n')


output:

nptools_.nparray=array([[2, 2, 3],
       [4, 5, nan],
       [6, 7, 8]], dtype=object)
vertical_slice=array([[2],
       [4],
       [6]], dtype=object)
nptools_.nparray=array([[10, 2, 3],
       [11, 5, nan],
       [13, 7, 8]], dtype=object)
nptools_.nparray=array([[10, 2, 3],
       [nan, 5, nan],
       [nan, 7, 8]], dtype=object)
finalarray=array([array([10,  2,  3]), array([5]), array([7, 8])], dtype=object)
