
Element-wise operation on a nested numpy array

Background

I have a nested numpy array and I want to:

  1. First, add a different random value to each scalar element of the nested numpy array
  2. Then, delete the values larger than 10.

...

[[1, 2, 3], [4, 5], [6, 7, 8]]
# (add a random value to each scalar element)
[[5.5, 6.7, 8.2], [4.1, -3.0], [16, -2, 7]]
# (remove elements larger than 10)
[[5.5, 6.7, 8.2], [4.1, -3.0], [-2, 7]]

Code:

import numpy as np

original_nested_array = np.array([np.array([1,2,3]), np.array([1,2]), np.array([3,2,1])], dtype=object)

# add a random value on each minimum element of original_nested_array
...
# Delete elements larger than fixed value, e.g. 10
...

The point is that the sub-arrays of my nested array have different lengths.

In the example above, the first element has length 3, the second has length 2, and the third has length 3. Thus, original_nested_array.shape equals (3,) instead of (3, 3), which makes element-wise or broadcasting operations harder.
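
For instance (a minimal check, reusing original_nested_array as defined above):

print(original_nested_array.shape)  # (3,) -- an object array holding three sub-arrays
print(original_nested_array.dtype)  # object, so there is no (3, 3) block to broadcast over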

You can compute the length of each part, compute the offset of each section (the number of items preceding the current item in a flattened representation), merge the parts with np.concatenate, add random numbers with a simple sum using np.random.randn, find the location of the maximum with np.argmax, delete that element from the flattened array, and update the section offsets before splitting the array back with np.split:

import numpy as np

len_of_parts = np.fromiter(map(len, original_nested_array), dtype=int)
part_sections = len_of_parts.cumsum()                     # end offset of each part in the flat array
all_values = np.concatenate(original_nested_array).astype(np.float64)
all_values += np.random.randn(all_values.size)            # add a different random value to each element
max_index = all_values.argmax()
all_values = np.delete(all_values, max_index)             # remove the largest element
part_sections[np.searchsorted(part_sections, max_index, 'right'):] -= 1  # shift the later offsets
output = np.split(all_values, part_sections[:-1])         # back to a list of sub-arrays

However, please do not use jagged arrays. They are clearly not efficient: Numpy is not designed to manipulate them efficiently or easily. In fact, the per-call overhead of Numpy functions is roughly multiplied by the number of items in the jagged array. Thus, a jagged array of 1000 items whose sub-items have an average size of 10 can be up to 1000 times slower to compute than one big flattened array (it is about 200 times slower on my machine in this case). Using a Python list is likely much faster in such a case (though still inefficient compared to one big array).

The efficient solution is to flatten the jagged array and keep an array of start/end sections defining the sub-arrays. This is especially much faster if you use Cython or Numba to compute operations that can hardly be done with Numpy alone.
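
As an illustration only (a minimal sketch assuming Numba is installed; the function name, the start/end layout, and the sample values are mine, not part of the answer's code), a per-section reduction over the flattened representation could look like this:

import numpy as np
from numba import njit

@njit
def section_max(flat_values, starts, ends):
    # Maximum of every sub-array, computed directly on the flat buffer.
    out = np.empty(starts.size, dtype=np.float64)
    for i in range(starts.size):
        m = flat_values[starts[i]]
        for j in range(starts[i] + 1, ends[i]):
            if flat_values[j] > m:
                m = flat_values[j]
        out[i] = m
    return out

flat = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 2.0, 1.0])  # hypothetical flattened jagged array
starts = np.array([0, 3, 5])                               # start offset of each sub-array
ends = np.array([3, 5, 8])                                 # end offset of each sub-array
print(section_max(flat, starts, ends))                     # [3. 2. 3.]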

Also note that delete operations are slow since a new array needs to be created (and almost fully copied). It is fine to use np.delete as long as it is not done in a loop (at least not a critical one); otherwise, the complexity can become much worse.
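
For example (a minimal sketch with made-up values matching the question's threshold of 10; the variable names are mine), a single boolean mask applied once replaces repeated np.delete calls and keeps the section offsets easy to update:

import numpy as np

flat = np.array([5.5, 6.7, 8.2, 4.1, -3.0, 16.0, -2.0, 7.0])  # flattened values (made-up)
part_sections = np.array([3, 5, 8])                            # end offsets of the three sub-arrays

keep = flat <= 10                                              # one boolean mask instead of repeated np.delete calls
kept_per_part = np.add.reduceat(keep.astype(np.int64), np.r_[0, part_sections[:-1]])
new_sections = kept_per_part.cumsum()                          # end offsets after filtering
parts = np.split(flat[keep], new_sections[:-1])                # [[5.5 6.7 8.2], [4.1 -3.], [-2. 7.]]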

from functools import partial
import numpy as np
import pandas as pd
from typing import Union

class NPTools:
    def __init__(self, nparray: Union[np.ndarray, list, tuple]):
        if isinstance(nparray, (list,tuple)):
            nparray = np.array(nparray)
        self.nparray = nparray

    def bb_get_size_of_biggest_element_in_list(self, lst: iter):
        return len(max(lst, key=len))

    def aa_adjust_2d_numpy_array(self, fillvalue=np.nan, dtype=np.object_):
        maxlen = self.bb_get_size_of_biggest_element_in_list(self.nparray)
        adjusted_lists = np.array(
            [
                *(
                    np.fromiter(x + (maxlen - len(x)) * [fillvalue], dtype=dtype)
                    for x in self.nparray
                )
            ]
        )
        self.nparray = adjusted_lists
        return self

    def aa_sort_2d_numpy_array(self):
        self.nparray = np.array([np.sort(lst) for lst in self.nparray])
        return self

    def bb_create_random_value_array_of_same_shape_with_int(
        self, startvalue: int, stopvalue: int
    ) -> np.ndarray:
        return np.random.randint(startvalue, stopvalue, self.nparray.shape)

    def bb_create_zero_array_of_same_shape(self) -> np.ndarray:
        return np.zeros(self.nparray.shape)

    def bb_slice_vertical(
        self, start_index: int = 0, stop_index: Union[int, None] = None
    ) -> np.ndarray:
        """[[2 2 3]
             [4 5 nan]
             [6 7 8]] -> [2 4 6]
        bb_slice_vertical(1,2)
        second_part = sorted_array[0:,1:2]
        rest : bb_slice_vertical = bb_slice_vertical[0:,1:]

        """
        if stop_index is None:
            return self.nparray[0:, start_index:]
        return self.nparray[0:, start_index:stop_index]

    def bb_delete_all_na_in_2d_array(self):
        return np.fromiter(
            map(
                lambda arraya: np.array([x for x in arraya if pd.notna(x)]),
                self.nparray,
            ),
            dtype=self.nparray.dtype,
        )

    def aa_apply_vector_function(self, function, arguments=None):
        if arguments is not None:
            applyfunction = partial(function, *arguments)
        else:
            applyfunction = partial(function)

        oct_array = np.frompyfunc(applyfunction, 1, 1)
        self.nparray = oct_array(self.nparray)
        return self

    def aa_sort_array(self):
        self.nparray = np.sort(self.nparray)
        return self

    @staticmethod
    def cc_delete_elements_if_smaller(comparevalue:Union[int,float], value:Union[int,float]) ->Union[int,float]:
        if pd.isna(value):
            return np.nan
        if value < comparevalue:
            return np.nan
        return value

    @staticmethod
    def cc_delete_elements_if_bigger(comparevalue:Union[int,float], value:Union[int,float])->Union[int,float]:
        if pd.isna(value):
            return np.nan
        if value > comparevalue:
            return np.nan
        return value

randomlist = [[3, 2, 2], [4, 5], [6, 8, 7]]

nptools_ = NPTools(np.array(randomlist, dtype=object))  # dtype=object is needed because the sub-lists have different lengths

vertical_slice = (nptools_.aa_adjust_2d_numpy_array(fillvalue=np.nan, dtype=np.object_) # adjust the array so that all lists have the same length
                  .aa_sort_2d_numpy_array() # sort all arrays so that the smallest element's index is 0 in every list/array
                  .bb_slice_vertical(start_index=0, stop_index=1)) # get all minimum values

print(f'{nptools_.nparray=}\n')

print(f'{vertical_slice=}\n')

vertical_slice += np.random.randint(0,10,vertical_slice.shape) # add some random numbers to the smallest number in each list

print(f'{nptools_.nparray=}\n')

nptools_.aa_apply_vector_function(function=NPTools.cc_delete_elements_if_bigger, arguments=[10]) # replace numbers bigger than 10 with np.nan

print(f'{nptools_.nparray=}\n')

finalarray = nptools_.bb_delete_all_na_in_2d_array() # get rid of all np.nans, so that only the values not bigger than 10 are left over

print(f'{finalarray=}\n')


output:

nptools_.nparray=array([[2, 2, 3],
       [4, 5, nan],
       [6, 7, 8]], dtype=object)
vertical_slice=array([[2],
       [4],
       [6]], dtype=object)
nptools_.nparray=array([[10, 2, 3],
       [11, 5, nan],
       [13, 7, 8]], dtype=object)
nptools_.nparray=array([[10, 2, 3],
       [nan, 5, nan],
       [nan, 7, 8]], dtype=object)
finalarray=array([array([10,  2,  3]), array([5]), array([7, 8])], dtype=object)
