
Apply function to each list in numpy array of lists

I have a function that accepts a list of strings. It does some processing on that list and returns another list of strings, possibly shorter.

Now, I have a numpy array whose elements are input lists of strings. I want to apply this transformation function to each list in my array.

From the searching I have done so far, vectorize or apply_along_axis seemed like good candidates, but neither works as expected.

I'd like to do this as efficiently as possible. Ultimately the input array will contain on the order of 100K lists.

I suppose I could iterate over the numpy array in a for loop and append each output list to a new output array one at a time, but that seems horribly inefficient.

Here is what I've tried. For testing purposes, I've made a dumbed-down transformation function, and the input array contains just 3 lists.

import numpy as np

def my_func(l):
    # accepts list, returns another list
    # dumbed down list transformation function
    # for testing, just return the first 2 elems of original list
    return l[0:2]

# dtype=object is required for ragged input on recent NumPy versions
test_arr = np.array([['the', 'quick', 'brown', 'fox'], ['lorem', 'ipsum'], ['this', 'is', 'a', 'test']], dtype=object)

np.apply_along_axis(my_func, 0, test_arr)
Out[51]: array([['the', 'quick', 'brown', 'fox'], ['lorem', 'ipsum']], dtype=object)

# Rather than applying item by item, this returns the first 2 elements of the entire outer array!!

# Expected:
# array([['the', 'quick'], ['lorem', 'ipsum'], ['this', 'is']])
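
The result above follows from the array's shape. A minimal sketch, assuming the array is built with dtype=object (which recent NumPy versions require for ragged input): the array is 1-D with 3 elements, so the only 1-D slice along axis 0 is the whole array, and my_func ends up slicing the outer array rather than each inner list.

```python
import numpy as np

def my_func(l):
    # same test function as in the question
    return l[0:2]

test_arr = np.array(
    [['the', 'quick', 'brown', 'fox'], ['lorem', 'ipsum'], ['this', 'is', 'a', 'test']],
    dtype=object,
)

print(test_arr.shape)  # 1-D: three elements, each a Python list

# With a 1-D array, the single slice along axis 0 is the array itself,
# so the function is effectively called once on the whole array.
whole = my_func(test_arr)

# Applying the function per element instead gives the expected lists.
per_item = [my_func(x) for x in test_arr]
print(whole)
print(per_item)
```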

# Attempt 2...

my_func_vec = np.vectorize(my_func)
my_func_vec(test_arr)

Result:

Traceback (most recent call last):

  File "<ipython-input-56-f9bbacee645c>", line 1, in <module>
    my_func_vec(test_arr)

  File "C:\Users\Tony\Anaconda2\lib\site-packages\numpy\lib\function_base.py", line 2218, in __call__
    return self._vectorize_call(func=func, args=vargs)

  File "C:\Users\Tony\Anaconda2\lib\site-packages\numpy\lib\function_base.py", line 2291, in _vectorize_call
    copy=False, subok=True, dtype=otypes[0])

ValueError: cannot set an array element with a sequence

The docstring of vectorize describes the optional argument otypes:

otypes : str or list of dtypes, optional
    The output data type. It must be specified as either a string of
    typecode characters or a list of data type specifiers. There should
    be one data type specifier for each output.

It lets you declare the output type explicitly, which also solves your problem, where each array element is a list.

my_func_vec = np.vectorize(my_func, otypes=[list])

Some comparisons and time tests; keep in mind that this is a small example.

In [106]: test_arr = np.array([['the', 'quick', 'brown', 'fox'], ['lorem', 'ipsum'], ['this', 'is', 'a', 'test']], dtype=object)
     ...: 
In [107]: def my_func(l):
     ...:     # accepts list, returns another list
     ...:     # dumbed down list transformation function
     ...:     # for testing, just return the first 2 elems of original list
     ...:     return l[0:2]
     ...: 

The list comprehension method returns a 2d array of strings, because the function returns a 2-element list each time.

In [108]: np.array([my_func(x) for x in test_arr])
Out[108]: 
array([['the', 'quick'],
       ['lorem', 'ipsum'],
       ['this', 'is']],
      dtype='<U5')
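
The 2d string result above depends on every returned list having the same length. If the real function returns lists of varying length, one option is to preallocate an object array and fill it element by element. This is only a sketch: `shorten` is a hypothetical stand-in for the real transformation, not something from the question.

```python
import numpy as np

def shorten(words):
    # hypothetical transformation: keep only words longer than 4 characters,
    # so the output lists can differ in length
    return [w for w in words if len(w) > 4]

test_arr = np.array(
    [['the', 'quick', 'brown', 'fox'], ['lorem', 'ipsum'], ['this', 'is', 'a', 'test']],
    dtype=object,
)

# Preallocate a 1-D object array; each slot will hold one Python list.
out = np.empty(len(test_arr), dtype=object)
for i, item in enumerate(test_arr):
    out[i] = shorten(item)

print(out)
```

Assigning through an integer index stores each list as a single element, so the result stays 1-D regardless of the lengths returned.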

The input array is object dtype because the sublists differ in length:

In [109]: test_arr
Out[109]: 
array([list(['the', 'quick', 'brown', 'fox']), list(['lorem', 'ipsum']),
       list(['this', 'is', 'a', 'test'])], dtype=object)

frompyfunc returns an object dtype array. Consistent with my past tests, it is modestly faster than the list comprehension (up to about 2x, but never an order of magnitude):

In [110]: np.frompyfunc(my_func,1,1)(test_arr)
Out[110]: 
array([list(['the', 'quick']), list(['lorem', 'ipsum']),
       list(['this', 'is'])], dtype=object)

In [111]: timeit np.frompyfunc(my_func,1,1)(test_arr)
5.68 µs ± 230 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [112]: timeit np.array([my_func(x) for x in test_arr])
8.96 µs ± 25.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

vectorize uses frompyfunc but has more overhead. otypes is needed to avoid the sequence error (otherwise vectorize tries to deduce the return type from a trial calculation):

In [113]: np.vectorize(my_func,otypes=[object])(test_arr)
Out[113]: 
array([list(['the', 'quick']), list(['lorem', 'ipsum']),
       list(['this', 'is'])], dtype=object)
In [114]: timeit np.vectorize(my_func,otypes=[object])(test_arr)
30.4 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
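
The timings above come from a 3-element array, so they may not reflect the 100K-list case from the question. A rough sketch for repeating the comparison at that size is below; the synthetic input (lists of 2 to 4 made-up words) is an assumption for illustration, and actual timings will depend on the machine and the real transformation.

```python
import timeit
import numpy as np

def my_func(l):
    # same test function as above
    return l[0:2]

# Synthetic input: 100K lists of 2 to 4 placeholder words (assumed data).
n = 100_000
big = np.empty(n, dtype=object)
for i in range(n):
    big[i] = ['w%d' % j for j in range(2 + i % 3)]

ufunc = np.frompyfunc(my_func, 1, 1)
vec = np.vectorize(my_func, otypes=[object])

candidates = {
    'list comprehension': lambda: [my_func(x) for x in big],
    'frompyfunc': lambda: ufunc(big),
    'vectorize': lambda: vec(big),
}

for name, fn in candidates.items():
    # best of 3 single runs, reported in milliseconds
    seconds = min(timeit.repeat(fn, number=1, repeat=3))
    print(f'{name}: {seconds * 1e3:.1f} ms')
```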
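
[my_func(x) for x in test_arr]

You need to go one level down: your solution returns only the first 2 items of the array, rather than the first 2 items of each list in the array.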
