
Filtering empty strings from lists vs arrays

I have a list that contains some empty strings:

test = ['foo', '', 'bar', '', 'baz']

The following code strips the empty strings and returns the desired output:

list(filter(None, test))
Out:['foo', 'bar', 'baz']

When I turn the list into a numpy array, applying the same function by mapping does not work:

import numpy as np

test = np.array(['foo', '', 'bar', '', 'baz'], dtype='<U15')

def g(x):
    return list(filter(None, x))

def array_map(x):
    return np.array(list(map(g, x)))

array_map(test)
Out: array([list(['f', 'o', 'o']), list([]), list(['b', 'a', 'r']), list([]),
       list(['b', 'a', 'z'])], dtype=object)

Why does this happen, and what is the correct, simple method to remove empty strings from a numpy array?

When I turn the list into a numpy array, applying the same function by mapping does not work

Right; the function g already turns your source sequence into the list that you want to make an array out of, so there is no reason to do any mapping.

Why does this happen

Mapping g onto test means that g is called separately with each element of test. The elements of test are strings; when list(filter(None, x)) is evaluated with x being one of the strings from test, filter iterates over the characters of that string. All of those characters pass the filter, so a list is made that contains them. The mapped version of test, therefore, contains a bunch of lists of characters, which is then passed to np.array.
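The difference shows up when filter is called once on the whole sequence versus once per element. Here is a minimal sketch in plain Python; the names whole and per_element are illustrative only and do not appear in the question:

```python
test = ['foo', '', 'bar', '', 'baz']

# One call on the whole sequence: filter sees five strings
# and drops the two empty (falsy) ones.
whole = list(filter(None, test))

# One call per element, which is what map(g, test) does:
# filter now sees the individual characters of each string.
per_element = [list(filter(None, s)) for s in test]

print(whole)        # ['foo', 'bar', 'baz']
print(per_element)  # [['f', 'o', 'o'], [], ['b', 'a', 'r'], [], ['b', 'a', 'z']]
```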

and what is the correct, simple method to remove empty strings from a numpy array?

Well, if you wanted to do it with filter, you would pass the NumPy array to a single call to filter and then construct a new array from the result. However, np.array will not iterate over the resulting filter object automatically, so you have to materialize it first, e.g. as a list. Thus:

>>> np.array(list(filter(None, test)), dtype='<U15')
array(['foo', 'bar', 'baz'], dtype='<U15')

(Notice that the dtype needs to be specified explicitly if you want it preserved; otherwise NumPy will infer the smallest type that suffices for the data.)
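Both caveats can be checked directly. The sketch below assumes a reasonably recent NumPy; the variable names wrapped, inferred and explicit are made up for illustration:

```python
import numpy as np

test = np.array(['foo', '', 'bar', '', 'baz'], dtype='<U15')

# Passing the filter object itself does not iterate it: NumPy wraps
# the object in a 0-dimensional array of dtype object.
wrapped = np.array(filter(None, test))
print(wrapped.shape, wrapped.dtype)  # () object

# Materializing a list first gives a proper 1-D string array.
inferred = np.array(list(filter(None, test)))                    # dtype inferred from the data
explicit = np.array(list(filter(None, test)), dtype=test.dtype)  # dtype carried over from test

print(inferred.dtype)  # width shrinks to the longest remaining string
print(explicit.dtype)  # same dtype as the original array
```

Reusing test.dtype rather than retyping the literal '<U15' is just one way to keep the two arrays in sync.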

However, it is better to use NumPy tools for this task. The idiomatic way to remove data from an array is to create a boolean mask that matches the elements you want to keep, and index with that:

>>> test[test != '']
array(['foo', 'bar', 'baz'], dtype='<U15')

(If you want to remove everything that's false-ish, i.e. that would fail to satisfy an if condition, you can use the somewhat awkwardly named nonzero method: test[test.nonzero()].)
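To make the masking step explicit, here is a small sketch that builds the mask separately before indexing. The names mask and kept are illustrative only:

```python
import numpy as np

test = np.array(['foo', '', 'bar', '', 'baz'], dtype='<U15')

# The comparison is applied elementwise and yields a boolean array
# with one entry per string.
mask = test != ''
print(mask)  # [ True False  True False  True]

# Indexing with the mask keeps only the positions where it is True.
# The result is a new array with the same dtype as test.
kept = test[mask]
print(kept.tolist())  # ['foo', 'bar', 'baz']
```

Because the mask is an ordinary array, it can be reused, for instance to select the matching rows of another array of the same length.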

In [714]: test = ['foo', '', 'bar', '', 'baz']                                                       

I like the expressiveness of list comprehensions:

In [715]: [s for s in test if s]                                                                     
Out[715]: ['foo', 'bar', 'baz']

This comprehension also works with an array, though it will be slower:

In [716]: aTest=np.array(test)                                                                       
In [717]: aTest                                                                                      
Out[717]: array(['foo', '', 'bar', '', 'baz'], dtype='<U3')
In [718]: np.array([s for s in aTest if s])                                                          
Out[718]: array(['foo', 'bar', 'baz'], dtype='<U3')

An element of the array is truth-tested the same way as an element of a list.

filter acts the same way:

In [724]: list(filter(None, test))                                                                   
Out[724]: ['foo', 'bar', 'baz']
In [725]: list(filter(None, aTest))                                                                  
Out[725]: ['foo', 'bar', 'baz']

Your two-function approach ends up applying list to each string, splitting it into characters. The outer map passes one string at a time to g, not the whole list:

In [728]: def g(x): 
     ...:     return list(filter(None,x))                                                                                      
In [729]: list(map(g,test))                                                                          
Out[729]: [['f', 'o', 'o'], [], ['b', 'a', 'r'], [], ['b', 'a', 'z']]
In [732]: [list(s) for s in test]                                                                    
Out[732]: [['f', 'o', 'o'], [], ['b', 'a', 'r'], [], ['b', 'a', 'z']]
In [734]: list(g(test[0]))                                                                           
Out[734]: ['f', 'o', 'o']

As pointed out in the other answer, you can do the array filtering without a Python-level iteration:

In [736]: aTest==''                                                                                  
Out[736]: array([False,  True, False,  True, False])
In [737]: aTest[aTest!='']                                                                           
Out[737]: array(['foo', 'bar', 'baz'], dtype='<U3')

For this small sample the list comprehension is fastest. I expect, though, that with a list/array of 1000 strings the array approach will scale better.

In [740]: timeit [s for s in test if s]                                                              
398 ns ± 10.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [741]: timeit aTest[aTest!='']                                                                    
3.99 µs ± 158 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [743]: timeit list(filter(None, test))                                                            
503 ns ± 9.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
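The scaling expectation can be tested with a rough sketch like the one below. The 1000-element sample (big_list, big_array) is a hypothetical stand-in, not data from the question, and no particular timings are claimed since they depend on the machine and the NumPy version:

```python
import timeit

import numpy as np

# Hypothetical larger sample: 1000 strings, every other one empty.
big_list = ['foo', ''] * 500
big_array = np.array(big_list)

by_comprehension = [s for s in big_list if s]
by_mask = big_array[big_array != '']

# Both approaches keep the same 500 non-empty strings.
assert by_comprehension == by_mask.tolist()

# Time each approach over the same number of calls.
calls = 1000
for label, func in [
    ('list comprehension', lambda: [s for s in big_list if s]),
    ('boolean mask', lambda: big_array[big_array != '']),
]:
    seconds = timeit.timeit(func, number=calls)
    print(f'{label}: {seconds / calls * 1e6:.1f} µs per call')
```

Note that this compares the two operations only; converting a list to an array with np.array has its own cost, which matters if the data starts out as a list.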
