Slice List of floats by value in Python

I have a list of several thousand floats that I want to be able to slice by min and max values.

E.g. using:

flist = [1.9842, 9.8713, 5.4325, 7.6855, 2.3493, 3.3333]

(my actual list is 400,000 floats long, but the above is a working example)

I want something like

def listclamp(minn, maxn, nlist):

such that

print(listclamp(3, 8, flist))

should give me

[3.3333, 5.4325, 7.6855]

I also need to do this 10,000 to 30,000 times, so speed does count.

(I have no sample code for what I've tried so far, because this is new Python territory for me.)

The obvious thing to do is either sort then filter, or filter then sort.

If you have the same list every time, sorting first is obviously a win, because then you only need to sort once instead of every time. It also means you can use a binary search for the filtering instead of a linear walk (as explained in ventsyv's answer), although that probably won't pay off unless your lists are much longer than this one.

If you have different lists every time, filtering first is probably a win, because the sort is probably the slow part, and you're sorting a smaller list that way.

But let's stop speculating and start testing.

Using a list of several thousand floats, about half of which are in range:

In [1591]: flist = [random.random()*10 for _ in range(5000)]
In [1592]: %timeit sorted(x for x in flist if 3 <= x < 8)
100 loops, best of 3: 3.12 ms per loop
In [1593]: %timeit [x for x in sorted(flist) if 3 <= x < 8]
100 loops, best of 3: 4 ms per loop
In [1594]: %timeit l=sorted(flist); l[bisect.bisect_left(l, 3):bisect.bisect_right(l, 8)]
100 loops, best of 3: 3.36 ms per loop

So, filtering then sorting wins; ventsyv's algorithm does make up for part of the difference, but not all of it. But of course if we only have a single list to sort, sorting it once instead of thousands of times is an obvious win:

In [1596]: l = sorted(flist)
In [1597]: %timeit l[bisect.bisect_left(l, 3):bisect.bisect_right(l, 8)]
10000 loops, best of 3: 29.2 µs per loop

So, if you have the same list over and over, obviously sort it once.

Otherwise, you could test on your real data… but we're talking about shaving up to 22% off of something that takes milliseconds. Even if you do it many thousands of times, that's saving you under a second. Just the cost of typing the different implementations (much less understanding them, generalizing them, debugging them, and performance testing them) is more than that.


But really, if you're doing millions of operations spread over hundreds of thousands of values, and speed is important, you shouldn't be using a list in the first place, you should be using a NumPy array. NumPy can store just the raw float values, without boxing them up as Python objects. Besides saving memory (and improving cache locality), this means that the inner loop in, say, np.sort is faster than the inner loop in sorted, because it doesn't have to make a Python function call that ultimately involves unboxing two numbers, it just has to do a comparison directly.

Assuming you're storing your values in an array in the first place, how does it stack up?

In [1607]: flist = np.random.random(5000) * 10
In [1608]: %timeit a = np.sort(flist); a = a[3 <= a]; a = a[a < 8]
1000 loops, best of 3: 742 µs per loop
In [1610]: b = np.sort(flist)
In [1611]: %timeit c = b[3 <= b]; d = c[c < 8]
10000 loops, best of 3: 29.8 µs per loop

So, it's about 4x faster than filter-and-sort for the "different lists" case, even using a clunky algorithm (I was looking for something I could cram onto one %timeit line, rather than the fastest or most readable…). And for the "same list over and over" case, it's almost as fast as the bisect solution even without bisecting (but of course you can bisect with NumPy, too).
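For the "same array over and over" case, NumPy's `np.searchsorted` plays the role of `bisect` on a pre-sorted array. A sketch, assuming the half-open bounds `[3, 8)` used in the timings above (the variable names are illustrative, not from the original answer):

```python
import numpy as np

arr = np.random.default_rng(0).random(5000) * 10

# Fresh array each time: boolean mask first, then sort the (smaller) result.
fresh = np.sort(arr[(3 <= arr) & (arr < 8)])

# Same array over and over: sort once, then slice via binary search.
s = np.sort(arr)                     # do this once
lo, hi = np.searchsorted(s, [3, 8])  # default side='left' gives the [3, 8) slice
same = s[lo:hi]
```

Both produce identical results, since `searchsorted` with `side='left'` returns the first index at which each bound could be inserted while keeping the array sorted.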

Sort the list (if you use the same list over and over, sort it just once), then use binary search to find the position of the lower and upper bounds. Come to think of it, there is a package that does that: bisect.
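The sort-once-then-bisect approach described above might look like this (a sketch assuming inclusive bounds, to match the question's expected output; `listclamp_sorted` is an illustrative name):

```python
import bisect

flist = [1.9842, 9.8713, 5.4325, 7.6855, 2.3493, 3.3333]

slist = sorted(flist)  # sort once, reuse for every query

def listclamp_sorted(minn, maxn, slist):
    """Slice an already-sorted list to the values in [minn, maxn]."""
    lo = bisect.bisect_left(slist, minn)   # first index with slist[i] >= minn
    hi = bisect.bisect_right(slist, maxn)  # first index with slist[i] > maxn
    return slist[lo:hi]

print(listclamp_sorted(3, 8, slist))  # [3.3333, 5.4325, 7.6855]
```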

This will return the sorted list you want:

flist = [1.9842, 9.8713, 5.4325, 7.6855, 2.3493, 3.3333]

def listclamp(minn, maxn, nlist):
    return sorted(filter(lambda x: minn <= x <= maxn, nlist))

print(listclamp(3, 8, flist))

A faster approach, using a list comprehension:

def listclamp2(minn, maxn, nlist):
    return sorted([x for x in nlist if minn <= x <= maxn])

print(listclamp2(3, 8, flist))

Note that depending on your data it may be better to filter the list first and then sort it (as I did in the code above).

For more information on performance, refer to this link.
