
Python: How to find all the items greater than some number in an unsorted list (large data set)

Although similar questions have been asked by others (for example, here), they differed slightly and didn't really solve my problem, so here I go again.

I have N lists (N > 20,000) and each list contains M lists (M > 20,000), in the following manner (the data is dummy):

Key1: [ [4,3,1], [5,1,0] ...... [43,21,0 ] ]   # List 1 with collection of M smaller lists
:
:
KeyN: [ [5,4,1], [55,1,1] ...... [ 221, 0, 0] ] # Nth list

The data is unsorted. Iterating over a list of threshold values one by one, say Threshold = [2, 3, 5, 7, 8], where the threshold is applied to the middle element, I want to extract, for all the keys, all the elements whose middle element is greater than the threshold value. For example, going by the data I wrote above, Threshold = 2 would yield

 For Key1: [ [4,3,1], [43,21,0]]
 :
 : 
 For KeyN: [[5,4,1]]

And similarly for the other threshold values. Since there are so many lists, my observation is that sorting contributes a lot of overhead, and hence I want to avoid it. What is the optimum method of doing this in Python? One additional important point is that I am constructing the data myself, so possibly there is a better data structure to store the data in to begin with. I am currently storing the data in the form of PersistentList within a BTree container in ZODB, which was suggested here. Following is a snippet of the code used for it:

for Gnodes in G.nodes():      # Gnodes iterates over N values 
    Gvalue = someoperation(Gnodes)
    for Hnodes in H.nodes():  # Hnodes iterates over N values 
        Hvalue = someoperation(Hnodes, Gnodes)
        score = SomeOperation(Gvalue, Hvalue)  # placeholder for the scoring step
        btree_container.setdefault(Gnodes, PersistentList()).append([Hnodes, score, -1 ])
    transaction.savepoint(True)  
transaction.commit()

Any suggestions on what would be the most efficient way of doing this? Is sorting first indeed the optimum way?

Use a generator expression:

(sublist for sublist in Key1 if sublist[1] > Threshold)

A generator only computes elements on demand, and since it goes through the elements of the list in order, there's no need to sort. (That is, it runs in linear time in the length of each Key n, rather than M*log(M) for sorting.)
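
As a minimal sketch of the generator approach, here it is applied to the dummy data from the question. The names `Key1` and `Threshold` follow the question; `matches` and `result` are made up for illustration.

```python
# Dummy data from the question; the threshold applies to the middle element.
Key1 = [[4, 3, 1], [5, 1, 0], [43, 21, 0]]
Threshold = 2

# Lazily yield every sublist whose middle element exceeds the threshold.
matches = (sublist for sublist in Key1 if sublist[1] > Threshold)

# Nothing is computed until the generator is consumed.
result = list(matches)
print(result)  # [[4, 3, 1], [43, 21, 0]]
```

The items come out in their original order, which matches the expected output in the question.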

Equivalently, in functional style (only equivalent in Python 3; for Python 2, use itertools.ifilter):

filter(lambda sublist: sublist[1] > Threshold, Key1)

If your Key n lists are stored in a list (or other subscriptable object), you can process them all at once (some alternative styles shown):

filtered_Keys = [(sublist for sublist in Key if sublist[1] > Threshold)
    for Key in Keys
]

or

filtered_Keys = list(map(
    lambda Key: filter(lambda sublist: sublist[1] > Threshold, Key),
    Keys
))
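
One thing to watch when looping over several thresholds: a generator expression looks up `Threshold` when it is consumed, not when it is created, so generators built in a loop and consumed later would all see the last threshold. The sketch below sidesteps that by building lists eagerly. The dict layout of `Keys` and the names `Thresholds` and `results` are assumptions for illustration, not the ZODB structure from the question.

```python
# Hypothetical container: a dict mapping each key to its list of sublists.
Keys = {
    "Key1": [[4, 3, 1], [5, 1, 0], [43, 21, 0]],
    "KeyN": [[5, 4, 1], [55, 1, 1], [221, 0, 0]],
}
Thresholds = [2, 3, 5, 7, 8]

# List comprehensions are evaluated immediately, so each result is fixed
# at the threshold that was current when it was built.
results = {
    threshold: {
        name: [sublist for sublist in sublists if sublist[1] > threshold]
        for name, sublists in Keys.items()
    }
    for threshold in Thresholds
}

print(results[2]["Key1"])  # [[4, 3, 1], [43, 21, 0]]
print(results[2]["KeyN"])  # [[5, 4, 1]]
print(results[5]["Key1"])  # [[43, 21, 0]]
```

With N and M both above 20,000, holding every result in memory at once may be too much; processing one threshold at a time and consuming each result before moving on keeps the footprint smaller.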

Performance of this method relative to sorting 此方法相对于排序的性能

Whether this method is faster than sorting depends on M and the number of thresholds T you have. The running time (for each Key list) is O(M * T). If you sort the list (O(M * log(M))), then you can use binary search for each threshold, giving an overall running time of O(M * log(M) + T * log(M)) = O(max(M, T) * log(M)). Sorting is faster when T is sufficiently large relative to M. We can't know the constants a priori, so test both ways to see whether one is faster given your data.
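
The sort-then-binary-search alternative could look roughly like the following sketch, which uses the standard `bisect` module. The names `sorted_key`, `scores`, and `above` are made up for illustration.

```python
from bisect import bisect_right

Key1 = [[4, 3, 1], [5, 1, 0], [43, 21, 0]]
Thresholds = [2, 3, 5, 7, 8]

# Sort once by the middle element: O(M log M).
sorted_key = sorted(Key1, key=lambda sublist: sublist[1])
scores = [sublist[1] for sublist in sorted_key]

# For each threshold, binary search for the first score strictly greater
# than it (O(log M)); everything from that index onward qualifies.
above = {t: sorted_key[bisect_right(scores, t):] for t in Thresholds}

print(above[2])  # [[4, 3, 1], [43, 21, 0]]
print(above[3])  # [[43, 21, 0]]
```

Note that the results come back ordered by the middle element rather than in their original order, and that copying each slice still costs time proportional to the number of matches.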

If neither is fast enough, consider writing your own linear-time sort. For example, radix sort can be generalized to work on (non-negative) floats. If you're really concerned about performance here, you might have to write this as a C or Cython extension.
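
To illustrate the idea only, here is a sketch of a least-significant-digit radix sort on the middle element. It assumes the scores are non-negative integers, which the question does not guarantee; the function name `radix_sort_by_score` and the `bits_per_pass` parameter are hypothetical. In pure Python this is unlikely to beat the built-in `sorted`, which is why a C or Cython version is suggested above.

```python
def radix_sort_by_score(sublists, bits_per_pass=8):
    """LSD radix sort of sublists by their non-negative integer middle element."""
    if not sublists:
        return []
    mask = (1 << bits_per_pass) - 1
    max_score = max(sublist[1] for sublist in sublists)
    items = list(sublists)
    shift = 0
    while (max_score >> shift) > 0:
        # Stable bucket pass on the current group of bits.
        buckets = [[] for _ in range(mask + 1)]
        for sublist in items:
            buckets[(sublist[1] >> shift) & mask].append(sublist)
        items = [sublist for bucket in buckets for sublist in bucket]
        shift += bits_per_pass
    return items


# The last sublist is extra dummy data so that a second pass is needed.
sorted_items = radix_sort_by_score([[4, 3, 1], [5, 1, 0], [43, 21, 0], [7, 300, 0]])
print(sorted_items)  # [[5, 1, 0], [4, 3, 1], [43, 21, 0], [7, 300, 0]]
```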

In numpy you can do this easily with an NxMx3 array:

import numpy as np

data = np.array([
    [ [4,3,1], [5,1,0],  [43,21,0]    ],
    [ [5,4,1], [55,1,1], [ 221, 0, 0] ]
    ])
# Boolean mask over the middle element selects the matching sublists.
data[ data[:,:,1]>2 ]

This returns:

array([[ 4,  3,  1],
       [43, 21,  0],
       [ 5,  4,  1]])

If you need the locations of the elements that crossed the threshold, use argwhere().
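
A small sketch of the `argwhere()` variant on the same dummy data (the name `locations` is made up for illustration):

```python
import numpy as np

data = np.array([
    [[4, 3, 1], [5, 1, 0], [43, 21, 0]],
    [[5, 4, 1], [55, 1, 1], [221, 0, 0]],
])

# Each row is a (key index, position within key) pair whose middle element > 2.
locations = np.argwhere(data[:, :, 1] > 2)
print(locations.tolist())  # [[0, 0], [0, 2], [1, 0]]
```

The first column tells you which key each match belongs to, which the flat result of boolean indexing does not preserve.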

Edit:

It's also possible to do multiple threshold comparisons simultaneously:

>>> mask = data[:,:,1,np.newaxis] > np.array([[[2, 3, 4]]])
>>> data[mask[...,0]]
array([[ 4,  3,  1],
       [43, 21,  0],
       [ 5,  4,  1]])

>>> data[mask[...,1]]
array([[43, 21,  0],
       [ 5,  4,  1]])

>>> data[mask[...,2]]
array([[43, 21,  0]])
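
If the results need to stay grouped per key, the same broadcast mask can be indexed one key at a time. This is a sketch under the assumption that the whole array and an N x M x T boolean mask fit in memory; the names `thresholds` and `counts` are made up for illustration.

```python
import numpy as np

data = np.array([
    [[4, 3, 1], [5, 1, 0], [43, 21, 0]],
    [[5, 4, 1], [55, 1, 1], [221, 0, 0]],
])
thresholds = np.array([2, 3, 4])

# mask[n, m, t] is True when item m of key n exceeds thresholds[t].
mask = data[:, :, 1, np.newaxis] > thresholds

# Number of matching items per key and threshold: shape (N, T).
counts = mask.sum(axis=1)
print(counts.tolist())  # [[2, 1, 1], [1, 1, 0]]

# Items of the first key that exceed the first threshold.
print(data[0][mask[0, :, 0]].tolist())  # [[4, 3, 1], [43, 21, 0]]
```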
