
Python: Memory efficient sort of a list of tuples by two elements

I have a very large list of tuples that I would like to sort by two elements. For example:

List = [('chr1', 34234, 'extrainfo'), ('chr1', 1234, 'extrainfo'), ('chr3', 4234, 'extrainfo'), ('chr1', 3241, 'extrainfo')]

This is a really large list and I wanted to sort it using:

List = sorted(List, key=lambda i: (i[0], int(i[1])))

This works well with smaller lists such as the example above. However, when I run my code on my much larger datasets I get memory errors:

Python(32306) malloc: *** mmap(size=34684928) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "MyCode.py", line 139, in <module>
    List = sorted(List, key=lambda i: (i[0], int(i[1])))
MemoryError

Some things you can try, roughly in order of difficulty/desirability.

  • Don't create a sorted copy of the list using sorted(). Instead, sort the list in place using List.sort().

  • Sort the list twice, first with key=lambda i: i[1] and then with key=lambda i: i[0]. This will take longer, but the list of keys will require less space on each pass. Python's sort is guaranteed stable in v2.2 and later. Sorting on the keys in reverse order of their importance is how we used to do it back when we could only sort on one key at a time.

  • Don't use a key function at all. Sorting by the items in a tuple in order is the default behavior! You don't care about the order of the third and subsequent items, so why not just let Python go ahead and sort on them? They'll be in order too, but that's as good as any order. (This won't work if the other elements are types that don't support comparison.)

  • Use a cmp function rather than a key function if your version of Python is old enough to support it. This will avoid generating a list of keys, but will be slower and won't work in Python 3.

  • Use a 64-bit version of Python on a 64-bit OS on a machine with plenty of memory.

  • Implement your own sort.
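The first three suggestions above can be sketched as follows. This is a minimal illustration on the small example from the question, not a benchmark; the names `records`, `in_place`, `two_pass` and `no_key` are made up for the demo, and the `list(records)` copies exist only so the three variants can be compared side by side. On a real data set you would call `.sort()` on the big list itself.

```python
# Small stand-in for the large list from the question.
records = [
    ('chr1', 34234, 'extrainfo'),
    ('chr1', 1234, 'extrainfo'),
    ('chr3', 4234, 'extrainfo'),
    ('chr1', 3241, 'extrainfo'),
]

# Suggestion 1: sort in place, so no second copy of the list is built.
in_place = list(records)  # demo copy only
in_place.sort(key=lambda r: (r[0], int(r[1])))

# Suggestion 2: two stable passes, least significant key first.
# Each pass computes one simple key per record instead of a tuple.
two_pass = list(records)  # demo copy only
two_pass.sort(key=lambda r: int(r[1]))  # secondary key: position
two_pass.sort(key=lambda r: r[0])       # primary key: chromosome name

# Suggestion 3: no key function; tuples compare element by element.
# This assumes the positions are already ints, as in the example.
no_key = list(records)  # demo copy only
no_key.sort()
```

All three variants produce the same order for this data. The third one would differ if the positions were stored as strings, because then they would compare lexicographically rather than numerically.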

You may have more luck using NumPy structured arrays for this, as they are faster than lists for large data sets:

http://docs.scipy.org/doc/numpy/user/basics.rec.html

http://docs.scipy.org/doc/numpy/reference/generated/numpy.sort.html
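As a rough sketch of the structured-array idea, assuming NumPy is installed: the field names (`chrom`, `pos`, `info`) and the string widths are assumptions chosen to fit the example data, not something given in the question.

```python
import numpy as np

records = [
    ('chr1', 34234, 'extrainfo'),
    ('chr1', 1234, 'extrainfo'),
    ('chr3', 4234, 'extrainfo'),
    ('chr1', 3241, 'extrainfo'),
]

# Assumed layout: fixed-width unicode fields and a 64-bit integer.
# 'U5' holds up to 5 characters, 'U9' up to 9; longer strings are truncated.
record_dtype = [('chrom', 'U5'), ('pos', 'i8'), ('info', 'U9')]

arr = np.array(records, dtype=record_dtype)

# Sort in place by chromosome, then by position.
arr.sort(order=['chrom', 'pos'])
```

Converting an existing list this way briefly holds both the list and the array in memory, so for a very large data set it is probably better to fill the array directly while reading the input.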

You have 2 options: 1. Increase the size of RAM. 2. Try to process a little data at a time, especially if you are doing operations on corpora or texts, as appears to be the case.
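One way to process a little data at a time is an external merge sort: sort fixed-size chunks, spill each to a temporary file, then merge the sorted chunks lazily. The sketch below is an illustration under assumptions, not a tuned implementation; the helper names (`sort_key`, `external_sort`, `chunk_size`) are hypothetical, and it only saves memory if the records come from a stream such as a file rather than from a list that is already fully in memory.

```python
import heapq
import pickle
import tempfile
from itertools import islice


def sort_key(record):
    # Same ordering as in the question: name first, then numeric position.
    return (record[0], int(record[1]))


def _dump_sorted_chunk(chunk):
    # Sort one chunk in memory and write it to an anonymous temp file.
    chunk.sort(key=sort_key)
    f = tempfile.TemporaryFile()
    for record in chunk:
        pickle.dump(record, f)
    f.seek(0)
    return f


def _read_records(f):
    # Yield pickled records back one at a time until the file ends.
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            return


def external_sort(records, chunk_size=100_000):
    """Yield records in sorted order, holding about one chunk in memory."""
    records = iter(records)
    files = []
    while True:
        chunk = list(islice(records, chunk_size))
        if not chunk:
            break
        files.append(_dump_sorted_chunk(chunk))
    try:
        # heapq.merge keeps only one pending record per chunk in memory.
        yield from heapq.merge(*(_read_records(f) for f in files), key=sort_key)
    finally:
        for f in files:
            f.close()


demo = [
    ('chr1', 34234, 'extrainfo'),
    ('chr1', 1234, 'extrainfo'),
    ('chr3', 4234, 'extrainfo'),
    ('chr1', 3241, 'extrainfo'),
]
result = list(external_sort(demo, chunk_size=2))
```

The `key` argument of `heapq.merge` requires Python 3.5 or later. In practice the output would be written straight to a file instead of being collected into `result`.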

You seem to have similar keys, so try using Counter, which is imported from collections. If the extra info differs between records, then you can use nesting.

This will save you a lot of trouble.
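The suggestion is brief, so the following is only one possible reading of it. It assumes that "nesting" means grouping records by the first key, so that each group needs sorting by position alone, and that `Counter` is meant for collapsing exact duplicate records. The names `counts`, `by_chrom` and `flattened` are made up for the example.

```python
from collections import Counter, defaultdict

records = [
    ('chr1', 34234, 'extrainfo'),
    ('chr1', 1234, 'extrainfo'),
    ('chr3', 4234, 'extrainfo'),
    ('chr1', 3241, 'extrainfo'),
]

# If many records are exact duplicates, a Counter stores each distinct
# record once together with how often it occurs.
counts = Counter(records)

# Nesting: one list of (position, info) pairs per chromosome name.
by_chrom = defaultdict(list)
for chrom, pos, info in records:
    by_chrom[chrom].append((int(pos), info))

# Each group is sorted in place by position; no tuple keys are built.
for pairs in by_chrom.values():
    pairs.sort()

# Walk the groups in name order to recover the fully sorted sequence.
flattened = [
    (chrom, pos, info)
    for chrom in sorted(by_chrom)
    for pos, info in by_chrom[chrom]
]
```

Storing the chromosome name once per group rather than once per record may also reduce the memory footprint, although how much depends on the data.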
