python heapq sorting list wrong?

I am trying to merge several lists, containing the numbers and names of sections, subsections and sub-subsections, into one sorted list. The program looks like this:

import heapq

sections = ['1. Section', '2. Section', '3. Section', '4. Section', '5. Section', '6. Section', '7. Section', '8. Section', '9. Section', '10. Section', '11. Section', '12. Section']
subsections = ['1.1 Subsection', '1.2 Subsection', '1.3 Subsection', '1.4 Subsection', '2.1 Subsection', '4.1 My subsection', '7.1 Subsection', '8.1 Subsection', '12.1 Subsection']
subsubsections = ['1.2.1 Subsubsection', '1.2.2 Subsubsection', '1.4.1 Subsubsection', '2.1.1 Subsubsection', '7.1.1 Subsubsection', '8.1.1 Subsubsection', '12.1.1 Subsubsection']

sorted_list = list(heapq.merge(sections, subsections, subsubsections))

print(sorted_list)

What I get out is this:

['1. Section', '1.1 Subsection', '1.2 Subsection', '1.2.1 Subsubsection', '1.2.2 Subsubsection', '1.3 Subsection', '1.4 Subsection', '1.4.1 Subsubsection', '2. Section', '2.1 Subsection', '2.1.1 Subsubsection', '3. Section', '4. Section', '4.1 My subsection', '5. Section', '6. Section', '7. Section', '7.1 Subsection', '7.1.1 Subsubsection', '8. Section', '8.1 Subsection', '12.1 Subsection', '8.1.1 Subsubsection', '12.1.1 Subsubsection', '9. Section', '10. Section', '11. Section', '12. Section']

The subsection and sub-subsection of my 12th section end up inside the 8th section, not the 12th.

Why is this happening? The original lists are sorted, and everything goes well, apparently up to number 10.

I'm not sure why this is happening. Is there a better way to sort this into a 'tree' based on the numbers in the lists? I'm building a table of contents of sorts, and this will return (once I filter the list out):

1. Section
    1.1 Subsection
    1.2 Subsection
        1.2.1 Subsubsection
        1.2.2 Subsubsection
    1.3 Subsection
    1.4 Subsection
        1.4.1 Subsubsection
2. Section
    2.1 Subsection
        2.1.1 Subsubsection
3. Section
4. Section
    4.1 My subsection
5. Section
6. Section
7. Section
    7.1 Subsection
        7.1.1 Subsubsection
8. Section
    8.1 Subsection
    12.1 Subsection
        8.1.1 Subsubsection
        12.1.1 Subsubsection
9. Section
10. Section
11. Section
12. Section

Notice the 12.1 Subsection behind 8.1 Subsection and 12.1.1 Subsubsection behind 8.1.1 Subsubsection.

Your lists may appear sorted to the human eye. But to Python, your inputs are not fully sorted, because it sorts strings lexicographically. That means that '12' comes before '8' in sorted order, because strings are compared character by character and '1' sorts before '8'.

As such, the merge is completely correct. heapq.merge() assumes each input is already sorted and only ever compares the current head of each list. The string starting with '12.1' only becomes available after the '8.1' string has been yielded, and because it sorts before the string starting with '8.1.1', it is emitted next, with '8.1.1' following afterwards.

You'll have to extract sequences of integers from the strings with a key function to sort correctly:
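The lexicographic comparison can be checked directly. This is a minimal sketch that reuses the subsections list from the question; the helper name is_sorted is made up for illustration and is not part of heapq:

```python
subsections = ['1.1 Subsection', '1.2 Subsection', '1.3 Subsection',
               '1.4 Subsection', '2.1 Subsection', '4.1 My subsection',
               '7.1 Subsection', '8.1 Subsection', '12.1 Subsection']

def is_sorted(items):
    """Return True if items are in non-decreasing order under Python's default comparison."""
    return all(a <= b for a, b in zip(items, items[1:]))

# Strings are compared character by character, so '1' < '8' decides the result.
print('12.1 Subsection' < '8.1 Subsection')  # True
# As far as Python is concerned, this input list is therefore not sorted.
print(is_sorted(subsections))                # False
```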

section = lambda s: [int(d) for d in s.partition(' ')[0].split('.') if d]
heapq.merge(sections, subsections, subsubsections, key=section)

Note that the key argument is only available in Python 3.5 and up; you'd have to do a manual decorate-merge-undecorate dance in earlier versions.

Demo (using Python 3.6):

>>> section = lambda s: [int(d) for d in s.partition(' ')[0].split('.') if d]
>>> sorted_list = list(heapq.merge(sections, subsections, subsubsections, key=section))
>>> from pprint import pprint
>>> pprint(sorted_list)
['1. Section',
 '1.1 Subsection',
 '1.2 Subsection',
 '1.2.1 Subsubsection',
 '1.2.2 Subsubsection',
 '1.3 Subsection',
 '1.4 Subsection',
 '1.4.1 Subsubsection',
 '2. Section',
 '2.1 Subsection',
 '2.1.1 Subsubsection',
 '3. Section',
 '4. Section',
 '4.1 My subsection',
 '5. Section',
 '6. Section',
 '7. Section',
 '7.1 Subsection',
 '7.1.1 Subsubsection',
 '8. Section',
 '8.1 Subsection',
 '8.1.1 Subsubsection',
 '9. Section',
 '10. Section',
 '11. Section',
 '12. Section',
 '12.1 Subsection',
 '12.1.1 Subsubsection']
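Once the list is merged correctly, the indented table of contents from the question could be derived from the length of the key. The following is only a sketch under that assumption; section_key mirrors the section lambda above, and render_toc is a hypothetical helper name:

```python
def section_key(s):
    # Same idea as the `section` lambda: '1.2.1 Title' -> [1, 2, 1]
    return [int(d) for d in s.partition(' ')[0].split('.') if d]

def render_toc(entries, indent='    '):
    """Indent each entry by its nesting depth (number of numeric parts minus one)."""
    return '\n'.join(indent * (len(section_key(e)) - 1) + e for e in entries)

print(render_toc(['1. Section', '1.1 Subsection',
                  '1.1.1 Subsubsection', '2. Section']))
```

Every entry is assumed to start with a number; an entry without one would have an empty key and simply be printed without indentation.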

The keyed merge is easily backported to Python 3.3 and 3.4:

import heapq

def _heappop_max(heap):
    lastelt = heap.pop()
    if heap:
        returnitem = heap[0]
        heap[0] = lastelt
        heapq._siftup_max(heap, 0)
        return returnitem
    return lastelt

def _heapreplace_max(heap, item):
    returnitem = heap[0]
    heap[0] = item
    heapq._siftup_max(heap, 0)
    return returnitem

def merge(*iterables, key=None, reverse=False):    
    h = []
    h_append = h.append

    if reverse:
        _heapify = heapq._heapify_max
        _heappop = _heappop_max
        _heapreplace = _heapreplace_max
        direction = -1
    else:
        _heapify = heapq.heapify
        _heappop = heapq.heappop
        _heapreplace = heapq.heapreplace
        direction = 1

    if key is None:
        for order, it in enumerate(map(iter, iterables)):
            try:
                next = it.__next__
                h_append([next(), order * direction, next])
            except StopIteration:
                pass
        _heapify(h)
        while len(h) > 1:
            try:
                while True:
                    value, order, next = s = h[0]
                    yield value
                    s[0] = next()           # raises StopIteration when exhausted
                    _heapreplace(h, s)      # restore heap condition
            except StopIteration:
                _heappop(h)                 # remove empty iterator
        if h:
            # fast case when only a single iterator remains
            value, order, next = h[0]
            yield value
            yield from next.__self__
        return

    for order, it in enumerate(map(iter, iterables)):
        try:
            next = it.__next__
            value = next()
            h_append([key(value), order * direction, value, next])
        except StopIteration:
            pass
    _heapify(h)
    while len(h) > 1:
        try:
            while True:
                key_value, order, value, next = s = h[0]
                yield value
                value = next()
                s[0] = key(value)
                s[2] = value
                _heapreplace(h, s)
        except StopIteration:
            _heappop(h)
    if h:
        key_value, order, value, next = h[0]
        yield value
        yield from next.__self__

A decorate-merge-undecorate approach is as simple as:

def decorate(iterable, key):
    for elem in iterable:
        yield key(elem), elem

# named sorted_list so the built-in sorted() is not shadowed
sorted_list = [v for k, v in heapq.merge(
    decorate(sections, section), decorate(subsections, section),
    decorate(subsubsections, section))]

Because your inputs are already sorted by section number, merging them is more efficient than a full sort. As a last resort, however, you could just use sorted():

from itertools import chain
result = sorted(chain(sections, subsections, subsubsections), key=section)

As explained in the other answer, you have to specify a sort key, otherwise Python will sort the strings lexicographically. If you are using Python 3.5+, you can use the key argument of the merge function; in earlier versions, you can use itertools.chain and sorted. As a general approach, you can use a regex to find the numbers and convert them to int:

In [18]: from itertools import chain
In [19]: import re
In [23]: sorted(chain.from_iterable((sections, subsections, subsubsections)),
                key = lambda x: [int(i) for i in re.findall(r'\d+', x)])
Out[23]: 
['1. Section',
 '1.1 Subsection',
 '1.2 Subsection',
 '1.2.1 Subsubsection',
 '1.2.2 Subsubsection',
 '1.3 Subsection',
 '1.4 Subsection',
 '1.4.1 Subsubsection',
 '2. Section',
 '2.1 Subsection',
 '2.1.1 Subsubsection',
 '3. Section',
 '4. Section',
 '4.1 My subsection',
 '5. Section',
 '6. Section',
 '7. Section',
 '7.1 Subsection',
 '7.1.1 Subsubsection',
 '8. Section',
 '8.1 Subsection',
 '8.1.1 Subsubsection',
 '9. Section',
 '10. Section',
 '11. Section',
 '12. Section',
 '12.1 Subsection',
 '12.1.1 Subsubsection']
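One caveat: re.findall(r'\d+', x) collects every run of digits in the string, including digits that happen to appear in the title. A hedged variant, assuming the numbering is always the leading token of each entry (the name leading_numbers and the sample titles are made up for illustration):

```python
import re

def leading_numbers(entry):
    """Extract integers from the leading numbering only, e.g. '2.1 Top 10 tips' -> [2, 1]."""
    match = re.match(r'[\d.]+', entry)
    if match is None:
        return []  # no leading number: such entries sort first
    return [int(n) for n in match.group().split('.') if n]

entries = ['2.1.5 Details', '2.1 Top 10 tips']

# Digits in the title ("10") leak into the key and push 2.1 behind 2.1.5.
print(sorted(entries, key=lambda x: [int(i) for i in re.findall(r'\d+', x)]))
# ['2.1.5 Details', '2.1 Top 10 tips']

# Only the leading numbering is used, so 2.1 stays ahead of 2.1.5.
print(sorted(entries, key=leading_numbers))
# ['2.1 Top 10 tips', '2.1.5 Details']
```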
