
Sorting a subset of a list of strings

I have a list of strings containing the column names of a dataframe, and I want to sort a subset of that list so that it follows a standard format.

Specifically, here is an example:

Input list:

array = ['var1', 'var2', 'var3', '2010 a', '2010 b', '2010 c', '2011 a', '2011 b', '2011 c']

Desired output:

array = ['var1', 'var2', 'var3', '2010 a', '2011 a', '2010 b', '2011 b', '2010 c', '2011 c']

In other words, one subset of the list should be left untouched (i.e. var1, var2, var3), and another subset should be sorted first by the element after the whitespace and then by the element before it.

How could this be done efficiently?

In this specific case:

array = ['var1', 'var2', 'var3', '2010 a', '2010 b', '2010 c', '2011 a', '2011 b', '2011 c']

array[3:] = sorted(array[3:], key=lambda s:s.split()[::-1])

The parts of this should be straightforward. The slice assignment replaces everything from the fourth element onwards with that same slice, sorted by a custom key. The key splits each element on whitespace and reverses the resulting parts, so elements are compared by their last part first and by their first part only to break ties.
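To see why the key works, it can help to look at what it returns for a couple of elements. This is only a minimal sketch: the helper name `reversed_parts` is made up for illustration and is equivalent to the lambda above. Note that the years are compared as strings, which is fine as long as they all have the same number of digits.

```python
def reversed_parts(s):
    # Split on whitespace and reverse, so the last part is compared first.
    return s.split()[::-1]

array = ['var1', 'var2', 'var3', '2010 a', '2010 b', '2010 c', '2011 a', '2011 b', '2011 c']

print(reversed_parts('2010 a'))  # ['a', '2010']
print(reversed_parts('2011 a'))  # ['a', '2011']

# Lists compare element by element: the letter decides first,
# and the year only matters when the letters are equal.
array[3:] = sorted(array[3:], key=reversed_parts)
print(array)
# ['var1', 'var2', 'var3', '2010 a', '2011 a', '2010 b', '2011 b', '2010 c', '2011 c']
```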

This solution assumes that 'var...' (or whatever you want to leave untouched) can appear anywhere.

Extract the elements you want to sort, remember their indexes, sort them, then put them back:

lst = ['var1', 'var2', 'var3', '2010 a', '2010 b', '2010 c', '2011 a', '2011 b', '2011 c']

where, what = zip(*((i, x) for i, x in enumerate(lst) if not x.startswith('var')))
what = sorted(what, key=lambda x: x.split()[::-1])

for i, x in zip(where, what):
    lst[i] = x

print(lst)
# ['var1', 'var2', 'var3', '2010 a', '2011 a', '2010 b', '2011 b', '2010 c', '2011 c']
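The same extract, sort and put back idea can be wrapped in a small helper. This is a sketch under assumptions: the function name `sort_where` and its `should_sort` parameter are hypothetical, not part of any library. It returns a new list instead of modifying the input, and it also copes with the case where nothing is selected, where unpacking the result of `zip(*...)` into two names would raise a `ValueError`.

```python
def sort_where(items, should_sort, key=None):
    """Return a copy of items in which only the selected elements are sorted.

    Elements for which should_sort(x) is false keep their positions.
    """
    result = list(items)
    # Remember where the elements to be sorted live.
    positions = [i for i, x in enumerate(result) if should_sort(x)]
    # Sort just those elements, then write them back into the same slots.
    ordered = sorted((result[i] for i in positions), key=key)
    for i, x in zip(positions, ordered):
        result[i] = x
    return result

lst = ['2011 b', 'var1', '2010 b', 'var2', '2011 a', '2010 a']
out = sort_where(lst,
                 should_sort=lambda x: not x.startswith('var'),
                 key=lambda x: x.split()[::-1])
print(out)
# ['2010 a', 'var1', '2011 a', 'var2', '2010 b', '2011 b']
```

The 'var' entries stay at indexes 1 and 3, while the remaining slots are filled in sorted order.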

from bisect import insort_left


def sort_second(string_list: list):
    """
    Sort a list of strings according to the value after the space.
    Entries that do not split into exactly two parts come first, in input order.
    """
    output = []
    sorting_dict = {}
    for string in string_list:
        try:
            # split the first and the second values
            value, key = string.split(" ")
        except ValueError:
            # split didn't yield exactly two parts, so treat it as a single value entry
            output.append(string)
            continue
        try:
            # something with the same key has already been read in:
            # insert this value into the sorted list of value(s) for that key
            insort_left(sorting_dict[key], value)
        except KeyError:
            # nothing else with the same key has been read in yet
            sorting_dict[key] = [value]

    # sorted() orders the second parts (the dict keys)
    for key in sorted(sorting_dict.keys()):
        for value in sorting_dict[key]:
            # each list is already sorted by the first part
            output.append(" ".join((value, key)))

    return output

I'd need to run some tests, but this does the job and should be reasonably quick.

I used dict rather than OrderedDict: insertion order does not matter here because the keys are sorted explicitly at the end. (OrderedDict used to be implemented in pure Python, but CPython has had a C implementation since 3.5, so speed is no longer the deciding factor.)

I think the time complexity is O(n log n). sorted() uses Timsort, which is O(n log n) in the worst case. Note that insort_left finds the position in O(log n) but the list insertion itself is O(n), so a key with very many values could cost more than that.

Edit: I accidentally stated the time complexity as O(log n) in the original post; as Kelly pointed out, this is impossible. The intended complexity (as edited above) is O(n log n).
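If the per-key insort_left calls are a concern, the same result can probably be had from a single sort. The following is a sketch, not a drop-in replacement: `sort_second_simple` is a hypothetical name, and it assumes every string contains at most one space (strings with more than one space are treated differently than in the function above).

```python
def sort_second_simple(string_list):
    """Single-value entries first (input order), then pairs sorted by
    the part after the space and, on ties, by the part before it."""
    singles = [s for s in string_list if " " not in s]
    pairs = [s for s in string_list if " " in s]
    # "2010 a" -> ["a", "2010"], so the second part is compared first.
    pairs.sort(key=lambda s: s.split(" ", 1)[::-1])
    return singles + pairs

array = ['var1', 'var2', 'var3', '2010 a', '2010 b', '2010 c', '2011 a', '2011 b', '2011 c']
print(sort_second_simple(array))
# ['var1', 'var2', 'var3', '2010 a', '2011 a', '2010 b', '2011 b', '2010 c', '2011 c']
```

This does two linear passes plus one O(n log n) sort, with no repeated list insertions.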

Given:

array = ['var1', 'var2', 'var3', '2010 a', '2010 b', '2010 c', '2011 a', '2011 b', '2011 c']

desired = ['var1', 'var2', 'var3', '2010 a', '2011 a', '2010 b', '2011 b', '2010 c', '2011 c']

Three easy ways.

First, split and check whether there is a second element:

>>> sorted(array, key=lambda e: "" if len(e.split())==1 else e.split()[1])==desired
True

Or use partition and take the last element:

>>> sorted(array, key=lambda e: e.partition(' ')[2])==desired
True

Or use a regex to remove the first element:

>>> import re
>>> sorted(array, key=lambda e: re.sub(r'^\S+\s*','', e))==desired
True

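All three rely on the fact that Python's sort is stable: elements with equal keys keep their input order. The var entries all get an empty key, so they stay in front in their original order, and within each letter the years stay in the order in which they appear in the input (already ascending here).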
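The stability point matters when the input is not already ordered by year. As an illustration (the list `shuffled` and the helper `suffix_then_prefix` are made up for this sketch), a suffix-only key keeps ties in input order, while a two-part key breaks ties by year as well:

```python
shuffled = ['var1', '2011 a', 'var2', '2010 b', '2010 a', '2011 b']

# Key on the suffix only: ties keep their input order (stable sort),
# so '2011 a' stays ahead of '2010 a'.
by_suffix = sorted(shuffled, key=lambda e: e.partition(' ')[2])
print(by_suffix)
# ['var1', 'var2', '2011 a', '2010 a', '2010 b', '2011 b']

def suffix_then_prefix(e):
    head, _, tail = e.partition(' ')
    # Entries without a space get ('', ''), so they all tie and keep
    # their input order at the front.
    return (tail, head if tail else '')

by_both = sorted(shuffled, key=suffix_then_prefix)
print(by_both)
# ['var1', 'var2', '2010 a', '2011 a', '2010 b', '2011 b']
```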
