
Sorting a string in array, making it sparsely populated

For example, say I have a string like:

duck duck duck duck goose goose goose dog 

And I want it to be as sparsely populated as possible, say in this case:

duck goose duck goose dog duck goose duck

What sort of algorithm would you recommend? Snippets of code or general pointers would be useful. Python and C++ are welcome, with extra kudos if you have a way to do it in bash.

I would sort the array by number of duplicates and, starting from the most duplicated element, spread those elements as far apart as possible.

In your example, duck is duplicated 4 times, so duck will be put in position n*8/4 for n from 0 to 3 inclusive.

Then put the next most repeated one (goose) in positions n*8/3 + 1 for n from 0 to 2 inclusive. If something is already placed there, just put it in the next empty spot. And so on.
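The placement rule above can be sketched roughly as follows. This is only one reading of the answer: the name `spread_by_count` is hypothetical, the division is assumed to round down, the starting offset is assumed to grow by one per item type (0 for duck, 1 for goose, and so on), and the search for the next empty spot is assumed to wrap around at the end of the list.

```python
from collections import Counter

def spread_by_count(items):
    """Place items at evenly spaced slots, most frequent item first.

    Hypothetical helper: offset-per-rank and wrap-around are assumptions.
    """
    size = len(items)
    slots = [None] * size
    # most_common() yields (item, count) pairs in descending count order
    for offset, (item, count) in enumerate(Counter(items).most_common()):
        for n in range(count):
            pos = (n * size // count + offset) % size
            # slot already taken: move on to the next empty spot
            while slots[pos] is not None:
                pos = (pos + 1) % size
            slots[pos] = item
    return slots

print(spread_by_count("duck duck duck duck goose goose goose dog".split()))
# → ['duck', 'goose', 'duck', 'goose', 'duck', 'dog', 'duck', 'goose']
```

The inner loop always terminates because the number of slots equals the number of items, so an empty slot exists whenever an item is still waiting to be placed.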

I think something like this is the general idea:

from itertools import cycle, islice, groupby

L = "duck duck duck duck goose goose goose dog ".split()

# from: http://docs.python.org/library/itertools.html#recipes
def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    # Recipe credited to George Sakkis
    pending = len(iterables)
    # Python 3: iterators expose __next__ instead of next
    nexts = cycle(iter(it).__next__ for it in iterables)
    while pending:
        try:
            for next in nexts:
                yield next()
        except StopIteration:
            pending -= 1
            nexts = cycle(islice(nexts, pending))

groups = [list(it) for k, it in groupby(sorted(L))]

# some extra prints so you get the idea
print(sorted(L))
print(groups)
print(list(roundrobin(*groups)))

Output:

['dog', 'duck', 'duck', 'duck', 'duck', 'goose', 'goose', 'goose']
[['dog'], ['duck', 'duck', 'duck', 'duck'], ['goose', 'goose', 'goose']]
['dog', 'duck', 'goose', 'duck', 'goose', 'duck', 'goose', 'duck']

So you want some kind of round robin :-)


Well, round-robin is not perfect.

Here is the brute force (aka horribly inefficient) version of what you were thinking about.

from itertools import permutations

L = "duck duck duck duck goose goose goose dog".split()

# this is the function we want to maximize
def space_sum(L):
    """Return the sum of all spaces between all elements in L."""
    unique = set(L)
    def space(val):
        """Count how many elements are between two consecutive val."""
        c = 0
        # start just after the first occurrence of val, then count
        for x in L[1 + L.index(val):]:
            if x == val:
                yield c
                c = 0
            else:
                c += 1
    return sum(sum(space(val)) for val in unique)

print(max((space_sum(v), v) for v in permutations(L)))

# there are tons of equally good solutions
print(sorted(permutations(L), key=space_sum, reverse=True)[:100])

How do you actually measure sparsity? By the way, a simple random shuffle may work.
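To make that question concrete, here is one possible measure together with a shuffle-based search. Both are assumptions for illustration, not something the question defines: `min_gap` (hypothetical name) scores a sequence by the smallest distance between two equal elements, and `best_shuffle` (also hypothetical) keeps the best-scoring result out of a fixed number of random shuffles.

```python
import random

def min_gap(seq):
    """Smallest index distance between two equal elements.

    Returns len(seq) when no element repeats. Assumed metric, one of many.
    """
    last_seen = {}
    best = len(seq)
    for i, x in enumerate(seq):
        if x in last_seen:
            best = min(best, i - last_seen[x])
        last_seen[x] = i
    return best

def best_shuffle(items, tries=1000, seed=0):
    """Return the shuffle with the largest min_gap among `tries` attempts."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    best = list(items)
    for _ in range(tries):
        candidate = list(items)
        rng.shuffle(candidate)
        if min_gap(candidate) > min_gap(best):
            best = candidate
    return best

grouped = "duck duck duck duck goose goose goose dog".split()
spread = "duck goose duck goose dog duck goose duck".split()
print(min_gap(grouped), min_gap(spread))  # → 1 2
print(best_shuffle(grouped))
```

A single shuffle gives no guarantee, which is why the sketch retries and keeps the best candidate. Other metrics, such as the sum of gaps, would rank arrangements differently.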

Sort your types by count.

  1. Place the items of type 1 in a linked list. (Store the middle link.)
  2. Take the next item type, with count c, while the current list size is N. Distribute its c items across the list using 'banker's rounding', starting from the middle of the list.
  3. Go to 2.

There are good answers above about sorting and separating the most common strings the farthest. But if you have so much data that you can't sort or don't want to take the time, look into quasirandom numbers ( http://mathworld.wolfram.com/QuasirandomSequence.html ). There's a simple implementation of this in the Numerical Recipes book. These are numbers that "look" random, i.e., they fill a space but try to avoid each other as much as possible. They are used a lot in applications where you want to "randomly" sample something, but rather than being truly random you want to sample the whole space efficiently.
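As a rough illustration of the quasirandom idea, the sketch below uses the base-2 van der Corput sequence, one of the simplest low-discrepancy sequences. Using it as a sort key for the positions is an assumption made for this example, not the Numerical Recipes implementation, and the name `quasirandom_spread` is hypothetical. It also assumes that equal items start out next to each other, as in the question's input.

```python
def van_der_corput(n, base=2):
    """n-th term of the van der Corput sequence: the digits of n in the
    given base, mirrored around the radix point."""
    q, denom = 0.0, 1.0
    while n:
        n, digit = divmod(n, base)
        denom *= base
        q += digit / denom
    return q

def quasirandom_spread(items):
    """Reorder items so that neighbouring input positions end up far apart.

    Assumption: index i is moved according to van_der_corput(i).
    """
    order = sorted(range(len(items)), key=van_der_corput)
    return [items[i] for i in order]

print(quasirandom_spread("duck duck duck duck goose goose goose dog".split()))
# → ['duck', 'goose', 'duck', 'goose', 'duck', 'goose', 'duck', 'dog']
```

The result depends on the input order, so this is a cheap heuristic rather than an optimal arrangement.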

If I understood your definition of “sparse” correctly, this function should be exactly what you want:

# Python 3
import itertools, heapq

def make_sparse(sequence):
    grouped = sorted(sequence)
    item_counts = []
    for item, item_seq in itertools.groupby(grouped):
        count = len(list(item_seq))
        item_counts.append((-count, item))  # negative count for heapq purposes
    heapq.heapify(item_counts)

    # always emit the most frequent remaining item that differs from the last one
    count1, item1 = heapq.heappop(item_counts)
    yield item1; count1 += 1
    while True:
        try:
            count2, item2 = heapq.heappop(item_counts)
        except IndexError:  # no other item remains
            break
        yield item2; count2 += 1
        if count1 < 0:
            heapq.heappush(item_counts, (count1, item1))
        item1, count1 = item2, count2

    # loop is done, produce remaining item1 items
    while count1 < 0:
        yield item1; count1 += 1

if __name__ == "__main__":
    # initial example
    print(list(make_sparse(
        "duck duck duck duck goose goose goose dog".split())))
    # updated example
    print(list(make_sparse([
        'duck', 'duck', 'duck', 'duck', 'duck', 'duck',
        'goose', 'goose', 'goose', 'goose', 'dog', 'dog'])))
    # now a hard case: item 'a' appears more than:
    # > total_len//2 times if total_len is even
    # > total_len//2+1 times if total_len is odd
    print(list(make_sparse("aaaaaabbcc")))

These examples produce this output:

['duck', 'goose', 'duck', 'goose', 'duck', 'dog', 'duck', 'goose']
['duck', 'goose', 'duck', 'goose', 'duck', 'dog', 'duck', 'goose', 'duck', 'dog', 'duck', 'goose']
['a', 'b', 'a', 'c', 'a', 'b', 'a', 'c', 'a', 'a']

A subtle note: in the first and second examples, reversing the output order might look more optimal.
