
Python itertools groupby

Let's say I have the following list of tuples:

[('FRG', 'MCO TPA PIE SRQ', 'WAVEY EMJAY J174 SWL CEBEE '), 
('                    ', 'FMY RSW APF', 'WETRO DIW AR22 JORAY HILEY4'),
('                    ', 'FMY RSW APF', 'WETRO DIW AR22 JORAY HILEY4'),
('FRG2', 'MCO TPA PIE SRQ', 'WAVEY EMJAY J174 SWL CEBEE '), 
('                    ', 'FMY RSW APF', 'WETRO DIW AR22 JORAY HILEY4')]

How do I group these so that I end up with a dict like:

{'FRG': ['MCO TPA PIE SRQ', 'WAVEY EMJAY J174 SWL CEBEE ', 'FMY RSW APF', 'WETRO DIW AR22 JORAY HILEY4', 'FMY RSW APF', 'WETRO DIW AR22 JORAY HILEY4'],
 'FRG2': ...}

That is to say, I'd like to glue together each part where tuple[0] is a word with the (potentially numerous) following parts where tuple[0] is empty (contains only whitespace).
I was experimenting with groupby and takewhile from itertools but haven't reached a working solution. Ideally, the solution uses one of these (for learning purposes, that is).

Not that I recommend it, but to use itertools.groupby() for this, you'd need a key function that remembers the last used key. Something like this:

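from itertools import groupby

def keyfunc(item, keys=[None]):
    # The mutable default argument is deliberate: it remembers the last key between calls.
    if item[0] != keys[-1] and not item[0].startswith(" "):
        keys.append(item[0])
    return keys[-1]

# lst is the list of tuples from the question
d = {k: [y for x in g for y in x[1:]] for k, g in groupby(lst, key=keyfunc)}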
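One caveat with the key function above: the default list is shared by every call, so state from one run carries over into the next. A minimal sketch of a closure-based variant that starts fresh each time, assuming a blank first element means "only whitespace" (the name make_keyfunc and the shortened sample data are hypothetical, chosen only for illustration):

```python
from itertools import groupby

def make_keyfunc():
    """Return a fresh key function that remembers the last non-blank key."""
    last = None

    def keyfunc(item):
        nonlocal last
        if item[0].strip():      # a non-blank first element starts a new group
            last = item[0]
        return last

    return keyfunc

# Shortened, made-up sample data with the same shape as the question's list
lst = [('FRG', 'a', 'b'),
       ('   ', 'c', 'd'),
       ('FRG2', 'e', 'f'),
       ('   ', 'g', 'h')]

d = {k: [y for x in g for y in x[1:]]
     for k, g in groupby(lst, key=make_keyfunc())}
print(d)
# {'FRG': ['a', 'b', 'c', 'd'], 'FRG2': ['e', 'f', 'g', 'h']}
```

Calling make_keyfunc() once per groupby call keeps the state local to that run.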

A simple for loop looks cleaner and doesn't require any imports:

d, key = {}, None
for item in lst:
    if item[0] != key and not item[0].startswith(" "):
        key = item[0]
    d.setdefault(key, []).extend(item[1:])

A solution using collections.defaultdict (a dict subclass):

import collections

l = [('FRG', 'MCO TPA PIE SRQ', 'WAVEY EMJAY J174 SWL CEBEE '),
('                    ', 'FMY RSW APF', 'WETRO DIW AR22 JORAY HILEY4'),
('                    ', 'FMY RSW APF', 'WETRO DIW AR22 JORAY HILEY4'),
('FRG2', 'MCO TPA PIE SRQ', 'WAVEY EMJAY J174 SWL CEBEE '),
('                    ', 'FMY RSW APF', 'WETRO DIW AR22 JORAY HILEY4')]

d = collections.defaultdict(list)
k = ''
for t in l:
    if t[0].strip():  # if the 1st value of a tuple is not empty
        k = t[0]      # capturing dict key
    if k:
        d[k].append(t[1])
        d[k].append(t[2])

print(dict(d))

The output:

{'FRG2': ['MCO TPA PIE SRQ', 'WAVEY EMJAY J174 SWL CEBEE ', 'FMY RSW APF', 'WETRO DIW AR22 JORAY HILEY4'], 'FRG': ['MCO TPA PIE SRQ', 'WAVEY EMJAY J174 SWL CEBEE ', 'FMY RSW APF', 'WETRO DIW AR22 JORAY HILEY4', 'FMY RSW APF', 'WETRO DIW AR22 JORAY HILEY4']}

The functions groupby and takewhile aren't good fits for this sort of problem.

groupby

groupby groups based on a key function. That means you need to remember the last encountered non-whitespace first tuple element to make it work, so you have to keep some global state around. A function that keeps such state is said to be "impure", while most (or even all) itertools functions are pure.

from itertools import groupby, chain

d = [('FRG',                  'MCO TPA PIE SRQ', 'WAVEY EMJAY J174 SWL CEBEE '), 
     ('                    ', 'FMY RSW APF',     'WETRO DIW AR22 JORAY HILEY4'),
     ('                    ', 'FMY RSW APF',     'WETRO DIW AR22 JORAY HILEY4'),
     ('FRG2',                 'MCO TPA PIE SRQ', 'WAVEY EMJAY J174 SWL CEBEE '), 
     ('                    ', 'FMY RSW APF',     'WETRO DIW AR22 JORAY HILEY4')]

def keyfunc(item):
    first = item[0]
    if first.strip():
        keyfunc.state = first
    return keyfunc.state

{k: [item for idx, item in enumerate(chain.from_iterable(grp)) if idx%3 != 0] for k, grp in groupby(d, keyfunc)}
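If the function-attribute state feels uncomfortable, one possible variation is to forward-fill the keys first with itertools.accumulate and then group on the filled key with a pure key function. This is only a sketch with shortened, made-up sample data (rows, keys, and result are illustrative names); it assumes the first row has a non-blank key, otherwise the blank string itself becomes a key:

```python
from itertools import accumulate, groupby

rows = [('FRG', 'a', 'b'),
        ('   ', 'c', 'd'),
        ('   ', 'e', 'f'),
        ('FRG2', 'g', 'h'),
        ('   ', 'i', 'j')]

# Forward-fill: a blank first element inherits the previous key.
keys = accumulate((r[0] for r in rows),
                  lambda prev, cur: cur if cur.strip() else prev)

# Pair each row with its filled key, then group consecutive pairs by that key.
result = {
    key: [value for _, row in group for value in row[1:]]
    for key, group in groupby(zip(keys, rows), key=lambda pair: pair[0])
}
print(result)
# {'FRG': ['a', 'b', 'c', 'd', 'e', 'f'], 'FRG2': ['g', 'h', 'i', 'j']}
```

The state still exists, but it lives inside accumulate rather than in the key function.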

takewhile

takewhile needs to look ahead to determine when to stop yielding values. That means it always consumes one more value from the iterator than it yields for each group. To apply it here you would need to remember the last position and create a new iterator each time. You would also need to keep some sort of state, because you want to take one element whose first item is not blank and then the following elements whose first item contains only whitespace.
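The look-ahead behaviour is easy to see on a shared iterator. A minimal sketch with made-up numbers (the names it, small, and resumed are illustrative only):

```python
from itertools import takewhile

it = iter([1, 2, 3, 10, 4, 5])

small = list(takewhile(lambda x: x < 5, it))
print(small)      # [1, 2, 3]

# The 10 was pulled from `it` to test the predicate and then dropped,
# so the shared iterator resumes *after* it.
resumed = next(it)
print(resumed)    # 4
```

In the grouping problem the dropped element would be the header row of the next group, which is why takewhile cannot simply be applied repeatedly to one iterator.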

One approach could look like this (but it feels unnecessarily complicated):

from itertools import takewhile, islice

def takegen(inp):
    idx = 0
    length = len(inp)
    while idx < length:
        first, *rest = inp[idx]
        rest = list(rest)
        for _, *lasts in takewhile(lambda x: not x[0].strip(), islice(inp, idx+1, None)):
            rest.extend(lasts)
        idx += len(rest) // 2
        yield first, rest

dict(takegen(d))

Alternative

You could simply write your own generator, which makes this quite easy. It's a variation of the takewhile approach, but it needs no external state, no islice, takewhile, or groupby, and no index tracking:

def gen(inp):
    # Initial values
    last = None
    for first, *rest in inp:
        if last is None:       # first encountered item
            last = first
            l = list(rest)
        elif first.strip():    # when the first tuple item isn't all whitespaces
            # Yield the last "group"
            yield last, l
            # New values for the next "group"
            last = first
            l = list(rest)
        else:                  # when the first tuple item is all whitespaces
            l.extend(rest)
    # Yield the last group
    yield last, l

dict(gen(d))
# {'FRG2': ['MCO TPA PIE SRQ', 'WAVEY EMJAY J174 SWL CEBEE ', 'FMY RSW APF', 'WETRO DIW AR22 JORAY HILEY4'], 
#  'FRG': ['MCO TPA PIE SRQ', 'WAVEY EMJAY J174 SWL CEBEE ', 'FMY RSW APF', 'WETRO DIW AR22 JORAY HILEY4', 'FMY RSW APF', 'WETRO DIW AR22 JORAY HILEY4']}
