
Unique Combinations in a list of k,v tuples in Python

I have a list of tuples containing various combinations of items:

example = [(1,2), (2,1), (1,1), (1,1), (2,1), (2,3,1), (1,2,3)]

I want to group and count by unique combination, ignoring the order of the elements, yielding the result:

result = [((1,2), 3), ((1,1), 2), ((2,3,1), 2)]

It does not matter whether the order is maintained or which permutation of each combination is preserved. It is very important, however, that the operation be done with a lambda function and that the output still be a list of tuples as above, because I will be working with a Spark RDD object.

My code currently counts patterns taken from a data set using:

from operator import add

# sc is an existing SparkContext
RDD = sc.parallelize(example)
result = RDD.map(lambda y: (y, 1))\
            .reduceByKey(add)\
            .collect()
print(result)

I need another .map command that will account for the different permutations, as explained above.

You can use an OrderedDict keyed on the sorted form of each item:

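>>> from collections import OrderedDict
>>> example = [('a','b'), ('a','a','a'), ('a','a'), ('b','a'), ('c','d'), ('b','c','a'), ('a','b','c')]
>>> d = OrderedDict()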
>>> for i in example:
...   d.setdefault(tuple(sorted(i)),i)
... 
('a', 'b')
('a', 'a', 'a')
('a', 'a')
('a', 'b')
('c', 'd')
('b', 'c', 'a')
('b', 'c', 'a')
>>> d
OrderedDict([(('a', 'b'), ('a', 'b')), (('a', 'a', 'a'), ('a', 'a', 'a')), (('a', 'a'), ('a', 'a')), (('c', 'd'), ('c', 'd')), (('a', 'b', 'c'), ('b', 'c', 'a'))])
>>> list(d.values())
[('a', 'b'), ('a', 'a', 'a'), ('a', 'a'), ('c', 'd'), ('b', 'c', 'a')]

How about this: maintain a set that contains the sorted form of each item you've already seen. Only add an item to the result list if you haven't seen its sorted form already.

example = [ ('a','b'), ('a','a','a'), ('a','a'), ('b','a'), ('c', 'd'), ('b','c','a'), ('a','b','c') ]
result = []
seen = set()
for item in example:
    sorted_form = tuple(sorted(item))
    if sorted_form not in seen:
        result.append(item)
        seen.add(sorted_form)
print(result)

Result:

[('a', 'b'), ('a', 'a', 'a'), ('a', 'a'), ('c', 'd'), ('b', 'c', 'a')]
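The question also asks for a count per combination. As a minimal sketch in plain Python (not Spark), the same sorted-form key can feed a collections.Counter. The helper name count_combinations is hypothetical, and the input is the numeric example from the question.

```python
from collections import Counter

def count_combinations(items):
    """Count tuples that contain the same elements, ignoring order."""
    # Sorting each tuple gives every permutation the same canonical key.
    counts = Counter(tuple(sorted(item)) for item in items)
    # Return (combination, count) pairs as a list of tuples.
    return list(counts.items())

example = [(1, 2), (2, 1), (1, 1), (1, 1), (2, 1), (2, 3, 1), (1, 2, 3)]
result = count_combinations(example)
print(result)
```

In Spark, the same key function would presumably go in the map step, for instance .map(lambda y: (tuple(sorted(y)), 1)) before .reduceByKey(add); that variant is not tested here.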

Since you are looking for a lambda function, try the following:

lambda x, y=OrderedDict(): [a for a in x if y.setdefault(tuple(sorted(a)), a) and False] or y.values()

You can use this lambda function like so:

from collections import OrderedDict

uniquify = lambda x, y=OrderedDict(): [a for a in x if y.setdefault(tuple(sorted(a)), a) and False] or y.values()
result = uniquify(example)

Obviously, this sacrifices readability compared with the other answers. It does basically the same thing as Kasramvd's answer, in a single ugly line.
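One caveat with a mutable default argument such as y=OrderedDict() is that it is created once and shared by every call, so a second call would still contain entries from the first. A possible stateless alternative, sketched here under the assumption of Python 3.7+ (where plain dicts keep insertion order), builds a fresh dict on each call. It keeps the last permutation seen for each combination rather than the first, which the question says is acceptable.

```python
# Stateless variant: the dict is rebuilt on every call.
# Keys appear in first-seen order; each value is the last permutation seen.
uniquify = lambda x: list({tuple(sorted(a)): a for a in x}.values())

example = [('a', 'b'), ('a', 'a', 'a'), ('a', 'a'), ('b', 'a'),
           ('c', 'd'), ('b', 'c', 'a'), ('a', 'b', 'c')]
first = uniquify(example)
second = uniquify(example)  # same result; nothing carries over between calls
print(first)
```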

This is similar to the sorted-dict approach.

from itertools import groupby
ex = [(1,2,3), (3,2,1), (1,1), (2,1), (1,2), (3,2), (2,3,1)]
f = lambda x: tuple(sorted(x))  # sorted form of each tuple, used as the grouping key
[tuple(k) for k, _ in groupby(sorted(ex, key=f), key=f)]

The nice thing is that you can see which tuples belong to the same combination:

In [16]: example = [ ('a','b'), ('a','a','a'), ('a','a'), ('a', 'a', 'a', 'a'), ('b','a'), ('c', 'd'), ('b','c','a'), ('a','b','c') ]
In [17]: for k, grpr in groupby(sorted(example, key=lambda x: tuple(sorted(x))), key=lambda x: tuple(sorted(x))):
   ....:     print(k, list(grpr))
   ....:
('a', 'a') [('a', 'a')]
('a', 'a', 'a') [('a', 'a', 'a')]
('a', 'a', 'a', 'a') [('a', 'a', 'a', 'a')]
('a', 'b') [('a', 'b'), ('b', 'a')]
('a', 'b', 'c') [('b', 'c', 'a'), ('a', 'b', 'c')]
('c', 'd') [('c', 'd')]

Based on the comments, what you actually seem to need is map-reduce. I don't have Spark installed, but according to the docs (see transformations), it should look something like this:

data.map(lambda i: (frozenset(i), i)).reduceByKey(lambda _, i : i)

Note, however, that this returns (b, a) if your dataset contains (a, b), (b, a) in that order.
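To see what that keying does without a Spark installation, here is a small local sketch. reduce_by_key is a hypothetical stand-in written for this illustration, not Spark's implementation; it only mimics the idea of folding together values that share a key. It also shows one side effect of frozenset keys: tuples with repeated elements, such as ('a', 'a') and ('a', 'a', 'a'), collapse into the same key, whereas a sorted-tuple key keeps them apart.

```python
def reduce_by_key(pairs, func):
    """Local stand-in for a reduce-by-key step: fold values sharing a key."""
    out = {}
    for key, value in pairs:
        out[key] = func(out[key], value) if key in out else value
    return list(out.items())

data = [('a', 'b'), ('b', 'a'), ('a', 'a'), ('a', 'a', 'a')]

# Keying on frozenset: the last tuple seen wins for each key, and
# ('a', 'a') and ('a', 'a', 'a') end up under the same key.
by_set = reduce_by_key(((frozenset(i), i) for i in data), lambda _, i: i)

# Keying on the sorted tuple keeps repeated elements distinct.
by_sorted = reduce_by_key(((tuple(sorted(i)), i) for i in data), lambda _, i: i)

print(len(by_set), len(by_sorted))
```

Which key is preferable depends on whether repeated elements should count as a different combination.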

I solved my own problem, though it was difficult to pin down what I was really looking for. I used:

from operator import add

example = [(1,2), (1,1,1), (1,1), (1,1), (2,1), (3,4), (2,3,1), (1,2,3)]
RDD = sc.parallelize(example)
# sort the de-duplicated elements so equal combinations always produce the same key
result = RDD.map(lambda x: sorted(set(x)))\
            .filter(lambda x: len(x) > 1)\
            .map(lambda x: (tuple(x), 1))\
            .reduceByKey(add)\
            .collect()
print(result)

This also eliminated tuples that merely repeat a single value, such as (1,1) and (1,1,1), which was an added benefit for me.
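For readers without a Spark cluster at hand, the same steps can be checked locally. This is only a sketch of the logic in plain Python, with a Counter standing in for the (key, 1) pairs and reduceByKey(add). It sorts the de-duplicated elements, an added step assumed here so that the key does not depend on set iteration order.

```python
from collections import Counter

example = [(1, 2), (1, 1, 1), (1, 1), (1, 1), (2, 1), (3, 4), (2, 3, 1), (1, 2, 3)]

# map: drop repeated elements and sort, so permutations share one key
keys = (tuple(sorted(set(x))) for x in example)
# filter: discard tuples left with a single distinct element, e.g. (1, 1)
keys = (k for k in keys if len(k) > 1)
# counting each key is what summing the 1s per key amounts to
result = list(Counter(keys).items())
print(result)
```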
