Generating 2-by-2 tuples from a list and finding the duplicated tuples in Python

I am a beginner in Python and I am having trouble generating tuples from my DataFrame and identifying the duplicates among them.

First, I have this list of userid values:

'userid': ["us1", "us2", "us1", "us2", "us4", "us4", "us5", "us1", "us2"]

I want to generate 2-by-2 tuples in the order in which the userid values appear in the list, so the result would be:

[('us1', 'us2'),
 ('us2', 'us1'),
 ('us1', 'us2'),
 ('us2', 'us4'),
 ('us4', 'us4'),
 ('us4', 'us5'),
 ('us5', 'us1'),
 ('us1', 'us2')]

But the tuples I get are these (and I don't understand why):

 [('us1', 'us2'),
 ('us2', 'us1'),
 ('us1', 'us4'),
 ('us4', 'us2'),
 ('us2', 'us5'),
 ('us5', 'us4'),
 ('us4', 'us1'),
 ('us1', 'us2')]

Here is my code:

    import pandas as pd

    d = {'id': ["a", "a", "a", "a", "a", "a", "a", "a", "a"], 'id2': ["b", "b", "b", "b", "b", "b", "b", "b", "b"], 'userid': ["us1", "us2", "us1", "us2", "us4", "us4", "us5", "us1", "us2"], "time": [1, 2, 3, 5, 4, 7, 6, 8, 9]}
    # Build the DataFrame and sort the rows by time
    df_test = pd.DataFrame(data=d).sort_values('time')
    # Collect the columns into lists per (id, id2) group; the result is not assigned
    df_test.groupby(['id','id2']).agg(lambda x: x.tolist()).reset_index()
    # Pair each userid with the one that follows it
    test2 = list(zip(df_test.userid[:-1], df_test.userid[1:]))
    zipped_list = test2[:]
    list(test2)

In addition, my next step will be to find the duplicates among these tuples and extract them into a new list. So for these tuples:

    [('us1', 'us2'),
     ('us2', 'us1'),
     ('us1', 'us2'),
     ('us2', 'us4'),
     ('us4', 'us4'),
     ('us4', 'us5'),
     ('us5', 'us1'),
     ('us1', 'us2')]

the result should be the list [('us1', 'us2'), 3], because ('us1', 'us2') is the only tuple that is duplicated, and the 3 says that it appears 3 times.

In short, I cannot find the error that keeps me from generating the tuples in the order I want, and I have no idea how to find the duplicates.

Let us use frozenset + value_counts. A frozenset ignores the order within each pair, so ('us1', 'us2') and ('us2', 'us1') are counted together:

pd.Series(list(map(frozenset,zipped_list))).value_counts()
(us2, us1)    3
(us1, us4)    2
(us2, us5)    1
(us5, us4)    1
(us2, us4)    1
dtype: int64

If you only need the list with each pair made order-independent:

l=list(map(frozenset,zipped_list))

Or we can use numpy to sort within each pair:

np.sort(zipped_list,axis=1).tolist()
[['us1', 'us2'], ['us1', 'us2'], ['us1', 'us4'], ['us2', 'us4'], ['us2', 'us5'], ['us4', 'us5'], ['us1', 'us4'], ['us1', 'us2']]
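
The numpy result is a list of lists, and lists are not hashable, so they cannot be counted directly with a dict-based counter. As a rough pure-Python sketch, each pair can instead be normalized to a sorted tuple and counted with collections.Counter. This assumes zipped_list holds the pairs produced by the question's code; the names normalized and counts are only illustrative.

```python
from collections import Counter

# Pairs as produced by the question's code (rows sorted by time).
zipped_list = [('us1', 'us2'), ('us2', 'us1'), ('us1', 'us4'), ('us4', 'us2'),
               ('us2', 'us5'), ('us5', 'us4'), ('us4', 'us1'), ('us1', 'us2')]

# Sort inside each pair so that ('us2', 'us1') becomes ('us1', 'us2').
normalized = [tuple(sorted(pair)) for pair in zipped_list]

# Count the order-insensitive pairs.
counts = Counter(normalized)
print(counts.most_common(2))  # [(('us1', 'us2'), 3), (('us1', 'us4'), 2)]
```

The counts agree with the value_counts output above, because both approaches ignore the order within a pair.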

Update: you call sort_values first, so we need sort_index to get back to the original order:

list(zip(df_test.userid[:-1].sort_index(), df_test.userid[1:].sort_index()))
[('us1', 'us2'), ('us2', 'us1'), ('us1', 'us2'), ('us2', 'us4'), ('us4', 'us4'), ('us4', 'us5'), ('us5', 'us1'), ('us1', 'us2')]
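
To see why the pairs in the question came out in a different order, and to get the duplicate count the question asks for, here is a minimal pandas-free sketch. It assumes the same userid and time values as in the question. The names by_time, pairs and duplicates are illustrative, and collections.Counter is swapped in for value_counts so that the order within each tuple is respected.

```python
from collections import Counter

userid = ["us1", "us2", "us1", "us2", "us4", "us4", "us5", "us1", "us2"]
time = [1, 2, 3, 5, 4, 7, 6, 8, 9]

# Sorting by time swaps some neighbours, which mirrors what
# sort_values('time') does to the rows of the DataFrame.
by_time = [u for _, u in sorted(zip(time, userid))]
print(by_time)

# Pairing in the original list order gives the tuples the question expects.
pairs = list(zip(userid[:-1], userid[1:]))

# Ordered tuples: ('us1', 'us2') and ('us2', 'us1') are counted separately.
counts = Counter(pairs)

# Keep only the tuples that occur more than once, together with their count.
duplicates = [[pair, n] for pair, n in counts.items() if n > 1]
print(duplicates)  # [[('us1', 'us2'), 3]]
```

Unlike the frozenset approach, this treats ('us2', 'us1') as a different tuple from ('us1', 'us2'), which seems to be what the expected result in the question implies.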
