I am a beginner in Python and I am having trouble generating tuples from a column of my DataFrame and identifying duplicates among them.
First, I have this list of userid values:
'userid': ["us1", "us2", "us1", "us2", "us4", "us4", "us5", "us1", "us2"]
I want to generate tuples of two consecutive elements, in the order the userid values appear in the list, so the result would be:
[('us1', 'us2'),
('us2', 'us1'),
('us1', 'us2'),
('us2', 'us4'),
('us4', 'us4'),
('us4', 'us5'),
('us5', 'us1'),
('us1', 'us2')]
But these are the tuples I get (and I don't understand why):
[('us1', 'us2'),
('us2', 'us1'),
('us1', 'us4'),
('us4', 'us2'),
('us2', 'us5'),
('us5', 'us4'),
('us4', 'us1'),
('us1', 'us2')]
Here is my code:
import pandas as pd
d = {'id': ["a", "a", "a", "a", "a", "a", "a", "a", "a"], 'id2': ["b", "b", "b", "b", "b", "b", "b", "b", "b"], 'userid': ["us1", "us2", "us1", "us2", "us4", "us4", "us5", "us1", "us2"], "time": [1, 2, 3, 5, 4, 7, 6, 8, 9]}
# Build the DataFrame and sort its rows by time
df_test = pd.DataFrame(data=d).sort_values('time')
# Collect each column into a list per (id, id2) group (the result is not stored)
df_test.groupby(['id','id2']).agg(lambda x: x.tolist()).reset_index()
# Pair each userid with the one that follows it
test2 = list(zip(df_test.userid[:-1], df_test.userid[1:]))
zipped_list = test2[:]
list(test2)
-> In addition, my next step will be finding duplicates among these tuples and extracting them into a new list, so for this list of tuples:
[('us1', 'us2'),
('us2', 'us1'),
('us1', 'us2'),
('us2', 'us4'),
('us4', 'us4'),
('us4', 'us5'),
('us5', 'us1'),
('us1', 'us2')]
The result should be the list [('us1', 'us2'), 3],
because it is the only tuple that is duplicated, and the 3 says that it appears 3 times.
So I cannot find my error in generating the tuples in the order I want, and I have no idea how to find the duplicates.
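Let us use frozenset + value_counts. Note that frozenset ignores the order within each pair, so ('us1', 'us2') and ('us2', 'us1') are counted together: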
pd.Series(list(map(frozenset,zipped_list))).value_counts()
(us2, us1) 3
(us1, us4) 2
(us2, us5) 1
(us5, us4) 1
(us2, us4) 1
dtype: int64
If you only need the list of pairs with the order inside each pair ignored:
l=list(map(frozenset,zipped_list))
Or we can use numpy to sort the elements within each pair:
np.sort(zipped_list,axis=1).tolist()
[['us1', 'us2'], ['us1', 'us2'], ['us1', 'us4'], ['us2', 'us4'], ['us2', 'us5'], ['us4', 'us5'], ['us1', 'us4'], ['us1', 'us2']]
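Both approaches above treat ('us1', 'us2') and ('us2', 'us1') as the same pair. If the order inside each tuple matters, as the expected result in the question suggests, one possible sketch uses collections.Counter from the standard library. The names pairs, counts and duplicates are illustrative only, and the pairs are built directly from the plain userid list rather than from the DataFrame:

```python
from collections import Counter

userid = ["us1", "us2", "us1", "us2", "us4", "us4", "us5", "us1", "us2"]

# Consecutive pairs in list order: (userid[0], userid[1]), (userid[1], userid[2]), ...
pairs = list(zip(userid[:-1], userid[1:]))

# Count every ordered pair, then keep only those seen more than once.
counts = Counter(pairs)
duplicates = [(pair, n) for pair, n in counts.items() if n > 1]

print(duplicates)  # [(('us1', 'us2'), 3)]
```

Here ('us2', 'us1') is counted separately and appears only once, so it is not reported as a duplicate.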
Update: you call sort_values first, so we need sort_index to get the original order back:
list(zip(df_test.userid[:-1].sort_index(), df_test.userid[1:].sort_index()))
[('us1', 'us2'), ('us2', 'us1'), ('us1', 'us2'), ('us2', 'us4'), ('us4', 'us4'), ('us4', 'us5'), ('us5', 'us1'), ('us1', 'us2')]
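Slicing with [:-1] and [1:] is positional, so in the line above it drops the last and first rows in time order before sort_index runs. That gives the right result here because those rows also happen to be the last and first rows of the original index. A variant that does not rely on this, assuming the original row order is what you want, restores the index order first and slices afterwards (the variable names are illustrative):

```python
import pandas as pd

d = {
    "userid": ["us1", "us2", "us1", "us2", "us4", "us4", "us5", "us1", "us2"],
    "time": [1, 2, 3, 5, 4, 7, 6, 8, 9],
}
df_test = pd.DataFrame(data=d).sort_values("time")

# Restore the original row order, then convert to a plain list.
userid = df_test["userid"].sort_index().tolist()

# Pair each element with the one that follows it.
pairs = list(zip(userid[:-1], userid[1:]))
print(pairs)
```

If the pairs should follow the time order instead, drop the sort_index call; the output is then the second list shown in the question.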