I would like to group the row together based on the common value in any column.
I have the table that look like this
index | email | phone | UserID
---|---|---|---
1 | abc@gmail.com | 123456 | 1
2 | def@gmail.com | NaN | 2
3 | NaN | 123456 | NaN
4 | def@gmail.com | 987654 | NaN
5 | NaN | NaN | 1
How can I group together indexes 1, 3, and 5 (index 1 and 3 share a phone number, and index 1 and 5 share a UserID)?
index | email | phone | UserID
---|---|---|---
1 | abc@gmail.com | 123456 | 1
3 | NaN | 123456 | NaN
5 | NaN | NaN | 1
and group together indexes 2 and 4 (because index 2 and 4 share an email)?
index | email | phone | UserID
---|---|---|---
2 | def@gmail.com | NaN | 2
4 | def@gmail.com | 987654 | NaN
Thank you.
Since you wish to keep working in the same dataframe, and because the different types of groups may overlap, I suggest creating two extra columns with numbered groups:
df['email_groups'] = df.groupby(df.email).ngroup()
df['phone_groups'] = df.groupby(df.phone).ngroup()
Result:
 | index | email | phone | UserID | email_groups | phone_groups
---|---|---|---|---|---|---
0 | 1 | abc@gmail.com | 123456 | 1 | 0 | 0 |
1 | 2 | def@gmail.com | nan | 2 | 1 | -1 |
2 | 3 | nan | 123456 | nan | -1 | 0 |
3 | 4 | def@gmail.com | 987654 | nan | 1 | 1 |
4 | 5 | nan | nan | 1 | -1 | -1 |
Note that empty values are assigned group -1. You can count the group sizes with, for example, df['phone_groups'].value_counts(), and filter by group number, etc.
I am not sure an elegant pandas-only solution exists. Here we create a couple of helper functions first, then apply them to the df. The main idea is to keep a dictionary that assigns group ids to tuples (email, phone, UserID), reusing an existing id when there is a partial match in any of the fields.
First we load the data
import pandas as pd
import numpy as np
from io import StringIO
data = StringIO(
"""
index email phone UserID
1 abc@gmail.com 123456 1
2 def@gmail.com NaN 2
3 NaN 123456 NaN
4 def@gmail.com 987654 NaN
5 NaN NaN 1
""")
df = pd.read_csv(data, sep=r"\s+")  # delim_whitespace=True is deprecated in recent pandas
Next we define the partial_match function and test it:
def partial_match(key1, key2):
    '''
    Return True if any of the elements of key1 and key2 match
    '''
    for f1, f2 in zip(key1, key2):
        if f1 == f2:
            return True
    return False
# a bit of testing
print(partial_match(('abc@gmail.com', 123456.0, 0.0), (np.nan, 123456.0, np.nan)))  # True
print(partial_match(('abc@gmail.com', 123456.0, 0.0), ('def@gmail.com', np.nan, 2.0)))  # False
Next we define a global dictionary where we would keep group ids and a function to update it, with a bit of testing
# global dictionary of group ids
groups = {}
def assign_group(key):
    '''
    Assign a group number to a new key: an existing one if there is
    a partial match, otherwise a new one. Return the group number for the key.
    '''
    # first element is assigned 0
    if len(groups) == 0:
        groups[key] = 0
        return groups[key]
    # see if we already have a partial match
    for k in groups:
        if partial_match(k, key):
            groups[key] = groups[k]
            return groups[key]
    # no match -- new group
    groups[key] = max(groups.values()) + 1
    return groups[key]
# a bit of testing
assign_group(('abc@gmail.com', 123456.0, 0.0))
assign_group((np.nan, 123456.0, np.nan))
assign_group(('def@gmail.com', np.nan, 2.0))
print(groups)
Testing returns
{('abc@gmail.com', 123456.0, 0.0): 0, (nan, 123456.0, nan): 0, ('def@gmail.com', nan, 2.0): 1}
Now ready for the main act. We apply assign_group
to each row in turn, recording the result in df['group_id']
# populate 'groups' with the data from the df, and add the group id to the df
groups = {}
df['group_id'] = df.apply(lambda row: assign_group((row['email'], row['phone'], row['UserID'])), axis=1)
df
and we get this
index email phone UserID group_id
-- ------- ------------- ------- -------- ----------
0 1 abc@gmail.com 123456 1 0
1 2 def@gmail.com nan 2 1
2 3 nan 123456 nan 0
3 4 def@gmail.com 987654 nan 1
4 5 nan nan 1 0
Now you can group on group_id, e.g.:
df.groupby('group_id').count()
returns
index email phone UserID
group_id
0 3 1 2 2
1 2 2 1 1
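One caveat of the dictionary approach: assign_group merges a new row into the first matching group it finds, so if a row partially matched two different existing groups, those two groups would not be merged with each other. When that can happen, treating rows as nodes and taking connected components is more robust. Below is a sketch using a small union-find helper; group_rows is an illustrative name, not part of either answer above:

```python
import numpy as np
import pandas as pd

def group_rows(df, cols):
    """Assign one group id per connected component: rows sharing any
    non-NaN value in any of `cols` end up in the same group."""
    parent = list(range(len(df)))

    def find(i):
        # path-halving find
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # link every row to the first row seen with the same value in each column
    for col in cols:
        first_seen = {}
        for i, v in enumerate(df[col]):
            if pd.isna(v):
                continue
            if v in first_seen:
                union(i, first_seen[v])
            else:
                first_seen[v] = i

    roots = [find(i) for i in range(len(df))]
    # renumber roots to consecutive group ids in order of first appearance
    ids = {r: n for n, r in enumerate(dict.fromkeys(roots))}
    return [ids[r] for r in roots]

df = pd.DataFrame({
    'email': ['abc@gmail.com', 'def@gmail.com', np.nan, 'def@gmail.com', np.nan],
    'phone': [123456, np.nan, 123456, 987654, np.nan],
    'UserID': [1, 2, np.nan, np.nan, 1],
})
df['group_id'] = group_rows(df, ['email', 'phone', 'UserID'])
print(df)
```

On the question's data this yields the same two groups as above (rows 1, 3, 5 and rows 2, 4), but it also handles the case where one row bridges two previously separate groups.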