
In pandas, how do I group rows together if any value in the columns (or a subset of columns) is common?

I would like to group rows together based on a common value in any column.

I have a table that looks like this:

index  email          phone   UserID
1      abc@gmail.com  123456  1
2      def@gmail.com  NaN     2
3      NaN            123456  NaN
4      def@gmail.com  987654  NaN
5      NaN            NaN     1

How can I group together indexes 1, 3, and 5 (because indexes 1 and 3 share a phone number, and indexes 1 and 5 share a UserID)?

index  email          phone   UserID
1      abc@gmail.com  123456  1
3      NaN            123456  NaN
5      NaN            NaN     1

and group together indexes 2 and 4 (because they share an email)?

index  email          phone   UserID
2      def@gmail.com  NaN     2
4      def@gmail.com  987654  NaN

Thank you.

Since you wish to keep working in the same dataframe, and because there is the possibility of overlap between the types of groups, I suggest creating two extra columns with numbered groups:

df['email_groups'] = df.groupby(df.email).ngroup()
df['phone_groups'] = df.groupby(df.phone).ngroup()

Result:

   index  email          phone   UserID  email_groups  phone_groups
0  1      abc@gmail.com  123456  1        0             0
1  2      def@gmail.com  nan     2        1            -1
2  3      nan            123456  nan     -1             0
3  4      def@gmail.com  987654  nan      1             1
4  5      nan            nan     1       -1            -1

Note that empty values will be classified as -1. You can count the group sizes with, for example, df['phone_groups'].value_counts(), filter by group number, etc.

I am not sure an elegant pandas-only solution exists. Here we create a couple of helper functions first, then apply them to the df. The main idea is to keep a dictionary that tracks the group ids we assign to tuples (email, phone, UserID), based on a partial match in any of the fields.

First we load the data:

import pandas as pd
import numpy as np
from io import StringIO
data = StringIO(
"""
index   email   phone   UserID
1   abc@gmail.com   123456  1
2   def@gmail.com   NaN 2
3   NaN 123456  NaN
4   def@gmail.com   987654  NaN
5   NaN NaN 1
""")
df = pd.read_csv(data, sep=r"\s+")  # delim_whitespace=True is deprecated in newer pandas

Next we define the partial_match function and test it:

def partial_match(key1, key2):
    ''' 
    Return True if any of the elements of key1 and key2 match
    '''
    for f1, f2 in zip(key1, key2):
        if f1 == f2:
            return True
    return False

# a bit of testing
print(partial_match(('abc@gmail.com',123456.0,.0),(np.nan,123456.0,np.nan))) # True
print(partial_match(('abc@gmail.com',123456.0,.0),('def@gmail.com', np.nan, 2.0))) # False
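A detail worth noting (shown here as a standalone sketch, with partial_match redefined so the snippet runs on its own): NaN never compares equal to anything, including itself, so two rows that are both missing a field do not count as a match on that field.

```python
import numpy as np

def partial_match(key1, key2):
    """True if any corresponding elements of key1 and key2 are equal."""
    return any(f1 == f2 for f1, f2 in zip(key1, key2))

# NaN != NaN, so two missing fields never create a spurious match
print(np.nan == np.nan)                                            # False
print(partial_match((np.nan, np.nan, 2.0), (np.nan, np.nan, 1.0)))  # False
```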

Next we define a global dictionary where we keep the group ids, and a function to update it, with a bit of testing:

# global dictionary of group ids
groups = {}

def assign_group(key):
    '''
    Assign a group number to a new key, either existing if there is a partial match
    or a new one. Also return the group number for the key
    '''

    # first element is assigned 0
    if len(groups) == 0:
        groups[key] = 0
        return groups[key]

    # see if we already have a partial match
    for k in groups:
        if partial_match(k,key):
            groups[key] = groups[k]
            return groups[key]

    # no match -- new group
    groups[key] = max(groups.values())+1
    return groups[key]


# a bit of testing
assign_group(('abc@gmail.com',123456.0,.0))
assign_group((np.nan,123456.0,np.nan))
assign_group(('def@gmail.com', np.nan, 2.0))
print(groups)

Testing returns

{('abc@gmail.com', 123456.0, 0.0): 0, (nan, 123456.0, nan): 0, ('def@gmail.com', nan, 2.0): 1}
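One caveat with this approach, illustrated below as a self-contained sketch (the functions are redefined and the two-field keys are made up for the example): if a new key happens to match two previously separate groups, assign_group only reuses the id of the first match it finds; the two existing groups are not merged afterwards. The question's data does not trigger this, but on other inputs row order can affect the result.

```python
import numpy as np

groups = {}

def partial_match(key1, key2):
    """True if any corresponding elements of key1 and key2 are equal."""
    return any(f1 == f2 for f1, f2 in zip(key1, key2))

def assign_group(key):
    """Assign an existing group id on partial match, else a new one."""
    if len(groups) == 0:
        groups[key] = 0
        return groups[key]
    for k in groups:
        if partial_match(k, key):
            groups[key] = groups[k]
            return groups[key]
    groups[key] = max(groups.values()) + 1
    return groups[key]

assign_group(("a@x.com", np.nan))  # -> group 0
assign_group((np.nan, 111.0))      # -> group 1 (no overlap with group 0)
assign_group(("a@x.com", 111.0))   # matches both, but only inherits group 0
print(groups)
```

Group 1 is left dangling even though the third key bridges groups 0 and 1; a full transitive grouping would need a union-find (or graph connected-components) pass instead.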

Now we are ready for the main act. We apply assign_group to each row in turn, recording the result in df['group_id']:

# populate 'groups' with the data from the df, and add the group id to the df
groups = {}
df['group_id'] = df.apply(lambda row: assign_group((row['email'], row['phone'], row['UserID'])), axis=1)
df

and we get this:

      index  email            phone    UserID    group_id
--  -------  -------------  -------  --------  ----------
 0        1  abc@gmail.com   123456         1           0
 1        2  def@gmail.com      nan         2           1
 2        3  nan             123456       nan           0
 3        4  def@gmail.com   987654       nan           1
 4        5  nan                nan         1           0

Now you can group on group_id, e.g.:

df.groupby('group_id').count()

returns

          index  email  phone  UserID
group_id
0         3      1      2      2
1         2      2      1      1
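To recover the actual row groups from the question (rather than just counts), you can iterate over the groupby. A self-contained sketch, with the group_id column hard-coded to match the result computed above:

```python
import numpy as np
import pandas as pd

# Rebuild the final result table (group_id as computed above)
df = pd.DataFrame({
    "index": [1, 2, 3, 4, 5],
    "email": ["abc@gmail.com", "def@gmail.com", np.nan, "def@gmail.com", np.nan],
    "phone": [123456, np.nan, 123456, 987654, np.nan],
    "UserID": [1, 2, np.nan, np.nan, 1],
    "group_id": [0, 1, 0, 1, 0],
})

# Each sub-DataFrame holds the rows linked by a chain of common values
row_groups = {gid: sub["index"].tolist() for gid, sub in df.groupby("group_id")}
print(row_groups)  # {0: [1, 3, 5], 1: [2, 4]}
```

This reproduces exactly the two groups the question asked for: rows 1, 3, 5 and rows 2, 4.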
