
In pandas, how do I group rows together if any value in the columns (or a subset of columns) is common?

I would like to group rows together based on a common value in any column.

I have a table that looks like this:

index  email          phone   UserID
1      abc@gmail.com  123456  1
2      def@gmail.com  NaN     2
3      NaN            123456  NaN
4      def@gmail.com  987654  NaN
5      NaN            NaN     1

How can I group together indexes 1, 3, and 5 (because indexes 1 and 3 share a phone number, and indexes 1 and 5 share a UserID)?

index  email          phone   UserID
1      abc@gmail.com  123456  1
3      NaN            123456  NaN
5      NaN            NaN     1

and group together indexes 2 and 4 (because they share an email)?

index  email          phone   UserID
2      def@gmail.com  NaN     2
4      def@gmail.com  987654  NaN

Thank you.

Since you wish to keep working in the same dataframe, and because there is the possibility of overlap between the types of groups, I suggest creating two extra columns with numbered groups:

df['email_groups'] = df.groupby(df.email).ngroup()
df['phone_groups'] = df.groupby(df.phone).ngroup()

Result:

   index  email          phone   UserID  email_groups  phone_groups
0  1      abc@gmail.com  123456  1        0             0
1  2      def@gmail.com  nan     2        1            -1
2  3      nan            123456  nan     -1             0
3  4      def@gmail.com  987654  nan      1             1
4  5      nan            nan     1       -1            -1

Note that empty values will be classified as -1. You can count the group sizes with, for example, df['phone_groups'].value_counts(), filter by group number, etc.

I am not sure an elegant pandas-only solution exists. Here we create a couple of helper functions first, then apply them to the df. The main idea is to keep a dictionary that tracks the group ids we assign to tuples (email, phone, UserID), based on a partial match in any of the fields.

First we load the data:

import pandas as pd
import numpy as np
from io import StringIO
data = StringIO(
"""
index   email   phone   UserID
1   abc@gmail.com   123456  1
2   def@gmail.com   NaN 2
3   NaN 123456  NaN
4   def@gmail.com   987654  NaN
5   NaN NaN 1
""")
df = pd.read_csv(data, sep=r"\s+")  # delim_whitespace=True is deprecated in newer pandas

Next we define the partial_match function and test it:

def partial_match(key1, key2):
    ''' 
    Return True if any of the elements of key1 and key2 match
    '''
    for f1, f2 in zip(key1, key2):
        if f1 == f2:
            return True
    return False

# a bit of testing
print(partial_match(('abc@gmail.com',123456.0,.0),(np.nan,123456.0,np.nan))) # True
print(partial_match(('abc@gmail.com',123456.0,.0),('def@gmail.com', np.nan, 2.0))) # False
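A detail worth noting (shown here as a standalone sketch, with partial_match redefined so the snippet runs on its own): NaN never compares equal to anything, including itself, so two rows that are both missing a field do not count as a match on that field.

```python
import numpy as np

def partial_match(key1, key2):
    """True if any corresponding elements of key1 and key2 are equal."""
    return any(f1 == f2 for f1, f2 in zip(key1, key2))

# NaN != NaN, so two missing fields never create a spurious match
print(np.nan == np.nan)                                            # False
print(partial_match((np.nan, np.nan, 2.0), (np.nan, np.nan, 1.0)))  # False
```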

Next we define a global dictionary where we keep the group ids, and a function to update it, with a bit of testing:

# global dictionary of group ids
groups = {}

def assign_group(key):
    '''
    Assign a group number to a new key, either existing if there is a partial match
    or a new one. Also return the group number for the key
    '''

    # first element is assigned 0
    if len(groups) == 0:
        groups[key] = 0
        return groups[key]

    # see if we already have a partial match
    for k in groups:
        if partial_match(k,key):
            groups[key] = groups[k]
            return groups[key]

    # no match -- new group
    groups[key] = max(groups.values())+1
    return groups[key]


# a bit of testing
assign_group(('abc@gmail.com',123456.0,.0))
assign_group((np.nan,123456.0,np.nan))
assign_group(('def@gmail.com', np.nan, 2.0))
print(groups)

Testing returns

{('abc@gmail.com', 123456.0, 0.0): 0, (nan, 123456.0, nan): 0, ('def@gmail.com', nan, 2.0): 1}
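One caveat with this approach, illustrated below as a self-contained sketch (the functions are redefined and the two-field keys are made up for the example): if a new key happens to match two previously separate groups, assign_group only reuses the id of the first match it finds; the two existing groups are not merged afterwards. The question's data does not trigger this, but on other inputs row order can affect the result.

```python
import numpy as np

groups = {}

def partial_match(key1, key2):
    """True if any corresponding elements of key1 and key2 are equal."""
    return any(f1 == f2 for f1, f2 in zip(key1, key2))

def assign_group(key):
    """Assign an existing group id on partial match, else a new one."""
    if len(groups) == 0:
        groups[key] = 0
        return groups[key]
    for k in groups:
        if partial_match(k, key):
            groups[key] = groups[k]
            return groups[key]
    groups[key] = max(groups.values()) + 1
    return groups[key]

assign_group(("a@x.com", np.nan))  # -> group 0
assign_group((np.nan, 111.0))      # -> group 1 (no overlap with group 0)
assign_group(("a@x.com", 111.0))   # matches both, but only inherits group 0
print(groups)
```

Group 1 is left dangling even though the third key bridges groups 0 and 1; a full transitive grouping would need a union-find (or graph connected-components) pass instead.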

Now we are ready for the main act. We apply assign_group to each row in turn, recording the result in df['group_id']:

# populate 'groups' with the data from the df, and add the group id to the df
groups = {}
df['group_id'] = df.apply(lambda row: assign_group((row['email'], row['phone'], row['UserID'])), axis=1)
df

and we get this:

      index  email            phone    UserID    group_id
--  -------  -------------  -------  --------  ----------
 0        1  abc@gmail.com   123456         1           0
 1        2  def@gmail.com      nan         2           1
 2        3  nan             123456       nan           0
 3        4  def@gmail.com   987654       nan           1
 4        5  nan                nan         1           0

Now you can group on group_id, e.g.:

df.groupby('group_id').count()

returns

          index  email  phone  UserID
group_id
0         3      1      2      2
1         2      2      1      1
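To recover the actual row groups from the question (rather than just counts), you can iterate over the groupby. A self-contained sketch, with the group_id column hard-coded to match the result computed above:

```python
import numpy as np
import pandas as pd

# Rebuild the final result table (group_id as computed above)
df = pd.DataFrame({
    "index": [1, 2, 3, 4, 5],
    "email": ["abc@gmail.com", "def@gmail.com", np.nan, "def@gmail.com", np.nan],
    "phone": [123456, np.nan, 123456, 987654, np.nan],
    "UserID": [1, 2, np.nan, np.nan, 1],
    "group_id": [0, 1, 0, 1, 0],
})

# Each sub-DataFrame holds the rows linked by a chain of common values
row_groups = {gid: sub["index"].tolist() for gid, sub in df.groupby("group_id")}
print(row_groups)  # {0: [1, 3, 5], 1: [2, 4]}
```

This reproduces exactly the two groups the question asked for: rows 1, 3, 5 and rows 2, 4.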
