In pandas, how do I group rows together if any value in a column (or subset of columns) is shared?

Question

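I want to group rows together based on a value they share in any column.

I have a table that looks like this:

index	email	phone	UserID
1	abc@gmail.com	123456	1
2	def@gmail.com	NaN	2
3	NaN	123456	NaN
4	def@gmail.com	987654	NaN
5	NaN	NaN	1

How can I group indexes 1, 3, and 5 together (because indexes 1 and 3 share a phone number, and indexes 1 and 5 share a UserID)

index	email	phone	UserID
1	abc@gmail.com	123456	1
3	NaN	123456	NaN
5	NaN	NaN	1

and group indexes 2 and 4 together (because indexes 2 and 4 share an email)?

index	email	phone	UserID
2	def@gmail.com	NaN	2
4	def@gmail.com	987654	NaN

Thank you.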

Answer 1

Since you want to keep working in the same dataframe, and since the group types may overlap, I suggest creating two extra columns with numbered groups:

df['email_groups'] = df.groupby(df.email).ngroup()
df['phone_groups'] = df.groupby(df.phone).ngroup()

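Result:

	index	email	phone	UserID	email_groups	phone_groups
0	1	abc@gmail.com	123456	1	0	0
1	2	def@gmail.com	NaN	2	1	-1
2	3	NaN	123456	NaN	-1	0
3	4	def@gmail.com	987654	NaN	1	1
4	5	NaN	NaN	1	-1	-1

Note that null values are labelled -1 here (depending on your pandas version, they may show up as NaN instead). You can use, for example, df['phone_groups'].value_counts() to count group sizes, filter by group number, and so on.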
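As a rough illustration of the filtering mentioned above, here is a minimal sketch that rebuilds a small frame by hand (the data is copied from the question; the variable names same_email and email_group_rows are just for this example) and picks out the rows that share the email group of one particular row. It avoids relying on how missing keys are labelled.

```python
import pandas as pd

# Small frame mirroring the question's email and phone columns
df = pd.DataFrame({
    "email": ["abc@gmail.com", "def@gmail.com", None, "def@gmail.com", None],
    "phone": [123456, None, 123456, 987654, None],
})

# Number the groups formed by identical email values
df["email_groups"] = df.groupby("email").ngroup()

# Keep only the rows whose email group matches that of the row labelled 1
target_group = df.loc[1, "email_groups"]
same_email = df[df["email_groups"] == target_group]
email_group_rows = same_email.index.tolist()
print(email_group_rows)
```

Keep in mind that this only groups by one column at a time; it does not link a row that shares an email with one row and a phone number with another.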

Answer 2

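I'm not sure there is an elegant pandas-only solution. Here we first create a couple of helper functions and then apply them to the df. The main idea is to keep a dictionary that tracks the group ID we assign to each (email, phone, UserID) tuple, based on a partial match in any of the fields.

First we load the data: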

import pandas as pd
import numpy as np
from io import StringIO
data = StringIO(
"""
index   email   phone   UserID
1   abc@gmail.com   123456  1
2   def@gmail.com   NaN 2
3   NaN 123456  NaN
4   def@gmail.com   987654  NaN
5   NaN NaN 1
""")
# sep=r'\s+' splits on any run of whitespace (delim_whitespace is deprecated in recent pandas)
df = pd.read_csv(data, sep=r'\s+')

Next we define the partial_match function and test it:

def partial_match(key1, key2):
    ''' 
    Return True if any of the elements of key1 and key2 match
    '''
    for f1, f2 in zip(key1, key2):
        if f1 == f2:
            return True
    return False

# a bit of testing (np.nan rather than np.NaN, which NumPy 2.0 removed)
print(partial_match(('abc@gmail.com', 123456.0, .0), (np.nan, 123456.0, np.nan)))  # True
print(partial_match(('abc@gmail.com', 123456.0, .0), ('def@gmail.com', np.nan, 2.0)))  # False
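It may help to see why missing fields never count as a match in partial_match: NaN does not compare equal to anything, including itself. The snippet below is only a quick check of that behaviour; the variable names are made up for this example.

```python
import numpy as np

# NaN is never equal to anything, not even to itself
nan_equals_nan = (np.nan == np.nan)

# So two rows that are both missing the first field are not linked by it,
# and here the second field differs as well
both_missing = any(f1 == f2 for f1, f2 in zip((np.nan, "x"), (np.nan, "y")))

print(nan_equals_nan, both_missing)
```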

Next we define a global dictionary in which we keep the group IDs, along with a function to update it, and run a few tests:

# global dictionary of group ids
groups = {}

def assign_group(key):
    '''
    Assign a group number to a new key, either existing if there is a partial match
    or a new one. Also return the group number for the key
    '''

    # first element is assigned 0
    if len(groups) == 0:
        groups[key] = 0
        return groups[key]

    # see if we already have a partial match
    for k in groups:
        if partial_match(k,key):
            groups[key] = groups[k]
            return groups[key]

    # no match -- new group
    groups[key] = max(groups.values())+1
    return groups[key]


# a bit of testing
assign_group(('abc@gmail.com', 123456.0, .0))
assign_group((np.nan, 123456.0, np.nan))
assign_group(('def@gmail.com', np.nan, 2.0))
print(groups)

The test returns:

{('abc@gmail.com', 123456.0, 0.0): 0, (nan, 123456.0, nan): 0, ('def@gmail.com', nan, 2.0): 1}

Now we are ready for the main act. We apply assign_group to each row in turn, recording the result in df['group_id']:

# populate 'groups' with the data from the df, and add the group id to the df
groups = {}
df['group_id'] = df.apply(
    lambda row: assign_group((row['email'], row['phone'], row['UserID'])),
    axis=1,
)
df

We get this:

      index  email            phone    UserID    group_id
--  -------  -------------  -------  --------  ----------
 0        1  abc@gmail.com   123456         1           0
 1        2  def@gmail.com      nan         2           1
 2        3  nan             123456       nan           0
 3        4  def@gmail.com   987654       nan           1
 4        5  nan                nan         1           0

Now you can group on group_id, for example:

df.groupby('group_id').count()

which returns:

    index   email   phone   UserID
group_id                
0   3       1       2       2
1   2       2       1       1
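One caveat with assign_group: it stops at the first partial match, so a later row that links two groups which were previously separate (say, it shares an email with one group and a phone number with another) will not merge them. If that matters for your data, a union-find pass is one way to handle it. The following is only a sketch under that assumption, not part of the answer above; group_rows and its helpers are hypothetical names, and the column names are taken from the question.

```python
import numpy as np
import pandas as pd


def group_rows(df, cols):
    """Label rows so that rows sharing any non-null value in `cols` get the same id."""
    parent = list(range(len(df)))  # union-find forest over row positions

    def find(i):
        # Walk up to the root, shortening the path as we go
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller position as the root
            parent[max(ri, rj)] = min(ri, rj)

    for col in cols:
        first_seen = {}  # value -> position of the first row holding it
        for pos, value in enumerate(df[col]):
            if pd.isna(value):
                continue  # missing values never link rows
            if value in first_seen:
                union(first_seen[value], pos)
            else:
                first_seen[value] = pos

    roots = np.array([find(i) for i in range(len(df))])
    # Renumber the roots as 0, 1, 2, ... in order of first appearance
    codes, _ = pd.factorize(roots)
    return pd.Series(codes, index=df.index)


df = pd.DataFrame({
    "email": ["abc@gmail.com", "def@gmail.com", np.nan, "def@gmail.com", np.nan],
    "phone": [123456, np.nan, 123456, 987654, np.nan],
    "UserID": [1, 2, np.nan, np.nan, 1],
}, index=[1, 2, 3, 4, 5])

df["group_id"] = group_rows(df, ["email", "phone", "UserID"])
group_ids = df["group_id"].tolist()
print(group_ids)
```

On the question's data this should give the same grouping as the dictionary approach (indexes 1, 3, 5 together and 2, 4 together); the difference only shows up when a row bridges two existing groups.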
