How to merge columns and delete duplicates but keep unique values?

I want to merge columns based on matching IDs and consolidate the rows into a single row per ID. Can anyone help me merge the columns for both duplicated and non-duplicated IDs?

Given:

ID      Name     Degree       AM_Class     PM_Class     Online_Class
01      Kathy    Biology      Bio101       NaN          NaN
01      Kathy    Biology      NaN          Chem101      NaN
02      James    Chemistry    NaN          Chem101      NaN
03      Henry    Business     Bus100       NaN          NaN
03      Henry    Business     NaN          Math100      NaN
03      Henry    Business     NaN          NaN          Acct100

Expected output:

ID      Name     Degree       AM_Class     PM_Class     Online_Class
01      Kathy    Biology      Bio101       Chem101      NaN
02      James    Chemistry    NaN          Chem101      NaN
03      Henry    Business     Bus100       Math100      Acct100

I tried to use:

df = df.groupby(['Name','Degree','ID'])['AM_Class', 'PM_Class', 'Online_Class'].apply(', '.join).reset_index()

but it gives an error.

Here is your data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['01', '01', '02', '03', '03', '03'],
                   'Degree': ['Biology', 'Biology', 'Chemistry', 'Business', 'Business', 'Business'],
                   'Name': ['Kathy', 'Kathy', 'James', 'Henry', 'Henry', 'Henry'],
                   'AM_Class': ['Bio101', np.nan, np.nan, 'Bus100', np.nan, np.nan],
                   'PM_Class': [np.nan, 'Chem101', 'Chem101', np.nan, 'Math100', np.nan],
                   'Online_Class': [np.nan, np.nan, np.nan, np.nan, np.nan, 'Acct100']})

You can separate the data frames, remove the NaN values, then rejoin them.

The reduce() function applies the merge iteratively across the list, so you do not have to merge the data frames one by one.

from functools import reduce

# Separate the data frames
df_student = df[['ID', 'Name', 'Degree']]
df_AM = df[['ID', 'Name', 'AM_Class']]
df_PM = df[['ID', 'Name', 'PM_Class']]
df_OL = df[['ID', 'Name', 'Online_Class']]

# List of data frames
dfs = [df_student, df_AM, df_PM, df_OL]

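# Remove all NaNs (build new frames instead of mutating slices of df in place,
# which also avoids overwriting the original df variable)
dfs = [frame.dropna() for frame in dfs]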

# Merge dataframes without the NaNs
df_merged = reduce(lambda left, right: pd.merge(left, right, how='left', on=['ID', 'Name']), dfs)


    ID  Name    Degree      AM_Class    PM_Class    Online_Class
0   01  Kathy   Biology     Bio101      Chem101     NaN
1   01  Kathy   Biology     Bio101      Chem101     NaN
2   02  James   Chemistry   NaN         Chem101     NaN
3   03  Henry   Business    Bus100      Math100     Acct100
4   03  Henry   Business    Bus100      Math100     Acct100
5   03  Henry   Business    Bus100      Math100     Acct100

Then you just need to remove the duplicates.

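# drop_duplicates(inplace=True) returns None, so assign the result instead of chaining on it
df_merged = df_merged.drop_duplicates().reset_index(drop=True)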

This is the result:

     ID Name    Degree      AM_Class    PM_Class    Online_Class
0    01 Kathy   Biology     Bio101      Chem101     NaN
1    02 James   Chemistry   NaN         Chem101     NaN
2    03 Henry   Business    Bus100      Math100     Acct100
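
For readers unfamiliar with reduce(), here is a minimal sketch of what it does with a list of frames: it is equivalent to nesting the merge calls by hand. The toy frames and the helper name left_merge are made up for illustration and are not part of the answer above.

```python
from functools import reduce

import pandas as pd

# Hypothetical toy frames, smaller than the question's data
base = pd.DataFrame({'ID': ['01', '02'], 'Name': ['Kathy', 'James']})
am = pd.DataFrame({'ID': ['01'], 'Name': ['Kathy'], 'AM_Class': ['Bio101']})
pm = pd.DataFrame({'ID': ['01', '02'], 'Name': ['Kathy', 'James'],
                   'PM_Class': ['Chem101', 'Chem101']})


def left_merge(left, right):
    """Left-join two frames on the shared key columns."""
    return pd.merge(left, right, how='left', on=['ID', 'Name'])


# reduce folds the list from the left: merge(merge(base, am), pm)
via_reduce = reduce(left_merge, [base, am, pm])

# The same result written out by hand
chained = left_merge(left_merge(base, am), pm)

print(via_reduce)
```

Because each step is a left join against the first frame, IDs with no match in a later frame keep a NaN in that column rather than being dropped.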

You can forward-fill the rows within each ID first, then drop duplicates while keeping the last occurrence:

df.groupby(['ID']).ffill().drop_duplicates(subset='Name', keep='last')
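
A slightly longer sketch of the same idea on the question's data follows. It assumes a recent pandas version, where GroupBy.ffill returns only the non-grouping columns, so the filled class columns are written back into a copy that still contains ID. The names class_cols, filled and result are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['01', '01', '02', '03', '03', '03'],
                   'Name': ['Kathy', 'Kathy', 'James', 'Henry', 'Henry', 'Henry'],
                   'Degree': ['Biology', 'Biology', 'Chemistry',
                              'Business', 'Business', 'Business'],
                   'AM_Class': ['Bio101', np.nan, np.nan, 'Bus100', np.nan, np.nan],
                   'PM_Class': [np.nan, 'Chem101', 'Chem101', np.nan, 'Math100', np.nan],
                   'Online_Class': [np.nan, np.nan, np.nan, np.nan, np.nan, 'Acct100']})

class_cols = ['AM_Class', 'PM_Class', 'Online_Class']

# Forward-fill within each ID so the last row of a group holds every value seen so far
filled = df.copy()
filled[class_cols] = df.groupby('ID')[class_cols].ffill()

# Keep only the last (most complete) row per ID
result = filled.drop_duplicates(subset='ID', keep='last').reset_index(drop=True)
print(result)
```

Note that forward-filling keeps only the latest value per column, so if one ID had two different classes in the same column, the earlier one would be lost.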

We can use pandas pivot_table for this problem. Your data looks like this:

>>> data = {'Name': ['Kathy','Kathy','James','Henry','Henry','Henry'],
...         'Degree': ['Biology','Biology','Chemistry','Business','Business','Business'],
...         'AM_Class': ['Bio101', np.nan, np.nan, 'Bus100', np.nan, np.nan],
...         'PM_Class': [np.nan, 'Chem101', 'Chem101', np.nan, 'Math100', np.nan],
...         'Online_Class': [np.nan, np.nan, np.nan, np.nan, np.nan, 'Acct100'],
...        }
>>> df = pd.DataFrame(data)

>>> print(df)

    Name     Degree AM_Class PM_Class Online_Class
0  Kathy    Biology   Bio101      NaN          NaN
1  Kathy    Biology      NaN  Chem101          NaN
2  James  Chemistry      NaN  Chem101          NaN
3  Henry   Business   Bus100      NaN          NaN
4  Henry   Business      NaN  Math100          NaN
5  Henry   Business      NaN      NaN      Acct100

First, we can replace every NaN with an empty string:

>>> df.fillna('', inplace=True)

>>> print(df)

    Name     Degree AM_Class PM_Class Online_Class
0  Kathy    Biology   Bio101
1  Kathy    Biology           Chem101
2  James  Chemistry           Chem101
3  Henry   Business   Bus100
4  Henry   Business           Math100
5  Henry   Business                        Acct100

I am doing this because, when calling pivot_table, I would like to use np.sum as the aggregation function, which concatenates the strings in each pandas Series. Leaving np.nan in place would raise an exception.
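
The underlying reason can be sketched in plain Python: np.nan is a float, and a string cannot be added to a float, whereas adding an empty string changes nothing. This only illustrates the type mismatch; how pandas evaluates np.sum internally may differ between versions. The variable names are illustrative.

```python
import numpy as np

# Summing strings is concatenation, so empty strings are harmless
parts_with_blanks = ['Bio101', '', '']
joined = ''.join(parts_with_blanks)
print(joined)

# np.nan is a float, so str + np.nan is a type error
try:
    'Bio101' + np.nan
    raised = False
except TypeError as exc:
    raised = True
    print('TypeError:', exc)
```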

Now let's make the pivot table, with Name as the group-by column.

>>> df2 = pd.pivot_table(data=df, index=['Name'], aggfunc={'Degree':np.unique, 'AM_Class':np.sum, 'PM_Class':np.sum, 'Online_Class':np.sum})

>>> print(df2)

      AM_Class     Degree Online_Class PM_Class
Name                                           
Henry   Bus100   Business      Acct100  Math100
James           Chemistry               Chem101
Kathy   Bio101    Biology               Chem101

We have to replace the empty strings with np.nan, since that is the format that was asked for.

>>> df2.replace('', np.nan, inplace=True)

>>> print(df2)

      AM_Class     Degree Online_Class PM_Class
Name                                           
Henry   Bus100   Business      Acct100  Math100
James      NaN  Chemistry          NaN  Chem101
Kathy   Bio101    Biology          NaN  Chem101

Looking at the new dataframe df2, we still have to make the following changes:

  • Since the Name column has become the index, we have to recreate a Name column
  • Add a RangeIndex
  • Restore the column order

>>> df2['Name'] = df2.index

>>> cols = [ 'Name', 'Degree', 'AM_Class',  'PM_Class', 'Online_Class']

>>> df2 = df2[cols]

>>> print(df2)

        Name     Degree AM_Class PM_Class Online_Class
Name                                                  
Henry  Henry   Business   Bus100  Math100      Acct100
James  James  Chemistry      NaN  Chem101          NaN
Kathy  Kathy    Biology   Bio101  Chem101          NaN

>>> df2 = df2.reset_index(drop=True)  # replaces the Name index with a RangeIndex of any length

>>> print(df2)

    Name     Degree AM_Class PM_Class Online_Class
0  Henry   Business   Bus100  Math100      Acct100
1  James  Chemistry      NaN  Chem101          NaN
2  Kathy    Biology   Bio101  Chem101          NaN
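
As a more compact alternative to the steps above (not the pivot_table method itself), the same fill, concatenate, and restore sequence can be sketched with groupby and ''.join, which avoids passing np.sum as the aggregation function. The names class_cols and compact are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Kathy', 'Kathy', 'James', 'Henry', 'Henry', 'Henry'],
    'Degree': ['Biology', 'Biology', 'Chemistry', 'Business', 'Business', 'Business'],
    'AM_Class': ['Bio101', np.nan, np.nan, 'Bus100', np.nan, np.nan],
    'PM_Class': [np.nan, 'Chem101', 'Chem101', np.nan, 'Math100', np.nan],
    'Online_Class': [np.nan, np.nan, np.nan, np.nan, np.nan, 'Acct100'],
})

class_cols = ['AM_Class', 'PM_Class', 'Online_Class']

compact = (
    df.fillna('')                                    # NaN -> '' so strings can be joined
      .groupby(['Name', 'Degree'], as_index=False)[class_cols]
      .agg(''.join)                                  # concatenate the strings per group
      .replace('', np.nan)                           # restore NaN where nothing was found
)
print(compact)
```

With as_index=False the group keys stay as ordinary columns, so no index or column-order repair is needed. Like the np.sum version, this joins values without a separator, so it assumes at most one class per column per person.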

If you need the first non-missing value per group, use GroupBy.first:

df = df.groupby(['ID','Name','Degree'], as_index=False).first()
print (df)
   ID   Name     Degree AM_Class PM_Class Online_Class
0  01  Kathy    Biology   Bio101  Chem101         None
1  02  James  Chemistry     None  Chem101         None
2  03  Henry   Business   Bus100  Math100      Acct100

Or, if you need all unique non-missing values per group, use a custom lambda function in GroupBy.agg to process each column separately: Series.dropna removes the missing values, dict.fromkeys removes duplicates while preserving order, and the remaining values are joined with ', ':

f = lambda x: ', '.join(dict.fromkeys(x.dropna()))
df = df.groupby(['ID','Name','Degree'], as_index=False).agg(f).replace('', np.nan)
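
To see what the lambda does to a single column, here is a small sketch on hypothetical values (the Series contents are made up for illustration). It also shows why the trailing .replace('', np.nan) is needed: a group with no values at all yields an empty string.

```python
import numpy as np
import pandas as pd

f = lambda x: ', '.join(dict.fromkeys(x.dropna()))

# Hypothetical column values for one group: a duplicate and a missing entry
s = pd.Series(['Math100', np.nan, 'Math200', 'Math100'])
combined = f(s)
print(repr(combined))

# A group where every value is missing collapses to an empty string
empty = f(pd.Series([np.nan, np.nan], dtype=object))
print(repr(empty))
```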

The difference between the two approaches is visible with changed data:

print (df)
   ID   Name     Degree AM_Class PM_Class Online_Class
0  01  Kathy    Biology   Bio101      NaN          NaN
1  01  Kathy    Biology      NaN  Chem101          NaN
2  02  James  Chemistry      NaN  Chem101          NaN
3  03  Henry   Business   Bus100      NaN          NaN
4  03  Henry   Business      NaN  Math100      Acct100
5  03  Henry   Business      NaN  Math200      Acct100

df1 = df.groupby(['ID','Name','Degree'], as_index=False).first()
print (df1)
   ID   Name     Degree AM_Class PM_Class Online_Class
0  01  Kathy    Biology   Bio101  Chem101         None
1  02  James  Chemistry     None  Chem101         None
2  03  Henry   Business   Bus100  Math100      Acct100


f = lambda x: ', '.join(dict.fromkeys(x.dropna()))
df2 = df.groupby(['ID','Name','Degree'], as_index=False).agg(f).replace('', np.nan)
print (df2)
   ID   Name     Degree AM_Class          PM_Class Online_Class
0  01  Kathy    Biology   Bio101           Chem101          NaN
1  02  James  Chemistry      NaN           Chem101          NaN
2  03  Henry   Business   Bus100  Math100, Math200      Acct100
