
Eliminating duplicated rows by dropping varying columns and aggregating remaining rows

I have a dataframe in which some rows are duplicated only because two of their columns hold different values.

df
[A]    [B]   [C]   [D]  [E]
123    X     Y     5    A
135    D     E     4    B
434    R     F     3    C
434    E     Z     5    C

In the above example, column [A] should have unique values and is my key for identifying duplicated rows. As shown, column [A] repeats at 434 because [B] and [C] contain different objects. As a result, the value 8 in column [D] is split into 3 and 5 across the two rows, and [E] is repeated. (How [D] is split depends on other factors that aren't important to this example.)

My goal is to drop the two columns causing the duplication and then aggregate columns [A], [D], and [E]. Is there a way I can use .groupby() and set rules for aggregating non-integer values (for column [E])? "Aggregate" is probably not the best word, as I'm simply taking the repeated value and bringing it up a level. For column [E], I'm thinking of a rule that outputs the first instance, since the value is the same in both rows.

I started off with the following method in mind: df.groupby('A').agg()

The example's output should show:

df_agg
[A]  [D]  [E]
123  5    A
135  4    B
434  8    C

This is as simple as a groupby + agg -

df.groupby('[A]', as_index=False).agg({'[D]' : 'sum', '[E]' : 'first'})

   [A]  [D] [E]
0  123    5   A
1  135    4   B
2  434    8   C
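For anyone who wants to reproduce this, here is a minimal, self-contained sketch. It assumes the column names literally include the square brackets, as in the answer above; the sample values are copied from the question, and the variable name df_agg is taken from the question's expected output.

```python
import pandas as pd

# Sample data copied from the question; column names include the brackets.
df = pd.DataFrame({
    "[A]": [123, 135, 434, 434],
    "[B]": ["X", "D", "R", "E"],
    "[C]": ["Y", "E", "F", "Z"],
    "[D]": [5, 4, 3, 5],
    "[E]": ["A", "B", "C", "C"],
})

# [B] and [C] are not listed in the dict, so they are absent from the result.
# [D] is summed per key and [E] keeps the first value seen for each key.
df_agg = df.groupby("[A]", as_index=False).agg({"[D]": "sum", "[E]": "first"})
print(df_agg)
```

Because only [D] and [E] appear in the dictionary, there is no need for a separate step to drop [B] and [C].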

If [A] is the index, then change the groupby syntax a bit -

df.groupby(level=0).agg({'[D]' : 'sum', '[E]' : 'first'})

     [D] [E]
[A]         
123    5   A
135    4   B
434    8   C

Use groupby with agg and a dictionary that defines how to aggregate each column.

df.groupby('[A]').agg({'[D]':'sum','[E]':'first'}).reset_index()

Output:

   [A]  [D] [E]
0  123    5   A
1  135    4   B
2  434    8   C

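Alternatively, aggregate every column according to its dtype (take the first value for object columns, sum the rest), then just select what you need from the result:

df.groupby('[A]',as_index=False).agg(lambda x : x.iloc[0] if x.dtype=='object' else x.sum())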
Out[275]: 
   [A] [B] [C]  [D] [E]
0  123   X   Y    5   A
1  135   D   E    4   B
2  434   R   F    8   C
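The "select what you need" step might look like the sketch below. The helper name sum_or_first is hypothetical, and the dtype check uses pandas.api.types.is_numeric_dtype rather than a comparison against 'object'; that is a substitution on my part, not what the answer above wrote.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Sample data copied from the question.
df = pd.DataFrame({
    "[A]": [123, 135, 434, 434],
    "[B]": ["X", "D", "R", "E"],
    "[C]": ["Y", "E", "F", "Z"],
    "[D]": [5, 4, 3, 5],
    "[E]": ["A", "B", "C", "C"],
})

def sum_or_first(col):
    # Hypothetical helper: sum numeric columns, otherwise keep the
    # first value in the group as a scalar.
    return col.sum() if is_numeric_dtype(col) else col.iloc[0]

result = df.groupby("[A]", as_index=False).agg(sum_or_first)

# Keep only the key and the columns of interest; [B] and [C] are discarded.
df_agg = result[["[A]", "[D]", "[E]"]]
print(df_agg)
```

Note that for [B] and [C] this keeps whichever value comes first in each group, so those columns are only meaningful if they are dropped afterwards, as done here.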


 