简体   繁体   中英

How to combine rows with same list of unique elements in strings separated by delimiter?

I have a python data frame with values in a string separated by delimiter and a corresponding column with numeric value. I want to combine the rows with same list of unique elements in the string irrespective of their sequence and aggregate the corresponding column to that row

The data frame appears like this:

df = pd.DataFrame([['Organic Search, Direct', 171], ['Organic Search, Direct, Not Found', 3], ['Direct, Organic Search', 389], ['Direct, Organic Search, Not Found', 8]], columns=['A', 'B'])

df

And I want the result to appear like this:

df = pd.DataFrame([['Direct, Organic Search',560], ['Direct, Organic Search, Not Found',11]], columns = ['A','B'])

Initially I thought of using a for loop to traverse through each row of the data frame and store the list of elements in an array alphabetically, then combining the rows based on that processed row.

However, I couldn't formulate this into a code.

You can easily do this. Example:

import pandas as pd

df = pd.DataFrame([['Organic Search, Direct', 171], ['Organic Search, Direct, Not Found', 3], ['Direct, Organic Search', 389], ['Direct, Organic Search, Not Found', 8]], columns=['A', 'B'])

print(df)
Out:
      A                                 B
0   Organic Search, Direct             171
1   Organic Search, Direct, Not Found   3
2   Direct, Organic Search             389
3   Direct, Organic Search, Not Found   8
def foo(x):
    x = x.split(', ')
    x.sort()
    return tuple(x)

df.groupby(df.A.apply(foo))['B'].sum()
out:
               A
(Direct, Not Found, Organic Search)     11
(Direct, Organic Search)               560

I can't take credit for this, but I'm not sure how to present it outside of a separate answer.

I am also new at Python. Using padu's logic code , I've adapted it to your original post's data.

I also substituted ['B'] with .reset_index() to keep the original columns, as the sums were previously merged with the newly grouped values in a single column. I added also added numeric_only=True because without specifying a column to sum, it would normally assume to stick with numeric columns, but the assumption is now a deprecated feature, as per this message I got:

The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function. df = df.groupby(df['MCF Channel Grouping Path_1'].apply(sort_groupings)).sum().reset_index()

You may also want to add df = at the beginning of the last line, as below, to keep the result in the original variable name. Missing this step caused me to lose lots of time while learning to debug.

import pandas as pd

df = pd.DataFrame([
['Direct, Organic Search',171],
['Direct, Organic Search, Not Found',4],
['Direct, Organic Search, Not Found, Referral',12],
['Direct, Organic Search, Not Found, Referral, Email',2],
['Organic Search, Direct',389],
['Organic Search, Direct, Not Found',3],
['Organic Search, Direct, Referral',76],
['Organic Search, Direct, Referral, Social Network',1]],
columns =['MCF Channel Grouping Path_1', 'Conversions'])

def sort_groupings(row_group):
    row_group = row_group.split(', ') # Converts string into list using ', ' as the separator       
    row_group.sort()                  # Sorts newly, made list in ascending order
    return tuple(row_group)           # Converts list to a tuple (required by the 'apply' function), and returns it

df = df.groupby(df['MCF Channel Grouping Path_1'].apply(sort_groupings)).sum(numeric_only=True).reset_index()

# groupby: column in which to group common values
# apply: call a function to run on values specified
# reset_index: reinstates the first column in the final df

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM