Group by multiple columns and get median of dict elements as a new column in pandas

Question

I have a dataframe that looks like below:

+-------+----------+-------------+-----------------------------------------------------+
| item  | category | subcategory |                     sales_count                     |
+-------+----------+-------------+-----------------------------------------------------+
| ItemA |        0 | p           | {store1:50,store2:70,store3:90,store4:44,store5:76} |
| ItemB |        0 | p           | {store2:22,store3:15,store4:77,store5:0}            |
| ItemC |        0 | p           | {store1:46,store2:13,store3:9,store4:87,store5:45}  |
| ItemD |        0 | q           | {store1:88,store2:16,store4:5,store5:2}             |
| ItemE |        0 | q           | {store1:7,store2:55}                                |
| ItemF |        1 | t           | {store3:25,store4:75,store5:87}                     |
| ItemG |        1 | t           | {store1:32,store3:66,store4:87,store5:0}            |
| ItemH |        1 | t           | {store1:54,store2:33,store3:12,store4:67,store5:8}  |
+-------+----------+-------------+-----------------------------------------------------+

I would like to generate a new column with the median of sales count across category and subcategory.

ie, 'median_across_group' value for itemA should be the median of all sales_count in category = 0 & subcategory = p.

How to achieve this groupby and median for dict elements?


+-------+----------+-------------+-----------------------------------------------------+---------------------------------------------+
| item  | category | subcategory |                     sales_count                     |             median_across_group             |
+-------+----------+-------------+-----------------------------------------------------+---------------------------------------------+
| ItemA |        0 | p           | {store1:50,store2:70,store3:90,store4:44,store5:76} | <median of category 0, subcategory p items> |
| ItemB |        0 | p           | {store2:22,store3:15,store4:77,store5:0}            | <median of category 0, subcategory p items> |
| ItemC |        0 | p           | {store1:46,store2:13,store3:9,store4:87,store5:45}  | <median of category 0, subcategory p items> |
| ItemD |        0 | q           | {store1:88,store2:16,store4:5,store5:2}             | <median of category 0, subcategory q items> |
| ItemE |        0 | q           | {store1:7,store2:55}                                | <median of category 0, subcategory q items> |
| ItemF |        1 | t           | {store3:25,store4:75,store5:87}                     | <median of category 1, subcategory t items> |
| ItemG |        1 | t           | {store1:32,store3:66,store4:87,store5:0}            | <median of category 1, subcategory t items> |
| ItemH |        1 | t           | {store1:54,store2:33,store3:12,store4:67,store5:8}  | <median of category 1, subcategory t items> |
+-------+----------+-------------+-----------------------------------------------------+---------------------------------------------+

Answer 1

You might want to try this out:

Code

import pandas as pd


df = pd.DataFrame({
    'item': ['ItemA', 'ItemB', 'ItemC', 'ItemD', 'ItemE', 'ItemF', 'ItemG', 'ItemH', ],
    'category': [0, 0, 0, 0, 0, 1, 1, 1],
    'subcategory': ['p', 'p', 'p', 'q', 'q', 't', 't', 't'],
    'sales_count': [
        {'store1':50,'store2':70,'store3':90,'store4':44,'store5':76},
        {'store2':22,'store3':15,'store4':77,'store5':0},
        {'store1':46,'store2':13,'store3':9,'store4':87,'store5':45},
        {'store1':88,'store2':16,'store4':5,'store5':2},
        {'store1':7,'store2':55},
        {'store3':25,'store4':75,'store5':87},
        {'store1':32,'store3':66,'store4':87,'store5':0},
        {'store1':54,'store2':33,'store3':12,'store4':67,'store5':8}
    ]
})

median = {}
for idx, row in df.iterrows():

    key_combo   = str(row['category']) + str(row['subcategory'])
    values_list = list(row['sales_count'].values())

    median[key_combo] = (
        values_list                         # Add the list if key not present
        if key_combo not in median else
        median[key_combo] + (values_list)   # Append the new list if key present
    )

for key, values in median.items():
    median[key] = sorted(values)[len(values) // 2]  # Calculate median and store in dict

def apply_median(x):
    return median[str(x.category) + str(x.subcategory)]

df['Median'] = df[['category', 'subcategory']].apply(apply_median, axis=1)

print(df)

Output

    item  category subcategory                                        sales_count  Median
0  ItemA         0           p  {'store1': 50, 'store2': 70, 'store3': 90, 'st...      46
1  ItemB         0           p  {'store2': 22, 'store3': 15, 'store4': 77, 'st...      46
2  ItemC         0           p  {'store1': 46, 'store2': 13, 'store3': 9, 'sto...      46
3  ItemD         0           q  {'store1': 88, 'store2': 16, 'store4': 5, 'sto...      16
4  ItemE         0           q                        {'store1': 7, 'store2': 55}      16
5  ItemF         1           t         {'store3': 25, 'store4': 75, 'store5': 87}      54
6  ItemG         1           t  {'store1': 32, 'store3': 66, 'store4': 87, 'st...      54
7  ItemH         1           t  {'store1': 54, 'store2': 33, 'store3': 12, 'st...      54

Answer 2

I found a simpler way,

def get_dict_median(x):
    flat_list = [i for k in list(x) for i in k] #Flatten all lists into one single list
    return(np.median(flat_list))

df['sales_count_list'] = df['sales_count'].apply(lambda x: list(x.values()))
df['group_median']=df.groupby(['category','subcategory'])['sales_count_list'].transform(get_dict_median)

Group by multiple columns and get median of dict elements as a new column in pandas

Question

2 answers

solution1
-1 2021-11-23 06:49:50

Code

Output

solution2
-2 2021-11-23 07:35:33

Group by multiple columns and get median of dict elements as a new column in pandas

Question

2 answers

solution1 -1 2021-11-23 06:49:50

Code

Output

solution2 -2 2021-11-23 07:35:33

solution1
-1 2021-11-23 06:49:50

solution2
-2 2021-11-23 07:35:33