
pandas DataFrame: using idxmax() on the n most frequent values

I have a pandas dataframe that has days of the week for rows, and names for columns. Inside the dataframe are integers representing how many times that person has entered the store on that weekday. It looks like this:

    names   'Martha'  'Johnny'  'Chloe'  'Tim'
    'Mon'     3          2        0       7
    'Tue'     0          0        3       0
    'Wed'     1          12       3       0
    'Thu'     5          0        3       0
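
To make the example reproducible, here is a minimal sketch that builds the same table. The variable name `df` and the `names` label on the columns are assumptions chosen for illustration.

```python
import pandas as pd

# Visit counts per weekday (rows) and customer (columns), copied from the table above.
df = pd.DataFrame(
    {
        "Martha": [3, 0, 1, 5],
        "Johnny": [2, 0, 12, 0],
        "Chloe": [0, 3, 3, 3],
        "Tim": [7, 0, 0, 0],
    },
    index=["Mon", "Tue", "Wed", "Thu"],
)
df.columns.name = "names"  # assumed label for the column axis

print(df)
```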

For each customer, I want to rank the days of the week they tend to shop on and pick the top two. In case of ties (for example Chloe), order doesn't matter, as long as two of the three possibilities are chosen. If someone has only gone to the store on one day (for example Tim), I'd want the second spot to be null. Here is my desired output:

    names  'Most frequent'   '2nd most freq'
    'Martha'    'Thu'            'Mon'
    'Johnny'    'Wed'            'Mon'
    'Chloe'     'Tue'            'Thu'
    'Tim'       'Mon'             NaN

I've seen similar questions asking about extending argmax(), but not idxmax().

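My current plan, written out as a loop:

    import numpy as np
    import pandas as pd

    def top_two_days(dataframe):
        work = dataframe.copy()  # zero out counts on a copy, not the original
        cols = ['Most frequent', '2nd most freq']
        newdataframe = pd.DataFrame(index=work.columns, columns=cols, dtype=object)
        for customer in work.columns:
            for col in cols:
                if (work[customer] == 0).all():
                    # no visits left to rank for this customer
                    newdataframe.loc[customer, col] = np.nan
                else:
                    day = work[customer].idxmax()
                    newdataframe.loc[customer, col] = day
                    work.loc[day, customer] = 0  # so the next pass finds the runner-up
        return newdataframe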

I imagine someone with more experience than I could probably do this a bit more efficiently. What do you think? Is there a more efficient way?

Since you also want the second most frequent day, you can define a custom function that sorts each column and apply it column by column.

    import pandas as pd

    # your data
    # ===========================
    df

         Martha  Johnny  Chloe  Tim
    Mon       3       2      0    7
    Tue       0       0      3    0
    Wed       1      12      3    0
    Thu       5       0      3    0

    # processing
    # ======================
    def func(col):
        # sort the index labels by the column's values (ascending)
        idx_sorted, _ = zip(*sorted(zip(col.index.values, col.values), key=lambda x: x[1]))
        return pd.Series({'most_frequent': idx_sorted[-1], 'second_most_freq': idx_sorted[-2]})

    df.apply(func).T

           most_frequent second_most_freq
    Martha           Thu              Mon
    Johnny           Wed              Mon
    Chloe            Thu              Wed
    Tim              Mon              Thu

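Edit: to get NaN when a customer has shopped on fewer than two distinct days, drop the zero counts before sorting: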

    # processing
    # ======================
    import numpy as np

    def func(col):
        # keep only the days with at least one visit
        col = col[col > 0]
        # index labels ordered by visit count, highest first;
        # a list comprehension also copes with a column that is all zeros
        pairs = sorted(zip(col.index.values, col.values), key=lambda x: x[1])
        idx_sorted = [day for day, _ in pairs][::-1]
        d = dict(enumerate(idx_sorted))
        return pd.Series({'most_frequent': d.get(0, np.nan), 'second_most_freq': d.get(1, np.nan)})

    df.apply(func).T

           most_frequent second_most_freq
    Martha           Thu              Mon
    Johnny           Wed              Mon
    Chloe            Thu              Wed
    Tim              Mon              NaN

Another answer, as a one-liner:

    df.stack(-1).groupby(level=-1).transform(lambda x: x.argsort(0)).reset_index().pivot('level_1',0).sort_index(axis = 1, ascending = False)
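
For comparison, the same result can probably be reached with less code using `Series.nlargest`. This is only a sketch under stated assumptions: the helper name `top_days` and the `LABELS` list are hypothetical, the frame is rebuilt from the question's table, and ties are broken by whichever day appears first in the index (the default `keep='first'`), so Chloe's pair differs from the answer above but is still valid.

```python
import numpy as np
import pandas as pd

# Rebuild the question's table (weekdays as rows, customers as columns).
df = pd.DataFrame(
    {
        "Martha": [3, 0, 1, 5],
        "Johnny": [2, 0, 12, 0],
        "Chloe": [0, 3, 3, 3],
        "Tim": [7, 0, 0, 0],
    },
    index=["Mon", "Tue", "Wed", "Thu"],
)

LABELS = ["most_frequent", "second_most_freq"]  # hypothetical output column names

def top_days(col):
    """Hypothetical helper: the days with the largest non-zero counts, padded with NaN."""
    n = len(LABELS)
    # Drop days with no visits, then take the index labels of the n largest counts.
    days = col[col > 0].nlargest(n).index.tolist()
    # Pad so that customers with fewer than n shopping days get NaN in the rest.
    days += [np.nan] * (n - len(days))
    return pd.Series(days, index=LABELS)

result = df.apply(top_days).T
print(result)
```

Because `nlargest` only selects the top entries, this avoids sorting and zipping the whole column by hand, and a customer with no visits at all simply gets NaN in both columns.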
