
pandas dataframe, and using idxmax() on the n most frequent values

I have a pandas dataframe with days of the week as rows and names as columns. Each cell is an integer counting how many times that person entered the store on that weekday. It looks like this:

    names   'Martha'  'Johnny'  'Chloe'  'Tim'
    'Mon'     3          2        0       7
    'Tue'     0          0        3       0
    'Wed'     1          12       3       0
    'Thu'     5          0        3       0
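
To reproduce the table above, here is a minimal sketch of one way to build it. The variable name `df` is an assumption, chosen to match the name used in the answer's code; the labels are plain strings without the extra quotes.

```python
import pandas as pd

# Rows are weekdays, columns are customers, values are visit counts.
df = pd.DataFrame(
    {
        "Martha": [3, 0, 1, 5],
        "Johnny": [2, 0, 12, 0],
        "Chloe": [0, 3, 3, 3],
        "Tim": [7, 0, 0, 0],
    },
    index=["Mon", "Tue", "Wed", "Thu"],
)

# idxmax() returns only the single most frequent day per customer,
# which is why it is not enough on its own for a "top two" ranking.
print(df.idxmax())
```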

For each customer, I want to rank the days of the week they tend to shop on and pick the top two. In case of ties (for example Chloe) the order doesn't matter, as long as two of the three possibilities are chosen. If someone has only gone to the store on one day (for example Tim), I want the second spot to be null. Here is my desired output:

    names  'Most frequent'   '2nd most freq'
    'Martha'    'Thu'            'Mon'
    'Johnny'    'Wed'            'Mon'
    'Chloe'     'Tue'            'Thu'
    'Tim'       'Mon'             NaN

I've seen similar questions asking about extending argmax(), but not idxmax().

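My current plan (sketched in Python):

    import numpy as np
    import pandas as pd

    def top_two(dataframe):
        work = dataframe.copy()  # avoid zeroing out the original data
        newdataframe = pd.DataFrame(index=work.columns, columns=[0, 1], dtype=object)
        for customer in work.columns:
            for i in (0, 1):
                if (work[customer] == 0).all():
                    newdataframe.loc[customer, i] = np.nan
                else:
                    day = work[customer].idxmax()
                    newdataframe.loc[customer, i] = day
                    work.loc[day, customer] = 0  # exclude this day from the next pass
        return newdataframe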

I imagine someone with more experience than me could do this more efficiently. What do you think? Is there a more efficient way?

Since you also want the 2nd most frequent day, you can define a custom function that sorts each column.

# your data
# ===========================
df

     Martha  Johnny  Chloe  Tim
Mon       3       2      0    7
Tue       0       0      3    0
Wed       1      12      3    0
Thu       5       0      3    0

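# processing
# ======================
import pandas as pd

def func(col):
    # sort the index labels by column value, ascending (sorted() is stable, so ties keep their order)
    idx_sorted, _ = zip(*sorted(zip(col.index.values, col.values), key=lambda x: x[1]))
    # the last two labels are the most and second most frequent days
    return pd.Series({'most_frequent': idx_sorted[-1], 'second_most_freq': idx_sorted[-2]})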

df.apply(func).T

       most_frequent second_most_freq
Martha           Thu              Mon
Johnny           Wed              Mon
Chloe            Thu              Wed
Tim              Mon              Thu
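
Note that this first version returns 'Thu' as Tim's second day even though he never shopped on a Thursday: days with a zero count are ranked like any other, and because `sorted()` is stable, tied values keep their original order. A small pure-Python sketch of that behaviour (the names `days`, `tim`, `chloe`, and `ranked` are illustrative only, not part of the answer):

```python
days = ["Mon", "Tue", "Wed", "Thu"]
tim = [7, 0, 0, 0]
chloe = [0, 3, 3, 3]

def ranked(counts):
    # Sort (day, count) pairs by count, ascending; ties keep input order.
    pairs = sorted(zip(days, counts), key=lambda x: x[1])
    return [day for day, _ in pairs]

print(ranked(tim))    # ['Tue', 'Wed', 'Thu', 'Mon']
print(ranked(chloe))  # ['Mon', 'Tue', 'Wed', 'Thu']
```

Taking the last two entries therefore gives Mon and Thu for Tim, and Thu and Wed for Chloe.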

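Edit: to leave the second slot empty when a customer has shopped on fewer than two days, drop the zero counts before sorting: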

# processing
# ======================
import numpy as np
import pandas as pd

def func(col):
    # keep only the days the customer actually visited
    col = col[col > 0]
    # sort the index labels by column value, ascending; this also works when no days remain
    idx_sorted = [day for day, _ in sorted(zip(col.index.values, col.values), key=lambda x: x[1])]
    # map rank position (0 = most frequent) to day label
    d = dict(enumerate(idx_sorted[::-1]))
    return pd.Series({'most_frequent': d.get(0, np.nan), 'second_most_freq': d.get(1, np.nan)})

df.apply(func).T

       most_frequent second_most_freq
Martha           Thu              Mon
Johnny           Wed              Mon
Chloe            Thu              Wed
Tim              Mon              NaN
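
Alternatively, as a single chained expression. This ranks every day for each customer, most frequent first, and does not blank out zero counts:

df.stack().groupby(level=-1).transform(lambda x: x.rank(method='first')).reset_index().pivot(index='level_1', columns=0, values='level_0').sort_index(axis=1, ascending=False)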
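
Another way to get the same kind of result is `Series.nlargest`, which avoids the manual sort. The following is only a sketch under a few assumptions: the helper name `top_days`, the `top_1`/`top_2` column labels, and the parameter `n` are all made up for illustration, and days with zero visits are treated as "not visited".

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Martha": [3, 0, 1, 5],
        "Johnny": [2, 0, 12, 0],
        "Chloe": [0, 3, 3, 3],
        "Tim": [7, 0, 0, 0],
    },
    index=["Mon", "Tue", "Wed", "Thu"],
)

def top_days(col, n=2):
    """Return the n day labels with the largest positive counts, padded with NaN."""
    visited = col[col > 0]                    # ignore days with no visits
    top = visited.nlargest(n).index.tolist()  # default keep='first' resolves ties
    top += [np.nan] * (n - len(top))          # pad when fewer than n days qualify
    return pd.Series(top, index=[f"top_{i + 1}" for i in range(n)])

# apply() runs top_days on each customer column; .T puts customers on the rows.
result = df.apply(top_days).T
print(result)
```

Because ties are broken by position rather than by any preference, Chloe gets two of her three equally frequent days, which is all the question asks for. Passing a different `n` (for example `df.apply(top_days, n=3)`) extends this to the top three.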


 