pandas DataFrame: using idxmax() to get the n most frequent values
I have a pandas DataFrame with days of the week as rows and customer names as columns. Each cell is an integer counting how many times that person entered the store on that weekday. It looks like this:
names 'Martha' 'Johnny' 'Chloe' 'Tim'
'Mon' 3 2 0 7
'Tue' 0 0 3 0
'Wed' 1 12 3 0
'Thu' 5 0 3 0
For each customer, I want to rank the days of the week they tend to shop on and pick the top two. In case of ties (for example Chloe), the order doesn't matter, as long as two of the three possibilities are chosen. If someone has only gone to the store on one day (for example Tim), I want the second spot to be null. Here is my desired output:
names 'Most frequent' '2nd most freq'
'Martha' 'Thu' 'Mon'
'Johnny' 'Wed' 'Mon'
'Chloe' 'Tue' 'Thu'
'Tim' 'Mon' NaN
I've seen similar questions about extending argmax(), but not idxmax().
My current plan (in pseudocode):
for customer in dataframe:
    for i = 0,1:
        if all elements zero:
            newdataframe[customer, i] = NaN
        else:
            newdataframe[customer, i] = dataframe.idxmax()[customer]
            dataframe[dataframe.idxmax()[customer], customer] = 0
return newdataframe
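For reference, a minimal runnable sketch of this plan, assuming the table above is held in a DataFrame named df. The names work, labels, and result are illustrative only, and the loop works on a copy so the original counts are not overwritten:

```python
import numpy as np
import pandas as pd

# Sample data from the question: rows are weekdays, columns are customers.
df = pd.DataFrame(
    {
        "Martha": [3, 0, 1, 5],
        "Johnny": [2, 0, 12, 0],
        "Chloe": [0, 3, 3, 3],
        "Tim": [7, 0, 0, 0],
    },
    index=["Mon", "Tue", "Wed", "Thu"],
)

work = df.copy()  # scratch copy; picked days are zeroed out here
labels = ["Most frequent", "2nd most freq"]
result = pd.DataFrame(index=df.columns, columns=labels, dtype=object)

for customer in work.columns:
    for label in labels:
        if (work[customer] == 0).all():
            # no visits left for this customer: leave the slot empty
            result.loc[customer, label] = np.nan
        else:
            # idxmax returns the first row label holding the maximum
            day = work[customer].idxmax()
            result.loc[customer, label] = day
            work.loc[day, customer] = 0  # exclude this day from the next pass

print(result)
```

Because idxmax() returns the first label on ties, Chloe ends up with 'Tue' and 'Wed' here, which the question allows.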
I imagine someone with more experience than I have could do this more efficiently. What do you think? Is there a more efficient way?
Since you also want the second most frequent day, you can define a custom function that sorts each column.
# your data
# ===========================
df
Martha Johnny Chloe Tim
Mon 3 2 0 7
Tue 0 0 3 0
Wed 1 12 3 0
Thu 5 0 3 0
# processing
# ======================
import pandas as pd

def func(col):
    # sort the index labels (days) by the column's values, ascending
    idx_sorted, _ = zip(*sorted(zip(col.index.values, col.values), key=lambda x: x[1]))
    # the last two labels are the most and second most frequent days
    return pd.Series({'most_frequent': idx_sorted[-1], 'second_most_freq': idx_sorted[-2]})

df.apply(func).T
most_frequent second_most_freq
Martha Thu Mon
Johnny Wed Mon
Chloe Thu Wed
Tim Mon Thu
# processing (zero counts dropped, so a missing rank becomes NaN)
# ======================
import numpy as np

def func(col):
    # keep only days with at least one visit
    col = col[col > 0]
    # sort the remaining (day, count) pairs by count, ascending
    pairs = sorted(zip(col.index.values, col.values), key=lambda x: x[1])
    # map rank (0 = most frequent) to day; also works when no day is left
    d = dict(enumerate(day for day, _ in reversed(pairs)))
    return pd.Series({'most_frequent': d.get(0, np.nan), 'second_most_freq': d.get(1, np.nan)})

df.apply(func).T
most_frequent second_most_freq
Martha Thu Mon
Johnny Wed Mon
Chloe Thu Wed
Tim Mon NaN
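As a possible alternative, the same result can probably be reached with less manual sorting by using Series.nlargest. This is only a sketch under the assumption that the data matches the sample above; the helper name top_two_days is made up for illustration:

```python
import numpy as np
import pandas as pd

# Sample data from the question: rows are weekdays, columns are customers.
df = pd.DataFrame(
    {
        "Martha": [3, 0, 1, 5],
        "Johnny": [2, 0, 12, 0],
        "Chloe": [0, 3, 3, 3],
        "Tim": [7, 0, 0, 0],
    },
    index=["Mon", "Tue", "Wed", "Thu"],
)

def top_two_days(col):
    # hypothetical helper: drop zero counts, then take the two largest
    top = col[col > 0].nlargest(2).index.tolist()
    # pad with NaN when fewer than two days have any visits
    top += [np.nan] * (2 - len(top))
    return pd.Series(top, index=["most_frequent", "second_most_freq"])

result = df.apply(top_two_days).T
print(result)
```

Which of the tied days is picked for Chloe depends on how nlargest breaks ties, which is acceptable here since the question says any two of the three will do.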
df.stack(-1).groupby(level=-1).transform(lambda x: x.argsort(0)).reset_index().pivot('level_1',0).sort_index(axis = 1, ascending = False)