I have a pandas dataframe that has days of the week for rows, and names for columns. Inside the dataframe are integers representing how many times that person has entered the store on that weekday. It looks like this:
names 'Martha' 'Johnny' 'Chloe' 'Tim'
'Mon' 3 2 0 7
'Tue' 0 0 3 0
'Wed' 1 12 3 0
'Thu' 5 0 3 0
I want, for each customer, to rank the days of the week they tend to shop on, and pick the top two. In case of duplicates (for example Chloe) order doesn't matter, as long as two of of three possibilities are chosen. In case someone has only gone to the store on one day (for example Tim) I'd want the second spot to be null. Here is my desired output:
names 'Most frequent' '2nd most freq'
'Martha' 'Thu' 'Mon'
'Johnny' 'Wed' 'Mon'
'Chloe' 'Tue' 'Thu'
'Tim' 'Mon' NaN
I've seen similar questions asking about extending argmax(), but not idmax().
My current plan (in pseudocode):
for customer in dataframe:
for i = 0,1:
if all elements zero:
newdataframe[customer, i] = NaN
else:
newdataframe[customer, i] = dataframe.idxmax()[customer]
dataframe[dataframe.idxmax()[customer], customer] = 0
return newdataframe
I imagine someone with more experience than I could probably do this a bit more efficiently. What do you think? Is there a more efficient way?
Since you want also the 2nd most frequent day, you can define a custom function to do the sort for each column.
# your data
# ===========================
df
Martha Johnny Chloe Tim
Mon 3 2 0 7
Tue 0 0 3 0
Wed 1 12 3 0
Thu 5 0 3 0
# processing
# ======================
def func(col):
# sort index according column values
idx_sorted, _ = zip(*sorted(zip(col.index.values, col.values), key=lambda x: x[1]))
return pd.Series({'most_frequent': idx_sorted[-1], 'second_most_freq': idx_sorted[-2]})
df.apply(func).T
most_frequent second_most_freq
Martha Thu Mon
Johnny Wed Mon
Chloe Thu Wed
Tim Mon Thu
# processing
# ======================
import numpy as np
def func(col):
# sort index according column values
col = col[col > 0]
idx_sorted, _ = zip(*sorted(zip(col.index.values, col.values), key=lambda x: x[1]))
d = dict(zip(np.arange(len(idx_sorted)), idx_sorted[::-1]))
return pd.Series({'most_frequent': d.get(0, np.nan), 'second_most_freq': d.get(1, np.nan)})
df.apply(func).T
most_frequent second_most_freq
Martha Thu Mon
Johnny Wed Mon
Chloe Thu Wed
Tim Mon NaN
df.stack(-1).groupby(level=-1).transform(lambda x: x.argsort(0)).reset_index().pivot('level_1',0).sort_index(axis = 1, ascending = False)
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.