Efficient count distinct across columns of a DataFrame, grouped by rows

Question

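What is the fastest way (within the limits of sane pythonicity) to count distinct values, across columns of the same dtype, for each row in a DataFrame?

Details: I have a DataFrame of categorical outcomes by subject (in rows) by day (in columns), similar to the one generated by the following.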

import numpy as np
import pandas as pd

def genSampleData(custCount, dayCount, discreteChoices):
    """generate example dataset"""
    np.random.seed(123)     
    return pd.concat([
               pd.DataFrame({'custId':np.array(range(1,int(custCount)+1))}),
               pd.DataFrame(
                columns = np.array(['day%d' % x for x in range(1,int(dayCount)+1)]),
                data = np.random.choice(a=np.array(discreteChoices), 
                                        size=(int(custCount), int(dayCount)))    
               )], axis=1)

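For example, if the dataset tells us which beverage each customer ordered on each visit to a store, I would like to know the number of distinct beverages per customer.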

# notional discrete choice outcome          
drinkOptions, drinkIndex = np.unique(['coffee','tea','juice','soda','water'], 
                                     return_inverse=True) 

# integer-coded discrete choice outcomes
d = genSampleData(2,3, drinkIndex)
d
#   custId  day1  day2  day3
#0       1     1     4     1
#1       2     3     2     1

# Count distinct choices per subject -- this is what I want to do efficiently on larger DF
d.iloc[:,1:].apply(lambda x: len(np.unique(x)), axis=1)
#0    2
#1    3

# Note: I have coded the choices as `int` rather than `str` to speed up comparisons.
# To reconstruct the choice names, we could do:
# d.iloc[:,1:] = drinkOptions[d.iloc[:,1:]]

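What I've tried: the datasets in this use case will have many more subjects than days (see the example testDf below), so I have tried to find the most efficient row-wise operation: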

testDf = genSampleData(100000,3, drinkIndex)

#---- Original attempts ----
%timeit -n20 testDf.iloc[:,1:].apply(lambda x: x.nunique(), axis=1)
# I didn't wait for this to finish -- something more than 5 seconds per loop
%timeit -n20 testDf.iloc[:,1:].apply(lambda x: len(x.unique()), axis=1)
# Also too slow
%timeit -n20 testDf.iloc[:,1:].apply(lambda x: len(np.unique(x)), axis=1)
#20 loops, best of 3: 2.07 s per loop

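To improve on my original attempts, note that pandas.DataFrame.apply() accepts this argument:

If raw=True the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function this will achieve much better performance

This did cut the runtime by more than half: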

%timeit -n20 testDf.iloc[:,1:].apply(lambda x: len(np.unique(x)), axis=1, raw=True)
#20 loops, best of 3: 721 ms per loop *best so far*
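
As a minimal sketch of what the raw argument changes, the following records the type of object that apply() hands to the function in each mode. The small frame df and the helper count_distinct are made up for illustration; they mirror the two-customer example above rather than the timed testDf.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 frame mirroring the small example `d` above (without custId)
df = pd.DataFrame({'day1': [1, 3], 'day2': [4, 2], 'day3': [1, 1]})

seen_types = []

def count_distinct(row):
    # Record what apply() passes in, then count the distinct values
    seen_types.append(type(row).__name__)
    return len(np.unique(row))

as_series = df.apply(count_distinct, axis=1)            # each row arrives as a Series
as_arrays = df.apply(count_distinct, axis=1, raw=True)  # each row arrives as an ndarray

print(sorted(set(seen_types)))
print(as_series.tolist(), as_arrays.tolist())
```

Both calls return the same counts; the difference is only that raw=True skips building a Series for every row, which is where the saving in the timing above plausibly comes from.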

I was surprised that a pure numpy solution, which would seem to be equivalent to the above with raw=True, was actually a bit slower:

%timeit -n20 np.apply_along_axis(lambda x: len(np.unique(x)), axis=1, arr = testDf.iloc[:,1:].values)
#20 loops, best of 3: 1.04 s per loop

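Finally, I also tried transposing the data so that the distinct count runs column-wise, which I thought might be more efficient (at least for DataFrame.apply()), but there did not seem to be a meaningful difference.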

%timeit -n20 testDf.iloc[:,1:].T.apply(lambda x: len(np.unique(x)), raw=True)
#20 loops, best of 3: 712 ms per loop *best so far*
%timeit -n20 np.apply_along_axis(lambda x: len(np.unique(x)), axis=0, arr = testDf.iloc[:,1:].values.T)
# 20 loops, best of 3: 1.13 s per loop

So far my best solution is an odd combination of df.apply with len(np.unique()), but what else should I try?

Answer 1

My understanding is that nunique is optimized for large series. Here, you have only 3 days. Comparing each column against the others seems to be faster:

testDf = genSampleData(100000,3, drinkIndex)
days = testDf.columns[1:]

%timeit testDf.iloc[:, 1:].stack().groupby(level=0).nunique()
10 loops, best of 3: 46.8 ms per loop

%timeit pd.melt(testDf, id_vars ='custId').groupby('custId').value.nunique()
10 loops, best of 3: 47.6 ms per loop

%%timeit
testDf['nunique'] = 1
for col1, col2 in zip(days, days[1:]):
    # .ix has been removed from pandas; .loc performs the same label-based slice
    testDf['nunique'] += ~(testDf[[col2]].values == testDf.loc[:, 'day1':col1].values).any(axis=1)
100 loops, best of 3: 3.83 ms per loop

It loses its advantage as you add more columns. For different numbers of columns (in the same order: stack().groupby(), pd.melt().groupby(), and the loop):

10 columns: 143ms, 161ms, 30.9ms
50 columns: 749ms, 968ms, 635ms
100 columns: 1.52s, 2.11s, 2.33s
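
The loop above can be sketched in plain NumPy to make the logic easier to follow. This is an illustrative rewrite under assumptions, not the answer's exact code: the function name count_distinct_pairwise and the sample array are hypothetical.

```python
import numpy as np

def count_distinct_pairwise(values):
    """Count distinct values per row by comparing each column with all earlier ones."""
    values = np.asarray(values)
    counts = np.ones(values.shape[0], dtype=int)  # the first column is always a new value
    for j in range(1, values.shape[1]):
        # True where column j repeats a value already seen in columns 0..j-1
        repeated = (values[:, [j]] == values[:, :j]).any(axis=1)
        counts += ~repeated
    return counts

sample = np.array([[1, 4, 1],
                   [3, 2, 1],
                   [2, 2, 2]])
result = count_distinct_pairwise(sample)
print(result)  # [2 3 1]
```

Column j is compared against j earlier columns, so the total number of column comparisons grows quadratically with the number of columns. That is consistent with the loop falling behind the groupby-based approaches at 100 columns.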

Answer 2

pandas.melt combined with DataFrame.groupby and groupby.SeriesGroupBy.nunique seems to blow the other solutions away:

%timeit -n20 pd.melt(testDf, id_vars ='custId').groupby('custId').value.nunique()
#20 loops, best of 3: 67.3 ms per loop
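
To see why this works, here is a small sketch on the two-customer example from the question (the frame is retyped by hand here, so treat the values as illustrative). pd.melt reshapes the wide frame into one row per (custId, day) pair, after which a per-customer nunique gives the distinct count.

```python
import pandas as pd

# Hand-typed copy of the small example `d` from the question
d = pd.DataFrame({'custId': [1, 2],
                  'day1': [1, 3], 'day2': [4, 2], 'day3': [1, 1]})

# Long format: one row per (custId, day), with columns custId, variable, value
long_df = pd.melt(d, id_vars='custId')

# Distinct choices per customer; ['value'] is equivalent to the .value attribute access
per_cust = long_df.groupby('custId')['value'].nunique()
print(per_cust.to_dict())  # {1: 2, 2: 3}
```

Note that the result is indexed by custId rather than by the original row position.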

Answer 3

You don't need custId. I'd stack, then groupby:

testDf.iloc[:, 1:].stack().groupby(level=0).nunique()
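
A minimal sketch of the intermediate result, again on a hand-typed copy of the question's small example: stack() moves the day columns into a second index level, so grouping on level 0 groups by the original row.

```python
import pandas as pd

# Hand-typed copy of the small example `d` from the question
d = pd.DataFrame({'custId': [1, 2],
                  'day1': [1, 3], 'day2': [4, 2], 'day3': [1, 1]})

# Series with a two-level index: (original row label, column name)
stacked = d.iloc[:, 1:].stack()

# Group on the original row label and count distinct values
per_row = stacked.groupby(level=0).nunique()
print(per_row.tolist())  # [2, 3]
```

Because the grouping key is the row label, the result lines up with the original frame's index and can be assigned back as a new column.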

Answer 1
3 accepted 2016-08-04 15:49:30

Answer 2
2 2016-08-04 14:22:36

Answer 3
1 2016-08-04 14:45:26
