Fastest way to create a pandas column conditionally

Question

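In a Pandas DataFrame, I want to create a new column conditionally, based on the value of another column. In my application, the DataFrame typically has a few million rows, and the number of unique conditional values is small, on the order of unity (three in the example below). Performance is extremely important: what is the fastest way to generate the new column?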

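I created an example case below, and tried and compared several methods. In the example, the conditional filling is represented by a dictionary lookup based on the value of the column label (here: one of 1, 2, 3).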

lookup_dict = {
    1: 100,   # arbitrary
    2: 200,   # arbitrary
    3: 300,   # arbitrary
    }

I then expect my DataFrame to be filled as:

       label  output
0      3     300
1      2     200
2      3     300
3      3     300
4      2     200
5      2     200
6      1     100
7      1     100

Below are 6 different methods, tested on 10M rows (parameter Nlines in the test code):

Method 1: pandas.groupby().apply()
Method 2: pandas.groupby().indices.items()
Method 3: pandas.Series.map
Method 4: for loop on labels
Method 5: numpy.select
Method 6: numba

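The full code is given at the end of the question, together with the runtimes of all methods. The output of every method is asserted to be equal before performances are compared.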

Method 1: `pandas.groupby().apply()`

I use pandas.groupby() on label, then fill each block with the same value using apply().

def fill_output(r):
    ''' called by groupby().apply(): all r.label values are the same '''
    r.loc[:, 'output'] = lookup_dict[r.iloc[0]['label']]
    return r

df = df.groupby('label').apply(fill_output)

I get:

>>> method_1_groupby ran in 2.29s (average over 3 iterations)

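Note that groupby().apply() is run twice on the first group to determine which code path to use (see Pandas #2936). This can slow things down when there are only a few groups. I tried to trick Method 1 by adding a dummy first group, but I did not get much improvement.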

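Method 2: `pandas.groupby().indices.items()`

The second method is a variant: instead of using apply, I access the indices directly with groupby().indices.items(). This ends up being about twice as fast as Method 1, and it is the method I had been using for a long time.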

dgb = df.groupby('label')
for label, idx in dgb.indices.items():
    df.loc[idx, 'output'] = lookup_dict[label]

Which gives:

method_2_indices ran in 1.21s (average over 3 iterations)
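One detail worth knowing about Method 2: groupby().indices maps each label to an array of integer row positions, not index labels. Using df.loc[idx, ...] works in the question because the DataFrame has the default integer index, where positions and labels coincide. The sketch below is a position-based variant for a DataFrame with a different index; the letter index and the out_col name are assumptions made for illustration.

```python
import pandas as pd

lookup_dict = {1: 100, 2: 200, 3: 300}

# Same labels as the example table, but with a non-default (letter) index
df = pd.DataFrame({'label': [3, 2, 3, 3, 2, 2, 1, 1]},
                  index=list('abcdefgh'))

# Dict of label -> array of integer row positions
indices = df.groupby('label').indices

# Create the column first, then fill it by position with iloc
df['output'] = 0
out_col = df.columns.get_loc('output')
for label, idx in indices.items():
    df.iloc[idx, out_col] = lookup_dict[label]

print(df)
```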

Method 3: `pandas.Series.map`

Here I use pandas.Series.map.

df['output'] = df.label.map(lookup_dict.get)

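I had very good results with it in a similar case, where the number of lookup values was comparable to the number of rows. In the present case, however, map ends up slower than Method 1 (3.07s against 2.29s):

method_3_map ran in 3.07s (average over 3 iterations)

I attribute this to the small number of lookup values, but there may simply be a problem with the way I implemented it.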

Method 4: for loop on labels

The fourth method is quite naive: I simply loop over all labels and select the matching part of the DataFrame.

for label, value in lookup_dict.items():
    df.loc[df.label == label, 'output'] = value

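Surprisingly, though, this turned out much faster than the previous methods. I expected the groupby-based solutions to beat it, because here Pandas has to perform three full comparisons of the form df.label == label. The results prove me wrong: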

method_4_forloop ran in 0.54s (average over 3 iterations)

Method 5: `numpy.select`

The fifth method uses the numpy select function, based on this StackOverflow answer (https://stackoverflow.com/a/19913845/5622825).

conditions = [df.label == k for k in lookup_dict.keys()]
choices = list(lookup_dict.values())

df['output'] = np.select(conditions, choices)

This yields the best result so far:

method_5_select ran in 0.29s (average over 3 iterations)
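As a small illustration of Method 5, the sketch below applies numpy.select to the eight-row example from the top of the question, plus one extra row whose label has no entry in lookup_dict. The extra row and the default=-1 sentinel are assumptions added for illustration; without a default, numpy.select fills unmatched rows with 0.

```python
import numpy as np
import pandas as pd

lookup_dict = {1: 100, 2: 200, 3: 300}

# Labels from the example table, plus a hypothetical label 4 with no lookup entry
df = pd.DataFrame({'label': [3, 2, 3, 3, 2, 2, 1, 1, 4]})

# One boolean mask per key, in the same order as the choices
conditions = [df['label'] == k for k in lookup_dict]
choices = list(lookup_dict.values())

# Rows matching no condition receive the default (-1 is an arbitrary sentinel)
df['output'] = np.select(conditions, choices, default=-1)
print(df)
```

Each condition is one vectorized pass over the column, so the cost grows with the number of keys, which is why this approach suits a small lookup table.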

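Finally, I tried a numba approach in Method 6.

Method 6: numba

Just for the sake of the example, the conditional filling values are hardcoded in the compiled function. I don't know how to give Numba a list as a runtime constant: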

@jit(int64[:](int64[:]), nopython=True)
def hardcoded_conditional_filling(column):
    output = np.zeros_like(column)
    i = 0
    for c in column:
        if c == 1:
            output[i] = 100
        elif c == 2:
            output[i] = 200
        elif c == 3:
            output[i] = 300
        i += 1
    return output

df['output'] = hardcoded_conditional_filling(df.label.values)

This gave the best time of all, roughly 50% faster than Method 5:

method_6_numba ran in 0.19s (average over 3 iterations)

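I have not adopted this approach in practice, for the reason stated above: I don't know how to give Numba a list as a runtime constant without a major drop in performance.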
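One possible way around the hardcoding, sketched here under assumptions and not benchmarked: pass the lookup keys and values to the compiled function as NumPy arrays, so that they are ordinary runtime arguments. The function name fill_from_table is hypothetical, and the try/except fallback only exists so that the sketch still runs (slowly, as plain Python) when numba is not installed. Whether this keeps the speed of the hardcoded version would have to be measured.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback (assumption for portability): run the function uncompiled
    def njit(func):
        return func

@njit
def fill_from_table(column, keys, values):
    ''' For each element of column, copy the value paired with the matching key '''
    output = np.zeros_like(column)
    for i in range(column.shape[0]):
        for j in range(keys.shape[0]):
            if column[i] == keys[j]:
                output[i] = values[j]
                break
    return output

lookup_dict = {1: 100, 2: 200, 3: 300}
keys = np.array(list(lookup_dict.keys()), dtype=np.int64)
values = np.array(list(lookup_dict.values()), dtype=np.int64)

labels = np.array([3, 2, 3, 3, 2, 2, 1, 1], dtype=np.int64)
result = fill_from_table(labels, keys, values)
print(result)
```

The inner loop is a linear scan over the keys, which is only reasonable because the number of distinct labels is small. Labels with no matching key are left at 0.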

Full code

import pandas as pd
import numpy as np
from timeit import timeit
from numba import jit, int64

lookup_dict = {
        1: 100,   # arbitrary
        2: 200,   # arbitrary
        3: 300,   # arbitrary
        }

Nlines = int(1e7)

# Generate random integer labels in {1, 2, 3}
label = np.round(np.random.rand(Nlines)*2+1).astype(np.int64)
df0 = pd.DataFrame(label, columns=['label'])

# Now the goal is to assign the lookup_dict values to a new column 'output' 
# based on the value of label

# Method 1
# using groupby().apply()

def method_1_groupby(df):

    def fill_output(r):
        ''' called by groupby().apply(): all r.label values are the same '''
        #print(r.iloc[0]['label'])   # activate to reveal the #2936 issue in Pandas
        r.loc[:, 'output'] = lookup_dict[r.iloc[0]['label']]
        return r

    df = df.groupby('label').apply(fill_output)
    return df 

def method_2_indices(df):

    dgb = df.groupby('label')
    for label, idx in dgb.indices.items():
        df.loc[idx, 'output'] = lookup_dict[label]

    return df

def method_3_map(df):

    df['output'] = df.label.map(lookup_dict.get)

    return df

def method_4_forloop(df):
    ''' naive '''

    for label, value in lookup_dict.items():
        df.loc[df.label == label, 'output'] = value

    return df

def method_5_select(df):
    ''' Based on answer from 
    https://stackoverflow.com/a/19913845/5622825
    '''

    conditions = [df.label == k for k in lookup_dict.keys()]
    choices = list(lookup_dict.values())

    df['output'] = np.select(conditions, choices)

    return df

def method_6_numba(df):
    ''' This works, but it is hardcoded and i don't really know how
    to make it compile with list as runtime constants'''


    @jit(int64[:](int64[:]), nopython=True)
    def hardcoded_conditional_filling(column):
        output = np.zeros_like(column)
        i = 0
        for c in column:
            if c == 1:
                output[i] = 100
            elif c == 2:
                output[i] = 200
            elif c == 3:
                output[i] = 300
            i += 1
        return output

    df['output'] = hardcoded_conditional_filling(df.label.values)

    return df

df1 = method_1_groupby(df0)
df2 = method_2_indices(df0.copy())
df3 = method_3_map(df0.copy())
df4 = method_4_forloop(df0.copy())
df5 = method_5_select(df0.copy())
df6 = method_6_numba(df0.copy())

# make sure we haven't modified the input (would bias the results)
assert 'output' not in df0.columns 

# Test validity
assert (df1 == df2).all().all()
assert (df1 == df3).all().all()
assert (df1 == df4).all().all()
assert (df1 == df5).all().all()
assert (df1 == df6).all().all()

# Compare performances
Nites = 3
print('Compare performances for {0:.1g} lines'.format(Nlines))
print('-'*30)
for method in [
               'method_1_groupby', 'method_2_indices', 
               'method_3_map', 'method_4_forloop', 
               'method_5_select', 'method_6_numba']:
    print('{0} ran in {1:.2f}s (average over {2} iterations)'.format(
            method, 
            timeit("{0}(df)".format(method), setup="from __main__ import df0, {0}; df=df0.copy()".format(method), number=Nites)/Nites,
            Nites))

Output:

Compare performances for 1e+07 lines
------------------------------
method_1_groupby ran in 2.29s (average over 3 iterations)
method_2_indices ran in 1.21s (average over 3 iterations)
method_3_map ran in 3.07s (average over 3 iterations)
method_4_forloop ran in 0.54s (average over 3 iterations)
method_5_select ran in 0.29s (average over 3 iterations)
method_6_numba ran in 0.19s (average over 3 iterations)

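I would be interested in any other solution that yields better performance. I was originally looking for Pandas-based methods, but I accept numba/cython-based solutions too.

Edit

Adding Chrisb's methods for comparison: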

def method_3b_mapdirect(df):
    ''' Suggested by https://stackoverflow.com/a/51388828/5622825'''

    df['output'] = df.label.map(lookup_dict)

    return df

def method_7_take(df):
    ''' Based on answer from 
    https://stackoverflow.com/a/19913845/5622825

    Exploiting that labels are continuous integers
    '''

    lookup_arr = np.array(list(lookup_dict.values()))
    df['output'] = lookup_arr.take(df['label'] - 1)

    return df

With runtimes of:

method_3b_mapdirect ran in 0.23s (average over 3 iterations)
method_7_take ran in 0.11s (average over 3 iterations)

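This makes #3 (with the dictionary passed directly) faster than every method in the original comparison except #6, and the most elegant as well. Use #7 if your use case is compatible.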

Answer 1

I would consider .map (#3) the idiomatic way to do this, but don't pass .get: pass the dictionary itself, and you should see a pretty significant improvement.

# lower bound 1 is assumed (it is missing in the source) so that labels fall in {1, 2, 3}
df = pd.DataFrame({'label': np.random.randint(1, 4, size=1000000, dtype='i8')})

%timeit df['output'] = df.label.map(lookup_dict.get)
261 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df['output'] = df.label.map(lookup_dict)
69.6 ms ± 3.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
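The comparison above can be reproduced outside IPython with a short script. This is a minimal sketch assuming labels in {1, 2, 3} and a smaller frame than the one timed above; it checks that both calls give the same column and prints timings from the timeit module, which will vary by machine and pandas version. A likely explanation for the gap is that a callable is invoked once per element, while a dictionary lets pandas do the lookup in a vectorized way.

```python
import numpy as np
import pandas as pd
from timeit import timeit

lookup_dict = {1: 100, 2: 200, 3: 300}

# Assumed test data: 100k random labels drawn from {1, 2, 3}
rng = np.random.default_rng(0)
df = pd.DataFrame({'label': rng.integers(1, 4, size=100_000)})

via_get = df['label'].map(lookup_dict.get)   # callable mapper
via_dict = df['label'].map(lookup_dict)      # dictionary mapper

# Both mappers should produce the same values
same = bool((via_get == via_dict).all())

n = 5
t_get = timeit(lambda: df['label'].map(lookup_dict.get), number=n) / n
t_dict = timeit(lambda: df['label'].map(lookup_dict), number=n) / n
print('same result: {0}'.format(same))
print('map(.get): {0:.4f}s, map(dict): {1:.4f}s'.format(t_get, t_dict))
```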

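If the number of conditions is small and the comparison is cheap (i.e. integers, as in your lookup table), directly comparing the values (#4 and especially #5) is faster than .map, but this won't always be true, for example if the labels are a set of strings.

If your lookup labels really are contiguous integers, you can exploit that and do the lookup with take, which should be about as fast as numba. I think this is basically as fast as it can go: you could write the equivalent in Cython, but it would not be any quicker.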

%%timeit
lookup_arr = np.array(list(lookup_dict.values()))
df['output'] = lookup_arr.take(df['label'] - 1)
8.68 ms ± 332 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
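The take lookup relies on two things that the snippet above leaves implicit: the dictionary values must be ordered by key, and every label must fall inside the table. The sketch below makes both explicit on the small example from the question. The variable names first and positions and the explicit range check are illustrative additions, not part of the answer's code.

```python
import numpy as np
import pandas as pd

lookup_dict = {1: 100, 2: 200, 3: 300}
df = pd.DataFrame({'label': [3, 2, 3, 3, 2, 2, 1, 1]})

# Order the table by key instead of relying on dictionary insertion order
keys = sorted(lookup_dict)
first = keys[0]
if keys != list(range(first, first + len(keys))):
    raise ValueError('labels must be contiguous integers')
lookup_arr = np.array([lookup_dict[k] for k in keys])

# Shift labels so that the smallest key maps to position 0
positions = df['label'].to_numpy() - first

# take() raises on positions past the end, but negative positions wrap
# around silently, so check the range explicitly
if ((positions < 0) | (positions >= len(lookup_arr))).any():
    raise ValueError('label outside the lookup table')

df['output'] = lookup_arr.take(positions)
print(df)
```

The range check costs one extra pass over the data, so it can be dropped when the labels are known to be valid.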

Solution 1
Score 7, accepted, 2018-07-17 19:22:02
