Slow assignment to a Pandas DataFrame with float32 and float64

Question

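Assignments to a Pandas DataFrame that mixes float32 and float64 data are rather slow for some dtype combinations, at least the way I do them.

The code below sets up a DataFrame, runs a NumPy/SciPy computation on part of the data, creates a new DataFrame by copying the old one, and assigns the result of the computation to the new DataFrame: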

import pandas as pd
import numpy as np
from scipy.signal import lfilter

N = 1000
M = 1000

def f(dtype1, dtype2):
    coi = [str(m) for m in range(M)]
    df = pd.DataFrame([[m for m in range(M)] + ['Hello', 'World'] for n in range(N)],
                      columns=coi + ['A', 'B'], dtype=dtype1)
    # Note: .ix was removed in pandas 1.0; on current versions use df.loc[:, coi]
    Y = lfilter([1], [0.5, 0.5], df.ix[:, coi])
    Y = Y.astype(dtype2)
    new = pd.DataFrame(df, copy=True)
    print(new.iloc[0, 0].dtype)
    print(Y.dtype)
    new.ix[:, coi] = Y    # This statement is considerably slow
    print(new.iloc[0, 0].dtype)


from time import time

dtypes = [np.float32, np.float64]
for dtype1 in dtypes:
    for dtype2 in dtypes:
        print('-' * 10)
        start_time = time()
        f(dtype1, dtype2)
        print(time() - start_time)

The timing results are:

----------
float32
float32
float64
10.1998147964
----------
float32
float64
float64
10.2371120453
----------
float64
float32
float64
0.864870071411
----------
float64
float64
float64
0.866265058517

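The crucial line here is new.ix[:, coi] = Y: for some combinations it is roughly ten times slower (about 10.2 s versus 0.87 s).

I can understand that assigning float64 data to a float32 DataFrame needs some overhead for reallocation. But why is the overhead so dramatic?

In addition, the combination of a float32 DataFrame with a float32 assignment is also slow, and the result is float64, which puzzles me as well.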
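The reallocation mentioned above can be seen at the NumPy level. The following is only a minimal sketch of the cast itself (the array names `Y32` and `Y64` are illustrative and not taken from the code above); it shows that converting between float32 and float64 always produces a new buffer of a different size, but it does not by itself account for the tenfold slowdown.

```python
import numpy as np

# A float32 array with the same shape as the filtered data (1000 x 1000).
Y32 = np.zeros((1000, 1000), dtype=np.float32)

# astype to a different dtype allocates a new array; float64 needs 8 bytes per value.
Y64 = Y32.astype(np.float64)

print(Y32.nbytes, Y64.nbytes)        # sizes in bytes of the two buffers
print(np.shares_memory(Y32, Y64))    # the cast result does not reuse the old buffer
```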
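To check which dtype a block assignment leaves behind on a given pandas installation, a small diagnostic along the following lines may help. It is a sketch under assumptions: the helper name `dtypes_around_assignment` is hypothetical, it uses `.loc` because `.ix` no longer exists in pandas 1.0 and later, and the resulting dtypes can differ between pandas versions, so no expected output is shown.

```python
import numpy as np
import pandas as pd

def dtypes_around_assignment(dtype1, dtype2, n_rows=4, n_cols=3):
    """Return the column dtypes before and after a multi-column assignment."""
    cols = [str(m) for m in range(n_cols)]
    df = pd.DataFrame(np.zeros((n_rows, n_cols), dtype=dtype1), columns=cols)
    values = np.ones((n_rows, n_cols), dtype=dtype2)

    before = {str(t) for t in df.dtypes}
    df.loc[:, cols] = values            # multi-column assignment, as in the question
    after = {str(t) for t in df.dtypes}
    return before, after

for dtype1 in (np.float32, np.float64):
    for dtype2 in (np.float32, np.float64):
        print(dtype1.__name__, dtype2.__name__,
              dtypes_around_assignment(dtype1, dtype2))
```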

Answer 1

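Single-column assignment does not change the type (in all four cases below the result keeps the dtype of Y), and iterating over the columns with a for loop seems reasonably fast for assignments that need no type conversion, both float32 and float64. For assignments that do involve a type conversion, the time is roughly twice the worst case of the multi-column assignment (about 21 s versus about 10 s).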

import pandas as pd
import numpy as np
from scipy.signal import lfilter

N = 1000
M = 1000

def f(dtype1, dtype2):
    coi = [str(m) for m in range(M)]
    df = pd.DataFrame([[m for m in range(M)] + ['Hello', 'World'] for n in range(N)],
                      columns=coi + ['A', 'B'], dtype=dtype1)
    Y = lfilter([1], [0.5, 0.5], df.ix[:, coi])
    Y = Y.astype(dtype2)
    new = df.copy()
    print(new.iloc[0, 0].dtype)
    print(Y.dtype)
    for n, column in enumerate(coi):  # New: loop over the columns, one assignment each
        new.ix[:, column] = Y[:, n]
    print(new.iloc[0, 0].dtype)

from time import time

dtypes = [np.float32, np.float64]
for dtype1 in dtypes:
    for dtype2 in dtypes:
        print('-' * 10)
        start_time = time()
        f(dtype1, dtype2)
        print(time() - start_time)

The results are:

----------
float32
float32
float32
0.809890985489
----------
float32
float64
float64
21.4767119884
----------
float64
float32
float32
20.5611870289
----------
float64
float64
float64
0.765362977982
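On pandas 1.0 and later, where `.ix` is no longer available, the per-column idea can be sketched with plain column replacement. This is an assumption-laden illustration rather than the method timed above: the helper `assign_columns` is hypothetical, and `new[column] = ...` replaces the whole column, so each column simply takes the dtype of the assigned data. No timings are claimed for it.

```python
import numpy as np
import pandas as pd

def assign_columns(df, columns, values):
    """Return a copy of df whose `columns` are replaced by the columns of `values`."""
    new = df.copy()
    for n, column in enumerate(columns):
        # Whole-column replacement: the column takes the dtype of values[:, n].
        new[column] = values[:, n]
    return new

# Small stand-in for the frame above: numeric columns plus one string column.
coi = [str(m) for m in range(5)]
df = pd.DataFrame(np.zeros((4, 5), dtype=np.float32), columns=coi)
df['A'] = 'Hello'

Y = np.ones((4, 5), dtype=np.float64)
new = assign_columns(df, coi, Y)
print(new.dtypes)
```

The original frame is left untouched, so the dtypes of `df` and `new` can be compared directly.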

Answer 1, 2016-02-09 14:02:51
