Efficiently updating values in a pandas DataFrame with mixed dtype columns

Question

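I have a large pandas DataFrame of shape (700,000, 5,000) containing columns of mixed dtypes (mostly int8, some float64, and a couple of datetime64[ns]). For each row in the DataFrame, I want to set the values of certain columns to zero if another column is also equal to zero.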

If I iterate over the DataFrame and set the values using iloc, it is very slow. I have tried both iterrows and itertuples, e.g.

1. iterrows

ix_1 = 3
ix_to_change = [20, 24, 51]  # Actually it is almost 5000 columns to change
for i, row in df.iterrows():
    if not row[ix_1]:
        df.iloc[i, ix_to_change] = 0

2. itertuples

ix_1 = 3
ix_to_change = [20, 24, 51]  # Actually it is almost 5000 columns to change
for row in df.itertuples():
    if not row[ix_1 + 1]:
        df.iloc[row[0], ix_to_change] = 0

I have also tried using pandas indexing, but it is also very slow (though better than iterrows or itertuples).

3. pandas loc & iloc

df.loc[df.iloc[:, ix_1]==0, df.columns[ix_to_change]] = 0

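Then I tried dropping down to the underlying numpy array, which works fine performance-wise, but I run into problems with dtypes.

It iterates quickly through the underlying array, but the new DataFrame has all 'object' dtypes. If I try to set the dtype of each column (as in the code below), it fails on the datetime columns, possibly because they contain NaT items.

4. numpy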

X = df.values
for i, x in enumerate(X):
    if not x[ix_1]:
        X[i].put(ix_to_change, 0)
original_dtypes = df.dtypes
df = pd.DataFrame(data=X, index=df.index, columns=df.columns)
for col, col_dtype in original_dtypes.items():
    df[col] = df[col].astype(col_dtype)

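Is there a better way for me to make the updates in the first place?

Or if not, how should I keep my dtypes the same (the datetime columns are not in the list of columns to change, in case that is relevant)?

Or perhaps there is a better way to update the original DataFrame from my updated numpy array, where I only update the changed columns (all of which are int8)?

Update

As requested in the comments, here is a minimal example showing how the int8 dtypes become object dtypes after going into numpy. To be clear, this is only an issue for method 4 above (which is the only non-slow method I have so far, if I can fix this dtype problem):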

import pandas as pd

df = pd.DataFrame({'int8_col':[10,11,12], 'float64_col':[1.5, 2.5, 3.5]})
df['int8_col'] = df['int8_col'].astype('int8')
df['datetime64_col'] = pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-03'])

>>> df.dtypes
float64_col              float64
int8_col                    int8
datetime64_col    datetime64[ns]
dtype: object

X = df.values
# At this point in real life I modify the int8 column(s) only in X

new_df = pd.DataFrame(data=X, index=df.index, columns=df.columns)

>>> new_df.dtypes
float64_col       object
int8_col          object
datetime64_col    object
dtype: object
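
A minimal sketch of where the dtype is lost, on the same three-column frame (the variable names `mixed` and `int8_only` are only illustrative): `.values` has to return a single NumPy array, so columns of different dtypes are upcast to a common one, which is object here. Selecting only the int8 column before calling `.values` keeps int8.

```python
import pandas as pd

df = pd.DataFrame({'int8_col': [10, 11, 12], 'float64_col': [1.5, 2.5, 3.5]})
df['int8_col'] = df['int8_col'].astype('int8')
df['datetime64_col'] = pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-03'])

# All columns at once: one array must hold int8, float64 and datetime64 values
mixed = df.values

# Only the int8 column: no upcasting is needed
int8_only = df[['int8_col']].values

print(mixed.dtype)      # object
print(int8_only.dtype)  # int8
```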

Answer 1

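TL;DR

For Pandas / NumPy efficiency, don't use mixed types (object dtype) within a column. There are methods available to convert series to numeric and then manipulate them efficiently.

You can use pd.DataFrame.select_dtypes to determine the numeric columns. Assuming these are the only columns whose values you wish to update, you can then feed them to pd.DataFrame.loc.

"It iterates quickly through the underlying array, but the new DataFrame has all 'object' dtypes."

Given that you are left with object dtype series, it seems your definition of ix_to_change includes non-numeric series. In that case, you should convert all numeric columns to a numeric dtype. For example, using pd.to_numeric: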

df[ix_to_change] = df[ix_to_change].apply(pd.to_numeric, errors='coerce')

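Pandas / NumPy cannot help with object dtype series in terms of performance, if that is what you are after. These series are represented internally as a sequence of pointers, much like a list.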
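
To illustrate the conversion, here is a small sketch with made-up values (not the asker's data): an object dtype series holding numeric strings becomes float64 after pd.to_numeric, and with errors='coerce' any entry that cannot be parsed becomes NaN.

```python
import pandas as pd

# Hypothetical object dtype series; 'x' cannot be parsed as a number
s = pd.Series(['1', '2', 'x'], dtype=object)

converted = pd.to_numeric(s, errors='coerce')

print(s.dtype)                    # object
print(converted.dtype)            # float64
print(converted.isna().tolist())  # [False, False, True]
```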

Here is an example showing what you can do:

import pandas as pd, numpy as np

df = pd.DataFrame({'key': [0, 2, 0, 4, 0],
                   'A': [0.5, 1.5, 2.5, 3.5, 4.5],
                   'B': [2134, 5634, 134, 63, 1234],
                   'C': ['fsaf', 'sdafas',' dsaf', 'sdgf', 'fdsg'],
                   'D': [np.nan, pd.to_datetime('today'), np.nan, np.nan, np.nan],
                   'E': [True, False, True, True, False]})

numeric_cols = df.select_dtypes(include=[np.number]).columns

df.loc[df['key'] == 0, numeric_cols] = 0

Result:

     A     B       C          D      E  key
0  0.0     0    fsaf        NaT   True    0
1  1.5  5634  sdafas 2018-09-05  False    2
2  0.0     0    dsaf        NaT   True    0
3  3.5    63    sdgf        NaT   True    4
4  0.0     0    fdsg        NaT  False    0

As expected, the object dtype series are not converted to numeric:

print(df.dtypes)

A             float64
B               int64
C              object
D      datetime64[ns]
E                bool
key             int64
dtype: object
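
For the case in the question, where only int8 columns need zeroing, the same idea can be narrowed to that dtype. This is a sketch on a small made-up frame (the column names `a`, `f` and `when` are hypothetical); it assumes every int8 column is a column to change, which may not hold for the real data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key': np.array([0, 2, 0], dtype='int8'),
    'a': np.array([5, 6, 7], dtype='int8'),
    'f': [1.5, 2.5, 3.5],
    'when': pd.to_datetime(['2018-01-01', None, '2018-01-03']),  # contains NaT
})

# Pick only the int8 columns, then zero them on rows where key == 0
int8_cols = df.select_dtypes(include=['int8']).columns
df.loc[df['key'] == 0, int8_cols] = 0

print(df)
print(df.dtypes)
```

The float and datetime columns are never touched by the assignment, so their values and dtypes stay as they were.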

Answer 2

This uses the efficiency of NumPy iteration when updating the values and also solves the dtype problem.

# numpy array of rows. Only includes columns to update (all int8) so dtype doesn't change
X = df.iloc[:, ix_to_change].values

# Set index on key to allow enumeration to match index
key_col = df.iloc[:, ix_1]
key_col.index = range(len(key_col))

# Set entire row (~5000 values) to zeros. More efficient than updating element-wise.
zero_row = np.zeros(X.shape[1])
for i, row in enumerate(X):
    if key_col[i] == 0:
        X[i] = zero_row

# Transpose to get array of column arrays.
# Each column array creates and replaces a Series in the DataFrame
for i, row in enumerate(X.T):
    df[df.columns[ix_to_change[i]]] = row

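X is a NumPy array containing only the columns I want to "zero", all of which are of int8 dtype.

I iterate over the rows of X (which is much more efficient here than in pandas), and X.T then gives me arrays that I can use to replace entire columns in pandas.

This avoids slow iloc / loc calls on the large DataFrame, and in the end the dtypes of all columns are unchanged.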
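
As a possible variation that is not part of the original answer, the row loop can be replaced with a boolean mask on the NumPy array. The sketch below uses a small made-up frame and hypothetical values for ix_1 and ix_to_change; it assumes all columns to change share the int8 dtype, as in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key': np.array([0, 2, 0, 4], dtype='int8'),
    'a': np.array([1, 2, 3, 4], dtype='int8'),
    'b': np.array([5, 6, 7, 8], dtype='int8'),
    'f': [0.5, 1.5, 2.5, 3.5],
})
ix_1 = 0              # position of the key column (hypothetical)
ix_to_change = [1, 2]  # positions of the columns to zero (hypothetical)

# Copy only the int8 columns so the array keeps the int8 dtype
X = df.iloc[:, ix_to_change].to_numpy(copy=True)

# Boolean mask of rows whose key is zero; zero those rows in one step
mask = df.iloc[:, ix_1].to_numpy() == 0
X[mask] = 0

# Write each updated column back as a whole column
for j, col_ix in enumerate(ix_to_change):
    df[df.columns[col_ix]] = X[:, j]

print(df)
print(df.dtypes)
```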

Answer 1: 1, 2018-09-05 08:12:53

Answer 2: 0, accepted, 2018-09-05 12:13:34