Fastest way to join column values in a pandas DataFrame?

Question

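Problem:

Given a large dataset (3 million rows x 6 columns), what is the fastest way to join the values of several columns within a single pandas DataFrame, but only for the rows where a mask is true?

My current solution: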

import pandas as pd
import numpy as np
  
# Note: Real data will be 3 million rows x 6 columns.
df = pd.DataFrame({'time': ['0', '1', '2', '3'],
               'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
               'd0': ['a', 'x', 'a', '1'],
               'd1': ['b', 'x', 'b', '2'],
               'd2': ['c', 'x', np.nan, '3']})
#print(df)

msg_text_filter = ['msg0', 'msg2']
columns = df.columns.drop(df.columns[:3])
column_join = ["d0"]

mask = df['msg'].isin(msg_text_filter)

df.replace(np.nan,'',inplace=True)
# THIS IS SLOW, HOW TO SPEED UP?
df['d0'] = np.where(
    mask,
    df[['d0','d1','d2']].agg(''.join, axis=1),
    df['d0']
)
df.loc[mask, columns] = np.nan

print(df)

Answer 1

IMHO, you can save a lot of time by using

df[['d0', 'd1', 'd2']].sum(axis=1)

instead of

df[['d0', 'd1', 'd2']].agg(''.join, axis=1)
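To check the claimed speed-up on your own machine, a small timing harness along these lines could be used. This is only a sketch: `make_frame` and `time_joins` are hypothetical helper names, the frame is synthetic, it is forced to object dtype, and it contains no NaN (the question already replaces NaN with ''). Actual timings depend on the pandas version and hardware, so none are quoted here.

```python
import timeit

import pandas as pd

COLS = ['d0', 'd1', 'd2']


def make_frame(n_rows):
    # Hypothetical helper: synthetic string data, kept as object dtype
    # so that sum(axis=1) concatenates the strings.
    data = {'d0': ['a'] * n_rows, 'd1': ['b'] * n_rows, 'd2': ['c'] * n_rows}
    return pd.DataFrame(data, dtype=object)


def time_joins(n_rows=10_000, repeats=3):
    # Time both row-wise joins on the same frame; keep the best of `repeats`.
    df = make_frame(n_rows)
    candidates = {
        'agg_join': lambda: df[COLS].agg(''.join, axis=1),
        'sum_axis1': lambda: df[COLS].sum(axis=1),
    }
    return {name: min(timeit.repeat(fn, number=1, repeat=repeats))
            for name, fn in candidates.items()}


timings = time_joins()
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

Raising `n_rows` toward the real 3 million rows gives a more representative comparison, at the cost of a longer run for the `agg` variant.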

I think that, instead of using np.where, you could also do:

df.loc[mask, 'd0'] = df.loc[mask, ['d0', 'd1', 'd2']].sum(axis=1)
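For reference, here is a self-contained sketch of the masked update on the question's sample data. It swaps in plain Series addition (`+`) for the row-wise `sum`, which is an assumption on my part and not something the answer proposes. The name `join_cols` is hypothetical. Unlike the question's code, only the masked rows have their NaN values replaced with empty strings.

```python
import numpy as np
import pandas as pd

# Sample frame copied from the question.
df = pd.DataFrame({'time': ['0', '1', '2', '3'],
                   'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
                   'd0': ['a', 'x', 'a', '1'],
                   'd1': ['b', 'x', 'b', '2'],
                   'd2': ['c', 'x', np.nan, '3']})

join_cols = ['d0', 'd1', 'd2']  # hypothetical name for the columns to join
mask = df['msg'].isin(['msg0', 'msg2'])

# Work only on the masked rows; treat missing values as empty strings.
sub = df.loc[mask, join_cols].fillna('')

# Element-wise string addition, column by column (no per-row Python call).
joined = sub['d0'] + sub['d1'] + sub['d2']

# Write the joined strings back, then clear the columns merged into d0.
df.loc[mask, 'd0'] = joined
df.loc[mask, ['d1', 'd2']] = np.nan

print(df)
```

Restricting the work to `df.loc[mask, ...]` means the join is computed only for the rows that will actually be overwritten, which may matter when the mask selects a small fraction of 3 million rows.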


Solution 1: 1 accepted, 2022-11-14 12:11:20
