What is the most efficient way to index a NumPy matrix?

Question

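Question: What is the most efficient way to implement the equivalent of the following pandas DataFrame operation at scale: temp = df[df.feature == value]? (See below for context on the scale.)

Background: I have daily time-series data covering 30 years for roughly 500 entities. For every entity and every day, I need to build 90 features from various lookbacks over the past 240 days. At the moment I loop over each day, process all of that day's data, and insert the result into a pre-allocated numpy matrix, but this is proving very slow for a dataset of this size.

Naive approach: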

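import pandas as pd

df = pd.DataFrame()

for day in range(241, t_max):
    temp_a = df_timeseries[df_timeseries.t == day].copy()
    temp_b = df_timeseries[df_timeseries.t == day - 1].copy()

    # Divide positionally: the two slices carry different row labels,
    # so a plain Series division would align on the index and yield NaN.
    new_val = pd.DataFrame({'feature_1': temp_a.feature_1.to_numpy() / temp_b.feature_1.to_numpy()})

    new_val['t'] = day
    new_val['entity'] = temp_a.entity.to_numpy()

    # pd.concat returns a new frame, so the result must be reassigned
    df = pd.concat([df, new_val])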

Current approach (simplified):

import numpy as np

df = np.matrix(np.zeros([num_days*num_entities, 3]))

col_dict = dict(zip(df_timeseries.columns, list(range(0,len(df_timeseries.columns)))))

mtrx_timeseries = np.matrix(df_timeseries.to_numpy())

for i, day in enumerate(range(241, t_max)):
    interm = np.matrix(np.zeros([num_entities, 3]))
    interm[:, 0] = day

    temp_a = mtrx_timeseries[np.where(mtrx_timeseries[:, col_dict['t']] == day)[0], :]
    temp_b = mtrx_timeseries[np.where(mtrx_timeseries[:, col_dict['t']] == day - 1)[0], :]
    temp_cr = temp_a[:, col_dict['feature_1']]/temp_b[:, col_dict['feature_1']] - 1

    temp_a = mtrx_timeseries[np.where(mtrx_timeseries[:, col_dict['t']] == day - 5)[0], :]
    temp_b = mtrx_timeseries[np.where(mtrx_timeseries[:, col_dict['t']] == day - 10)[0], :]

    temp_or = temp_a[:, col_dict['feature_1']]/temp_b[:, col_dict['feature_1']] - 1

    interm[:, 1:] = np.concatenate([temp_cr, temp_or], axis=1)

    df[i*num_entities : (i + 1)*num_entities, :] = interm

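Line profiling the full version of the code shows that each statement of the form mtrx_timeseries[np.where(mtrx_timeseries[:, col_dict['t']] == day)[0], :] takes about 23% of the total runtime, so I am looking for a leaner solution. Since indexing takes the most time, and the loop means it is repeated on every iteration, perhaps one solution is to index only once, store each day's data in a separate array element, and then loop over those elements?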
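The "index only once" idea could look roughly like the sketch below. It is an illustration, not the asker's code: the toy frame is made up, and it assumes every day contains the same set of entities so that sorting by entity lines the rows up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 entities, days 0..5 (the real frame is far larger).
df_timeseries = pd.DataFrame({
    "entity": np.tile(["a", "b", "c"], 6),
    "t": np.repeat(np.arange(6), 3),
    "feature_1": np.arange(1.0, 19.0),
})

# Group once; each day's rows are pulled out a single time as a NumPy array.
# Sorting by entity keeps rows aligned across days (assumes identical entity sets).
by_day = {
    int(day): grp.sort_values("entity")["feature_1"].to_numpy()
    for day, grp in df_timeseries.groupby("t")
}

# The loop now does dictionary lookups instead of scanning the whole matrix.
change = {
    day: by_day[day] / by_day[day - 1] - 1
    for day in range(1, 6)
}
```

The full-table comparison then happens once, inside groupby, rather than several times per iteration.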

Answer 1

This is not a complete solution to your problem, but I think it will get you where you need to go.

Consider the following code:

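import numpy as np

entity_dict = {}   # maps each entity to its row in arr
entity_idx = 0
arr = np.zeros((num_entities, t_max-240))

# The day column is named 't' in the question; this assumes t >= 240 throughout.
# itertuples keeps per-column dtypes, and int() guards against a float day column.
for entity, day, feature in df_timeseries[['entity', 't', 'feature_1']].itertuples(index=False):
    if entity not in entity_dict:
        entity_dict[entity] = entity_idx
        entity_idx += 1
    arr[entity_dict[entity], int(day)-240] = feature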

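This converts df_timeseries into an array of shape num_entities*num_days organized by entity, and does so very efficiently. You don't need any fancy indexing at all. The most efficient way to index a numpy array or matrix is to know in advance which indices you need, rather than searching the array for them. You can then perform array operations (your operations look to me like simple element-wise division, which you can do in a few lines with no additional loops).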
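As a rough sketch of what those array operations might look like, the two ratios from the question can be written as column slices once the data is laid out with entities on rows and consecutive days on columns. The random array and the names one_day and lagged are made up for illustration.

```python
import numpy as np

# Hypothetical dense array: rows are entities, columns are consecutive days.
rng = np.random.default_rng(0)
arr = rng.uniform(1.0, 2.0, size=(4, 20))

# One-day change: every column divided by the column before it.
one_day = arr[:, 1:] / arr[:, :-1] - 1        # shape (4, 19)

# Ratio of day d-5 to day d-10, as in the question's second feature.
# For column d this is arr[:, d - 5] / arr[:, d - 10] - 1, defined for d >= 10.
lagged = arr[:, 5:-5] / arr[:, :-10] - 1      # shape (4, 10)
```

Each slice is a view on arr, so nothing is searched or copied before the division itself.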

Then convert back to the original format.
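One possible way to do that conversion is sketched below, assuming you have kept the row order of the entities (for instance from entity_dict) and the day label of each column. The names entities, days, result and long_df are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: a (num_entities, num_days) result plus its row and column labels.
entities = np.array(["a", "b", "c"])
days = np.arange(241, 245)
result = np.arange(12, dtype=float).reshape(3, 4)

# ravel() walks the array row by row, so repeat each entity label
# and tile the day labels to match that order.
long_df = pd.DataFrame({
    "entity": np.repeat(entities, len(days)),
    "t": np.tile(days, len(entities)),
    "feature": result.ravel(),
})
```

The resulting frame has one row per (entity, day) pair, like the original df_timeseries.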

Solution 1
0 · Accepted · 2022-05-22 05:54:25
