What is the most efficient way of indexing Numpy matrices?

Question: What is the most efficient way to implement the equivalent of the following, using Pandas dataframes, at scale (see below for context re: scale): temp = df[df.feature == value]?

Background: I have daily time-series data for ~500 entities over 30 years, and for each entity and each day I need to create 90 features based on various look-backs of up to 240 days into the past. Currently, I loop through each day, manipulate all of that day's data, and insert it into a pre-allocated numpy matrix, but this is proving very slow for the size of my data set.

Naive approach:

import pandas as pd

df = pd.DataFrame()

for day in range(241, t_max):
    # rows for the current day and the previous day
    temp_a = df_timeseries[df_timeseries.t == day].copy()
    temp_b = df_timeseries[df_timeseries.t == day - 1].copy()

    # day-over-day ratio of feature_1 (the division aligns on the row
    # index, so this assumes matching indices across the two days)
    new_val = (temp_a.feature_1 / temp_b.feature_1).to_frame('new_val')

    new_val['t'] = day
    new_val['entity'] = temp_a.entity

    df = pd.concat([df, new_val])
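For comparison, the same ratio can also be computed with no per-day loop at all. Here is a sketch using sort plus groupby/shift, which assumes one row per (entity, t) pair with consecutive days per entity, so that shift(1) really is the previous day:

import pandas as pd

# sort once so each entity's days are consecutive, then let shift(1)
# line every row up with the same entity's previous day
df_sorted = df_timeseries.sort_values(['entity', 't'])
prev = df_sorted.groupby('entity')['feature_1'].shift(1)
df_sorted['new_val'] = df_sorted['feature_1'] / prev
result = df_sorted.loc[df_sorted['t'] >= 241, ['t', 'entity', 'new_val']]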

Current approach (simplified):

import numpy as np

# pre-allocate the output: one row per (day, entity) pair
df = np.matrix(np.zeros([num_days*num_entities, 3]))

# map column names to column indices in the matrix
col_dict = {col: i for i, col in enumerate(df_timeseries.columns)}

mtrx_timeseries = np.matrix(df_timeseries.to_numpy())

for i, day in enumerate(range(241, t_max)):
    interm = np.matrix(np.zeros([num_entities, 3]))
    interm[:, 0] = day

    # 1-day look-back ratio
    temp_a = mtrx_timeseries[np.where(mtrx_timeseries[:, col_dict['t']] == day)[0], :]
    temp_b = mtrx_timeseries[np.where(mtrx_timeseries[:, col_dict['t']] == day - 1)[0], :]
    temp_cr = temp_a[:, col_dict['feature_1']]/temp_b[:, col_dict['feature_1']] - 1

    # 5-day vs. 10-day look-back ratio
    temp_a = mtrx_timeseries[np.where(mtrx_timeseries[:, col_dict['t']] == day - 5)[0], :]
    temp_b = mtrx_timeseries[np.where(mtrx_timeseries[:, col_dict['t']] == day - 10)[0], :]
    temp_or = temp_a[:, col_dict['feature_1']]/temp_b[:, col_dict['feature_1']] - 1

    interm[:, 1:] = np.concatenate([temp_cr, temp_or], axis=1)

    df[i*num_entities : (i + 1)*num_entities, :] = interm

Line profiling the full version of this code shows that statements of the form mtrx_timeseries[np.where(mtrx_timeseries[:, col_dict['t']] == day)[0], :] account, in total, for ~23% of the run time, hence my search for a more streamlined solution. Since the indexing takes the most time, and since the loop repeats it on every iteration, perhaps one solution would be to index just once, storing each day's data in a separate array element, and then loop through the array elements?
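To make that concrete, here is a minimal sketch of the idea, reusing mtrx_timeseries and col_dict from above: the row indices for each day are computed once, up front, so the loop body replaces every np.where scan with a dictionary lookup.

import numpy as np

# precompute day -> row indices once, instead of scanning every iteration
t_col = np.asarray(mtrx_timeseries[:, col_dict['t']]).ravel()
day_rows = {d: np.where(t_col == d)[0] for d in np.unique(t_col)}

# inside the loop, each selection then becomes, e.g.:
# temp_a = mtrx_timeseries[day_rows[day], :]
# temp_b = mtrx_timeseries[day_rows[day - 1], :]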

This isn't a complete solution to your problem, but I think it will get you where you need to be.

Consider the following code:

import numpy as np

entity_dict = {}  # maps each entity label to a row index
entity_idx = 0
# one row per entity, one column per day (this assumes day values run 1..t_max)
arr = np.zeros((num_entities, t_max))

for entity, day, feature in df_timeseries[['entity', 't', 'feature_1']].values:
    if entity not in entity_dict:
        entity_dict[entity] = entity_idx
        entity_idx += 1
    arr[entity_dict[entity], int(day) - 1] = feature

This converts df_timeseries into a num_entities-by-num-days array organized by entity, very efficiently, and you won't need to do any fancy indexing at all. The most efficient way to index a numpy array or matrix is to know which indices you need ahead of time rather than searching the array for them. You can then perform whole-array operations; it looks to me like your operation is simple elementwise division, which you can do in a couple of lines with no extra loop, as sketched below.
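To illustrate, here is a sketch under the assumptions above (arr has one column per day 1..t_max, every (entity, day) pair is filled, and the ratios are the 1-day and 5-vs-10-day look-backs from your loop). The per-day loop collapses to shifted-slice divisions:

# 1-day ratio for days 241..t_max; day d lives in column d - 1
cr = arr[:, 240:] / arr[:, 239:-1] - 1

# (day - 5) vs. (day - 10) ratio over the same output days
orr = arr[:, 235:-5] / arr[:, 230:-10] - 1

# both results have shape (num_entities, t_max - 240);
# column j corresponds to day 241 + j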

Then convert back to the original format.
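A hypothetical sketch of that last step, assuming the cr array from the sketch above and the entity_dict built earlier:

import numpy as np
import pandas as pd

# invert entity_dict so that row i maps back to its entity label
entities = [None] * num_entities
for name, idx in entity_dict.items():
    entities[idx] = name

days = np.arange(241, t_max + 1)  # output days, matching cr's columns

out = pd.DataFrame({
    't': np.repeat(days, num_entities),
    'entity': np.tile(entities, len(days)),
    'new_val': cr.T.ravel(),  # day-major flattening of the (entity, day) grid
})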
