I am trying to preprocess my dataset for deep learning. I have a CSV file that I read with pandas. The first column is a string and all the other columns are floats. I want to apply min-max normalization to all the float columns.
import os

import numpy as np
import pandas as pd
from sklearn import preprocessing

# read the CSV with pandas
metadatas = pd.read_csv(os.path.join(dataset_dir, "metadata.csv"), header=None)
metadatas = np.array(metadatas)
metadatas_values = metadatas[:, 1:]

# normalize the float data
scaler = preprocessing.MinMaxScaler()
scaler.fit(metadatas_values)
metadatas_scaled = scaler.transform(metadatas_values)
# create a DataFrame to insert the string column ('filename') back into the data,
# then convert it back to an array
df_scaled = pd.DataFrame(metadatas_scaled)
df_scaled.insert(0, 'filename', metadatas[:, 0])
metadatas_scaled = np.array(df_scaled)
# use a for loop to index the float columns based on the string column 'filename'
for filename in filenames:
    metadata = metadatas_scaled[np.where(metadatas_scaled == filename)[0]][0, 1:]
I think this code is inefficient and slow when I have more than 30,000 files; the most time-consuming part seems to be indexing the array inside the for loop. Is there a more efficient way to do this? Thanks in advance!
Here is a slightly optimized version. The np.where(metadatas_scaled == filename) lookup compares every cell of the array against the string on every iteration, which is why the loop dominates the runtime. The idea below is to minimize allocations of new arrays and to use groupby() instead of the loop + indexing.
import os

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# read the file and set the first column as the index
df = pd.read_csv(
    os.path.join(dataset_dir, "metadata.csv"),
    header=None,
    index_col=0,
)

# scale all columns except the index (our string column),
# reusing the index and column names from the original DataFrame
df_scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(df.values),
    index=df.index,
    columns=df.columns,
)

# group by the index (our string column); groupby yields (name, group) pairs
for filename, metadata in df_scaled.groupby(level=0):
    # do something with metadata
    ...
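If you only need the scaled rows for particular filenames rather than iterating over every group, the string index also supports direct lookups via .loc, which avoids scanning the whole array the way np.where does. A minimal sketch, assuming each filename appears exactly once in the index and that filenames is your existing list of names:

# direct lookup on the string index; with a unique index, .loc returns
# a Series holding the scaled float columns for that file
for filename in filenames:
    metadata = df_scaled.loc[filename].to_numpy()
    ...

Because the lookup goes through the index's hash table rather than an elementwise comparison over the full array, it stays fast even with more than 30,000 files.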