I am trying to preprocess my dataset for deep learning. I have a CSV file that I read with pandas. The first column is a string and all the other columns are floats. I want to apply min-max normalization to all the float columns.
import os

import numpy as np
import pandas as pd
from sklearn import preprocessing

# read the CSV with pandas
metadatas = pd.read_csv(os.path.join(dataset_dir, "metadata.csv"), header=None)
metadatas = np.array(metadatas)
metadatas_values = metadatas[:, 1:]

# normalize the float data
scaler = preprocessing.MinMaxScaler()
scaler.fit(metadatas_values)
metadatas_scaled = scaler.transform(metadatas_values)
# create a DataFrame to insert the string column ('filename') back into the data,
# then convert it back to an array
df_scaled = pd.DataFrame(metadatas_scaled)
df_scaled.insert(0, 'filename', metadatas[:, 0])
metadatas_scaled = np.array(df_scaled)
# use a for loop to index the float columns based on the string column 'filename'
for filename in filenames:
    metadata = metadatas_scaled[np.where(metadatas_scaled == filename)[0]][0, 1:]
I think this code is inefficient and slow when I have more than 30,000 files; the most time-consuming part seems to be indexing the array inside the for loop. Is there a more efficient way to do this? Thanks in advance!
Here is a slightly optimized version. The np.where(metadatas_scaled == filename) lookup compares every cell of the array against the string on every iteration, which is why the loop dominates the runtime. The idea below is to minimize allocations of new arrays and to use groupby() instead of the loop + indexing.
import os

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# read the file and set the first column as the index
df = pd.read_csv(
    os.path.join(dataset_dir, "metadata.csv"),
    header=None,
    index_col=0,
)

# scale all columns except the index (our string column),
# reusing the index and column names from the original DataFrame
df_scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(df.values),
    index=df.index,
    columns=df.columns,
)

# group by the index (our string column); groupby yields (name, group) pairs
for filename, metadata in df_scaled.groupby(level=0):
    # do something with metadata
    ...
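If you only need the scaled rows for particular filenames rather than iterating over every group, the string index also supports direct lookups via .loc, which avoids scanning the whole array the way np.where does. A minimal sketch, assuming each filename appears exactly once in the index and that filenames is your existing list of names:

# direct lookup on the string index; with a unique index, .loc returns
# a Series holding the scaled float columns for that file
for filename in filenames:
    metadata = df_scaled.loc[filename].to_numpy()
    ...

Because the lookup goes through the index's hash table rather than an elementwise comparison over the full array, it stays fast even with more than 30,000 files.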