I am preprocessing data for a machine learning classification task by converting categorical variables to a binary matrix, primarily using pd.get_dummies(). Applied to a single Pandas DataFrame column, this outputs a new DataFrame with the same number of rows as the original and one column per unique category in the original column.
I need to do this for a DataFrame of shape (3,000,000 x 16), which outputs a binary matrix of shape (3,000,000 x 600).
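For concreteness, here is a minimal sketch of that single-column expansion on toy data (the 'Weekday' column and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical categorical column with 3 unique values
df = pd.DataFrame({'Weekday': ['Mon', 'Tue', 'Mon', 'Wed']})

# One row per input row, one column per unique category
dummies = pd.get_dummies(df['Weekday'], prefix='Weekday')
print(dummies.shape)  # (4, 3)
```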
During the process, the pd.get_dummies() conversion step is very quick, but assigning into the output matrix via pd.DataFrame.loc[] was much slower. I have since switched to writing straight into a np.ndarray, which is much faster, and I'd like to understand why. (Please see the terminal output at the bottom of the question for a time comparison.)
NB: As pointed out in the comments, I could just call pd.get_dummies() on the entire frame. However, some of the columns require tailored preprocessing, i.e. putting values into buckets. The most difficult column to handle contains a string of tags (separated by ',', sometimes followed by a space), which must be processed like this: df[col].str.replace(' ','').str.get_dummies(sep=','). Also, the preprocessed training set and test set need the same set of columns (inherited from the full column list, cols), as they might not have the same features present once they are expanded into a matrix.
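A minimal sketch of that tag handling and of aligning both sets to the same column list, on made-up data (the tag names and the all_cols list here are hypothetical):

```python
import pandas as pd

# Hypothetical tag column: comma-separated tags, sometimes with spaces after commas
s = pd.Series(['red, blue', 'blue', 'green,red'])

# Strip spaces, then split on ',' and one-hot encode each tag
tags = s.str.replace(' ', '').str.get_dummies(sep=',')
print(tags.columns.tolist())  # ['blue', 'green', 'red']

# Align to a fixed full column list so train and test share the same columns;
# features absent from this subset become all-zero columns
all_cols = ['blue', 'green', 'red', 'yellow']
aligned = tags.reindex(columns=all_cols, fill_value=0)
```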
Please see code below for each version
DataFrame version:
    def preprocess_df(df):
        with open(PICKLE_PATH + 'cols.pkl', 'rb') as handle:
            cols = pickle.load(handle)
        # pre-allocate the output as a DataFrame so .loc[] can assign by column name
        x = pd.DataFrame(0, index=df.index, columns=cols)
        for col in df.columns:
            # 1. make binary matrix
            df_col = pd.get_dummies(df[col], prefix=str(col))
            print "Processed: ", col, datetime.datetime.now()
            # 2. assign each column of the binary matrix to its column in the output
            for dummy_col in df_col.columns:
                x.loc[:, dummy_col] = df_col[dummy_col]
            print "Assigned: ", col, datetime.datetime.now()
        return x.values
np version:
    def preprocess_np(df):
        with open(PICKLE_PATH + 'cols.pkl', 'rb') as handle:
            cols = pickle.load(handle)
        x = np.zeros(shape=(len(df), len(cols)))
        for col in df.columns:
            # 1. make binary matrix
            df_col = pd.get_dummies(df[col], prefix=str(col))
            print "Processed: ", col, datetime.datetime.now()
            # 2. copy each column of the binary matrix into its slot in the output
            for dummy_col in df_col.columns:
                idx = cols.index(dummy_col)  # position of this dummy column in the full column list
                x[:, idx] = df_col[dummy_col].values
            print "Assigned: ", col, datetime.datetime.now()
        return x
Timed outputs (10,000 examples)
DataFrame version:
Processed: Weekday
Assigned: Weekday 0.437081
Processed: Hour 0.002366
Assigned: Hour 1.33815
np version:
Processed: Weekday
Assigned: Weekday 0.006992
Processed: Hour 0.002632
Assigned: Hour 0.008989
Is there a different approach to further optimize this? I ask because at the moment I am discarding a potentially useful feature, as it is too slow to add the extra 15,000 columns it would require to the output.
Any general advice on the approach I am taking is also appreciated!
Thank you
One experiment would be to change over to x.loc[:, dummy_col] = df_col[dummy_col].values. When the input is a Series, pandas checks the alignment of the indices for every assignment. Assigning an ndarray turns that off when it's unnecessary, and that should improve performance.
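To see the effect this describes, here is a small sketch on hypothetical data comparing the two assignment styles; both produce the same result here because the indices already line up, but the ndarray assignment skips the per-column alignment check:

```python
import numpy as np
import pandas as pd

n = 100000
out = pd.DataFrame(0, index=range(n), columns=['a', 'b'])
col = pd.Series(np.random.randint(0, 2, n))

# Series assignment: pandas aligns on the index for every column assigned
out.loc[:, 'a'] = col

# ndarray assignment: no index alignment, positional copy only
out.loc[:, 'b'] = col.values
```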