
Python Pandas: Why is numpy so much faster than Pandas for column assignment? Can I optimize further?

I am preprocessing data for a Machine Learning classification task by converting categorical variables to a binary matrix, primarily using pd.get_dummies(). This is applied to a single Pandas DataFrame column and outputs a new DataFrame with the same number of rows as the original and a width equal to the number of unique categorical values in the original column.

I need to do this for a DataFrame of shape (3,000,000 x 16), which outputs a binary matrix of shape (3,000,000 x 600).

During the process, the step of converting to a binary matrix with pd.get_dummies() is very quick, but assigning into the output matrix with pd.DataFrame.loc[] was much slower. Since I have switched to saving straight to a np.ndarray, which is much faster, I just wonder why? (Please see the terminal output at the bottom of the question for a time comparison.)

nb As pointed out in the comments, I could just call pd.get_dummies() on the entire frame. However, some of the columns require tailored preprocessing, i.e. putting values into buckets. The most difficult column to handle contains a string of tags (separated by ',' or ', '), which must be processed like this: df[col].str.replace(' ','').str.get_dummies(sep=','). Also, the preprocessed training set and test set need the same set of columns (inherited from all_cols), as they might not have the same features present once they are expanded into a matrix.
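For illustration, here is how that tag-splitting idiom behaves on a tiny made-up frame (the column name 'tags' and the data are my own assumptions, not from the question):

```python
import pandas as pd

# Hypothetical tag column: comma-separated tags, sometimes with stray spaces
df = pd.DataFrame({'tags': ['a, b', 'b,c', 'a']})

# Strip spaces, then split on ',' and one-hot encode each tag
dummies = df['tags'].str.replace(' ', '').str.get_dummies(sep=',')
print(dummies)
#    a  b  c
# 0  1  1  0
# 1  0  1  1
# 2  1  0  0
```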

Please see code below for each version

DataFrame version:

import datetime
import pickle

import numpy as np
import pandas as pd

def preprocess_df(df):
    with open(PICKLE_PATH + 'cols.pkl', 'rb') as handle:
        cols = pickle.load(handle)

    x = pd.DataFrame(0, index=np.arange(len(df)), columns=cols)

    for col in df.columns:
        # 1. make binary matrix
        df_col = pd.get_dummies(df[col], prefix=str(col))

        print("Processed: ", col, datetime.datetime.now())

        # 2. assign each value in binary matrix to col in output
        for dummy_col in df_col.columns:
            x.loc[:, dummy_col] = df_col[dummy_col]

        print("Assigned: ", col, datetime.datetime.now())

    return x.values

np version:

def preprocess_np(df):
    with open(PICKLE_PATH + 'cols.pkl', 'rb') as handle:
        cols = pickle.load(handle)

    x = np.zeros(shape=(len(df),len(cols)))

    for col in df.columns:
        # 1. make binary matrix
        df_col = pd.get_dummies(df[col], prefix=str(col))

        print("Processed: ", col, datetime.datetime.now())

        # 2. assign each value in binary matrix to col in output
        for dummy_col in df_col.columns:
            idx = [i for i, j in enumerate(cols) if j == dummy_col][0]
            x[:, idx] = df_col[dummy_col].values

        print("Assigned: ", col, datetime.datetime.now())

    return x
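One further micro-optimization worth trying (my suggestion, not part of the original code): the list comprehension above scans all of cols for every dummy column, which is O(len(cols)) per lookup; building a name-to-index dict once makes each lookup O(1). A minimal sketch, with toy stand-ins for cols and df_col:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the question's `cols` and `df_col` (names are assumptions)
cols = ['Hour_0', 'Hour_1', 'Weekday_0']
df_col = pd.get_dummies(pd.Series([0, 1, 0]), prefix='Hour')

x = np.zeros((len(df_col), len(cols)))

# Build the name -> column-index map once, instead of a linear scan
# over `cols` for every dummy column
col_idx = {name: i for i, name in enumerate(cols)}
for dummy_col in df_col.columns:
    x[:, col_idx[dummy_col]] = df_col[dummy_col].values
```

With 600 (or 15,000) output columns, replacing the repeated linear scans with a single dict build should shave a noticeable constant factor off the assignment loop.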

Timed outputs (10,000 examples)

DataFrame version:

Processed:  Weekday 
Assigned:  Weekday 0.437081  
Processed:  Hour 0.002366
Assigned:  Hour 1.33815

np version:

Processed:  Weekday   
Assigned:  Weekday 0.006992
Processed:  Hour 0.002632
Assigned:  Hour 0.008989

Is there a different approach that would optimize this further? I am interested because, at the moment, I am discarding a potentially useful feature since it is too slow to add an extra 15,000 columns to the output.

Any general advice on the approach I am taking is also appreciated!

Thank you

One experiment would be to change over to x.loc[:, dummy_col] = df_col[dummy_col].values. When the input is a Series, pandas checks the alignment of the indices for each assignment. Assigning an ndarray skips that check when it's unnecessary, and that should improve performance.
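A minimal sketch of that experiment on toy data (the size and column names are made up; the actual speedup should be measured with %timeit on realistic sizes):

```python
import numpy as np
import pandas as pd

n = 100_000
x = pd.DataFrame(0, index=np.arange(n), columns=['a', 'b'])
src = pd.Series(np.random.randint(0, 2, n))

# Series assignment: pandas aligns src's index with x's index each time
x.loc[:, 'a'] = src

# ndarray assignment: positional, no index alignment
x.loc[:, 'b'] = src.values
```

Both assignments produce the same result here because src shares x's default RangeIndex; the difference is that the ndarray path skips the alignment machinery entirely.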
