
Optimizing pandas code for speed

I'm working on a recommendation system and am trying to develop a simple model to begin with, but I am having difficulty making the process fast. Right now I am trying to build two dataframes with approximate dimensions of (130000, 40000) that correspond to products (columns) ordered by users (rows), where each entry in a dataframe is an integer count of how many units of a product a user has bought. For x_df that count is taken over a number of orders, while for y_df it is taken over the user's last order.

From what I can tell by running scipy.stats.linregress(counts, times) after obtaining 100 or so data points, the run-time is linear in the number of users, with an r^2 value of 0.993. With 130,000 users that would mean this snippet of code would take around 36 hours, and I would still have to train an estimator on the dataframe after it is made! I haven't worked with a data set this large before, so I am not sure if this is to be expected, but given my lack of experience I imagine whatever I am doing can be done more efficiently. Any comments or suggestions would be appreciated.

I should probably clarify: The data set is set up so that order_products_prior contains the set of previous orders for the training users while order_products_train contains the final orders for those users.
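For concreteness, a hypothetical miniature of that layout (column names inferred from the code below; the real data has ~130,000 users) might look like this:

import pandas as pd

# User 1's prior order is 10 and their final (train) order is 11;
# similarly 20/21 for user 2.
orders = pd.DataFrame({'user_id':  [1, 1, 2, 2],
                       'order_id': [10, 11, 20, 21]})
order_products_prior = pd.DataFrame({'order_id':   [10, 10, 20],
                                     'product_id': [5, 7, 5]})
order_products_train = pd.DataFrame({'order_id':   [11, 21],
                                     'product_id': [7, 9]})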

def add_to_df(series, prod_list):
    # Increment the count for each product in prod_list;
    # products not yet in the series are added with a count of 1.
    for prod in prod_list:
        if prod in series.index:
            series[prod] += 1
        else:
            series[prod] = 1
    return series

import time

import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats

start_time = time.time()
count = 0
times = []
counts = []
for user in users_train:
    usr_series_x = pd.Series(dtype=int)
    usr_series_y = pd.Series(dtype=int)
    prod_list_x = []
    prod_list_y = []
    # All orders placed by this user.
    usr_orders = orders[orders['user_id'] == user]
    for ord_id in usr_orders['order_id']:
        # Products from the user's prior orders (x) and final order (y).
        prod_list_x.extend(order_products_prior[order_products_prior['order_id'] == ord_id]['product_id'])
        prod_list_y.extend(order_products_train[order_products_train['order_id'] == ord_id]['product_id'])
    # add_to_df returns the updated series, so the result must be kept.
    usr_series_x = add_to_df(usr_series_x, prod_list_x)
    usr_series_y = add_to_df(usr_series_y, prod_list_y)
    x_df.loc[user] = usr_series_x
    y_df.loc[user] = usr_series_y
    count += 1
    if count % 5 == 0:
        print("Percent Complete: {0}".format(count / len(users_train) * 100))
        print("--- %s seconds ---" % (time.time() - start_time))
        counts.append(count)
        times.append(time.time() - start_time)

plt.plot(counts, times)
stats.linregress(counts, times)

You are using pandas the wrong way. Pandas is very fast with vectorized operations like groupby, sum, pivot or value_counts. Please read this section first: https://pandas.pydata.org/pandas-docs/stable/groupby.html
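As a minimal sketch of that vectorized approach, assuming the frame and column names from the question, you could attach user_id to each order line with a merge and then count (user, product) pairs in one shot instead of looping over users:

import pandas as pd

# Attach user_id to every order line, then count (user, product) pairs.
prior = order_products_prior.merge(orders[['order_id', 'user_id']], on='order_id')
x_counts = prior.groupby(['user_id', 'product_id']).size()

train = order_products_train.merge(orders[['order_id', 'user_id']], on='order_id')
y_counts = train.groupby(['user_id', 'product_id']).size()

# unstack() pivots the (user, product) index into a dense user x product frame;
# note that at 130,000 x 40,000 a dense frame may not fit in memory.
x_df = x_counts.unstack(fill_value=0)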

I ended up figuring it out. I was definitely using pandas incorrectly; after some groupby operations I was able to get a single dataframe with the values of interest. Even from there, building the matrices I wanted was slow (it took about 1.5 hours), so I switched to scipy's csr matrices, which helped tremendously, bringing the time down to ~30 seconds.
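One plausible version of that last step, building on the x_counts sketch above (the variable names here are assumptions, not the poster's actual code):

import pandas as pd
from scipy.sparse import csr_matrix

# Flatten the grouped counts into columns: user_id, product_id, n.
counts = x_counts.reset_index(name='n')

# Map raw user/product ids to consecutive row/column indices.
user_codes, users = pd.factorize(counts['user_id'])
prod_codes, prods = pd.factorize(counts['product_id'])

# Build the sparse user x product count matrix directly from triplets.
x_sparse = csr_matrix((counts['n'].to_numpy(), (user_codes, prod_codes)),
                      shape=(len(users), len(prods)))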
