Optimizing pandas code for speed

I'm working on a recommendation system and want to start with a simple model, but I'm having trouble making the data preparation fast. Right now I am trying to build two dataframes, each with approximate dimensions (130000, 40000), where rows are users and columns are products, and each entry is an integer count of how many of that product the user has bought. For x_df the count is taken over a number of prior orders, while for y_df it is taken over the user's last order.

From running scipy.stats.linregress(counts, times) on 100 or so data points, the run time appears to be linear in the number of users processed, with an r^2 value of 0.993. With 130,000 users, that means this snippet would take around 36 hours, and I would still have to train an estimator on the dataframe afterwards! I haven't worked with a data set this large before and I'm not sure whether this is to be expected, but I imagine that, given my lack of experience, whatever I am doing can be done more efficiently. Any comments or suggestions would be appreciated.

I should probably clarify: the data set is set up so that order_products_prior contains the previous orders for the training users, while order_products_train contains the final orders for those users.

def add_to_df(series, prod_list):
    # Tally each product: bump an existing count, or start a new one at 1.
    for prod in prod_list:
        if prod in series.index:
            series.loc[prod] += 1
        else:
            # Series.set_value was removed from pandas; .loc adds the new label.
            series.loc[prod] = 1
    return series

import time

import pandas as pd
start_time = time.time()
count = 0
times = []
counts = []
for user in users_train:
    usr_series_x = pd.Series()
    usr_series_y = pd.Series()
    prod_list_x = []
    prod_list_y = []
    usr_orders = orders[orders['user_id']==user]
    for ord_id in usr_orders['order_id']:
        usr_order_products_prior = order_products_prior[order_products_prior['order_id']==ord_id]['product_id']
        for product in usr_order_products_prior:
            prod_list_x.append(product)
        usr_order_products_train = order_products_train[order_products_train['order_id']==ord_id]['product_id']
        for product in usr_order_products_train:
            prod_list_y.append(product)
    # Keep the returned series so the tallied counts are not lost.
    usr_series_x = add_to_df(usr_series_x, prod_list_x)
    usr_series_y = add_to_df(usr_series_y, prod_list_y)
    x_df.loc[user] = usr_series_x
    y_df.loc[user] = usr_series_y
    count += 1
    if count%5==0:
        print("Percent complete: {0}".format(count / len(users_train) * 100))
        print("--- %s seconds ---" % (time.time() - start_time))
        counts.append(count)
        times.append(time.time() - start_time)

import matplotlib.pyplot as plt
from scipy import stats

plt.plot(counts, times)
stats.linregress(counts, times)

You are using pandas the wrong way. Pandas is very fast with vectorized operations like groupby, sum, pivot, or value_counts. Please read this section first: https://pandas.pydata.org/pandas-docs/stable/groupby.html
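As a rough illustration of the groupby approach, the per-user loops in the question can be replaced by one merge and one groupby. This is only a sketch: the column names user_id, order_id and product_id come from the question, but the toy values, the n_bought column name, and the assumption that each row of order_products_prior represents one purchased unit are mine.

```python
import pandas as pd

# Toy stand-ins for the question's tables (values are made up for illustration).
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "user_id": [10, 10, 20],
})
order_products_prior = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 3],
    "product_id": [100, 101, 100, 101, 102],
})

# Attach user_id to every ordered product with a single merge,
# instead of filtering the orders table once per user.
merged = order_products_prior.merge(orders, on="order_id")

# Count rows per (user, product) pair; "n_bought" is a hypothetical column name.
counts = (
    merged.groupby(["user_id", "product_id"])
    .size()
    .reset_index(name="n_bought")
)
print(counts)
```

The result is in long format, one row per (user, product) pair that actually occurs, which stays small even when the full user-by-product grid would be huge.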

I ended up figuring it out. I was definitely using pandas incorrectly; after some groupby operations I was able to get a single data frame with the values of interest. Even from there, creating the matrices I wanted was slow (it took about 1.5 hours), so I decided to use scipy's CSR matrices, which helped tremendously and brought the time down to ~30 seconds.
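The exact code was not posted, so the following is only a guess at what the CSR step might look like. It assumes a long-format frame with one row per (user, product) pair; the column name n_bought and the toy values are hypothetical.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy long-format counts (made-up values): one row per (user, product) pair.
counts = pd.DataFrame({
    "user_id": [10, 10, 20, 20],
    "product_id": [100, 101, 101, 102],
    "n_bought": [2, 1, 1, 1],
})

# Map the raw ids to contiguous row and column positions.
user_codes, user_ids = pd.factorize(counts["user_id"], sort=True)
prod_codes, product_ids = pd.factorize(counts["product_id"], sort=True)

# Build the sparse user-by-product matrix from (value, (row, col)) triplets.
x = csr_matrix(
    (counts["n_bought"].to_numpy(), (user_codes, prod_codes)),
    shape=(len(user_ids), len(product_ids)),
)
print(x.toarray())
```

Only the nonzero counts are stored, which matters here: a dense (130000, 40000) matrix has about 5.2 billion entries, most of them zero. user_ids and product_ids keep the mapping from matrix positions back to the original ids.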
