How do you deal with a very large dataset when creating the matrix for a recommender system?
I am trying to create a matrix of transactions by product groups, but my transaction data is very large (over 10,000,000 rows), with around 100 product groups. When I try to create a pivot table using this code:
df.pivot(index='transaction_id', columns='product_group', values='ratings')
It raised this ValueError: "Unstacked DataFrame is too big, causing int32 overflow".
Is there any way to deal with this issue other than reducing the size of the data?
Thanks!
Convert your axis columns to categories:
df['transaction_id'] = df['transaction_id'].astype('category')
df['product_group'] = df['product_group'].astype('category')
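To see what these encodings look like, here is a small sketch on made-up data. The toy DataFrame and the names row_codes and col_codes are only for illustration and are not part of the original question. Each distinct label gets an integer code, and the codes index into .cat.categories:

```python
import pandas as pd

# Toy stand-in for the real transaction data (values are made up).
df = pd.DataFrame({
    "transaction_id": ["t3", "t1", "t3", "t2"],
    "product_group": ["books", "toys", "toys", "books"],
    "ratings": [5, 3, 4, 1],
})

df["transaction_id"] = df["transaction_id"].astype("category")
df["product_group"] = df["product_group"].astype("category")

# Integer position of each row's label within the sorted categories.
row_codes = df["transaction_id"].cat.codes
col_codes = df["product_group"].cat.codes

print(row_codes.tolist())                         # [2, 0, 2, 1]
print(list(df["transaction_id"].cat.categories))  # ['t1', 't2', 't3']
```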
Make a sparse matrix using the encodings:
from scipy.sparse import csr_matrix
arr = csr_matrix((df['ratings'].values, (df['transaction_id'].cat.codes, df['product_group'].cat.codes)))
Then you just have to keep track of the order of your axes (for example, df['transaction_id'].cat.categories will give you the labels that should be applied to the rows).
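Putting the steps together, a minimal end-to-end sketch might look like the following. The toy data and the helper rating() are hypothetical and only there to show how the saved category labels map back to matrix positions. Passing shape explicitly is an extra precaution added here, not something the steps above require:

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy stand-in for the real transaction data (values are made up).
df = pd.DataFrame({
    "transaction_id": ["t3", "t1", "t3", "t2"],
    "product_group": ["books", "toys", "toys", "books"],
    "ratings": [5, 3, 4, 1],
})
df["transaction_id"] = df["transaction_id"].astype("category")
df["product_group"] = df["product_group"].astype("category")

# Keep the axis labels; position i in each index is row/column i of the matrix.
row_labels = df["transaction_id"].cat.categories
col_labels = df["product_group"].cat.categories

arr = csr_matrix(
    (
        df["ratings"].to_numpy(),
        (
            df["transaction_id"].cat.codes.to_numpy(),
            df["product_group"].cat.codes.to_numpy(),
        ),
    ),
    shape=(len(row_labels), len(col_labels)),
)

def rating(transaction, group):
    """Hypothetical helper: look up one cell by its original labels."""
    i = row_labels.get_loc(transaction)
    j = col_labels.get_loc(group)
    return arr[i, j]

print(arr.shape)              # (3, 2)
print(rating("t3", "toys"))   # 4
print(rating("t1", "books"))  # 0, since no rating is stored for that pair
```

One thing to watch: if the same (transaction_id, product_group) pair appears more than once, csr_matrix adds the duplicate values together, so you may want to aggregate the ratings first if a sum is not what you need.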