How to deal with a very large dataset when creating a matrix for a recommender system?

Question

I am trying to create a matrix of transactions by product groups, but my transaction data is very large (over 10,000,000 rows), with around 100 product groups. When I try to create a pivot table using this code

df.pivot(index='transaction_id', columns='product_group', values='ratings')

it raises a ValueError: "Unstacked DataFrame is too big, causing int32 overflow"

Is there any way to deal with this issue other than decreasing the size of the data?

Thanks!

Answer 1

Convert the columns that will become your axes to categories:

df['transaction_id'] = df['transaction_id'].astype('category')
df['product_group'] = df['product_group'].astype('category')
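
To see what the conversion gives you, here is a minimal sketch on made-up data (the labels 't1', 't2', 't3' are hypothetical, not from the question). Each distinct label gets an integer code, and those codes are what can later serve as row or column positions:

```python
import pandas as pd

# Toy stand-in for the real transaction column (hypothetical labels).
df = pd.DataFrame({"transaction_id": ["t3", "t1", "t3", "t2"]})
df["transaction_id"] = df["transaction_id"].astype("category")

# Distinct labels, sorted; position in this index is the integer code.
labels = list(df["transaction_id"].cat.categories)
# One integer code per row, pointing into `labels`.
codes = list(df["transaction_id"].cat.codes)

print(labels)  # ['t1', 't2', 't3']
print(codes)   # [2, 0, 2, 1]
```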

Make a sparse matrix using the encodings:

from scipy.sparse import csr_matrix
arr = csr_matrix((df['ratings'].values, (df['transaction_id'].cat.codes, df['product_group'].cat.codes)))

Then you just have to keep track of the order of your axes (for example, df['transaction_id'].cat.categories will give you the labels that should be applied to the rows).
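
Putting the steps together, a small end-to-end sketch might look like the following. The data values are invented for illustration, and passing an explicit shape is my own addition so the matrix dimensions always match the number of categories. Note that csr_matrix sums duplicate (row, column) entries, whereas df.pivot would raise on them, so deduplicate first if that matters for your ratings.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy data standing in for the real transaction table (hypothetical values).
df = pd.DataFrame({
    "transaction_id": ["t1", "t1", "t2", "t3"],
    "product_group": ["books", "games", "books", "music"],
    "ratings": [5.0, 3.0, 4.0, 2.0],
})

# Encode both axes as categoricals.
for col in ("transaction_id", "product_group"):
    df[col] = df[col].astype("category")

# Keep the labels so rows and columns can be mapped back later.
row_labels = df["transaction_id"].cat.categories
col_labels = df["product_group"].cat.categories

# Build the sparse matrix from (value, (row_code, col_code)) triples.
arr = csr_matrix(
    (
        df["ratings"].to_numpy(),
        (
            df["transaction_id"].cat.codes.to_numpy(),
            df["product_group"].cat.codes.to_numpy(),
        ),
    ),
    shape=(len(row_labels), len(col_labels)),
)

# Look up a single rating by its labels.
i = row_labels.get_loc("t2")
j = col_labels.get_loc("books")
rating = arr[i, j]
print(arr.shape, rating)
```

Only the non-empty cells are stored, so memory scales with the number of rating rows rather than with transactions times product groups.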

Answer 1
0 2021-06-11 19:56:57
