How do you deal with a very large dataset when creating the matrix for a recommender system?

I am trying to create a matrix of transactions by product groups, but my transaction data is very large (over 10,000,000 rows) and there are around 100 product groups. When I try to create a pivot table using this code:

df.pivot(index='transaction_id', columns='product_group', values='ratings')

It raised a ValueError: "Unstacked DataFrame is too big, causing int32 overflow"

Is there any way to deal with this issue other than decreasing the size of the data?

Thanks!

Convert your axes columns to categories:

df['transaction_id'] = df['transaction_id'].astype('category')
df['product_group'] = df['product_group'].astype('category')
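
To see what the conversion gives you, here is a minimal sketch on a toy frame (the data is made up for illustration; only the column names come from the question). Each distinct label gets an integer code, and `.cat.categories` holds the labels in code order:

```python
import pandas as pd

# Hypothetical toy data standing in for the real transactions table.
df = pd.DataFrame({
    "transaction_id": ["t3", "t1", "t3", "t2"],
    "product_group": ["books", "toys", "toys", "books"],
    "ratings": [5.0, 3.0, 4.0, 1.0],
})

df["transaction_id"] = df["transaction_id"].astype("category")
df["product_group"] = df["product_group"].astype("category")

# Integer code per row; these become the matrix row indices.
row_codes = df["transaction_id"].cat.codes
# Labels in code order: row_labels[code] is the original transaction_id.
row_labels = df["transaction_id"].cat.categories

print(list(row_codes))   # [2, 0, 2, 1]
print(list(row_labels))  # ['t1', 't2', 't3']
```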

Make a sparse matrix using the encodings:

from scipy.sparse import csr_matrix
arr = csr_matrix((df['ratings'].values, (df['transaction_id'].cat.codes, df['product_group'].cat.codes)))

Then you just have to keep track of the order of your axes. For example, df['transaction_id'].cat.categories gives you the labels that apply to the rows, in row order.
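
As a rough end-to-end sketch of keeping track of the axes, assuming toy data and a hypothetical helper named `lookup` (neither is from the original answer), you can map labels back to matrix positions through the category indexes. Passing an explicit `shape` is optional here, but it makes the intended dimensions visible. Note that `csr_matrix` sums entries that share the same (row, column) pair, so duplicate transaction/group rows would be added together rather than rejected.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy data; the real frame would have millions of rows.
df = pd.DataFrame({
    "transaction_id": ["t3", "t1", "t3", "t2"],
    "product_group": ["books", "toys", "toys", "books"],
    "ratings": [5.0, 3.0, 4.0, 1.0],
})

rows = df["transaction_id"].astype("category")
cols = df["product_group"].astype("category")

# Build the sparse matrix from (value, (row_code, col_code)) triples.
arr = csr_matrix(
    (df["ratings"].to_numpy(),
     (rows.cat.codes.to_numpy(), cols.cat.codes.to_numpy())),
    shape=(len(rows.cat.categories), len(cols.cat.categories)),
)

def lookup(transaction_id, product_group):
    """Return the stored rating for a label pair (0.0 if absent)."""
    i = rows.cat.categories.get_loc(transaction_id)
    j = cols.cat.categories.get_loc(product_group)
    return arr[i, j]

print(arr.shape)               # (3, 2)
print(lookup("t3", "books"))   # 5.0
print(lookup("t1", "books"))   # 0.0
```

Only the non-zero entries are stored, so memory grows with the number of ratings rather than with the full transactions-by-groups grid.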
