How to do cross-validation and hyperparameter tuning for a huge dataset?
I have a CSV file of 10+ GB. I used the chunksize parameter available in pandas.read_csv() to read and pre-process the data, and for training the model I want to use one of the online learning algorithms.
Normally, cross-validation and hyperparameter tuning are done on the entire training data set, and the model is then trained with the best hyperparameters. But with huge data, if I do the same on a chunk of the training data, how do I choose the hyperparameters?
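Roughly, this is the reading loop I mean (a minimal sketch; the file name, chunk size, and pre-processing step are placeholders):

```python
import pandas as pd

# Read the 10+ GB CSV in manageable pieces instead of all at once.
for chunk in pd.read_csv("train.csv", chunksize=100_000):
    chunk = chunk.dropna()  # placeholder pre-processing step
    # ... feed the chunk to an online learning algorithm here ...
```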
I believe you are looking for online learning algorithms like the ones mentioned in this link: Scaling Strategies for large datasets. You should use algorithms that support the partial_fit method to load these large datasets in chunks. You can also look at the strategies listed there to see which one helps you the best, since you haven't specified the exact problem or the algorithm that you are working on.
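To make the chunked tuning concrete, here is a minimal sketch (not taken from the link): it trains one SGDClassifier per candidate hyperparameter value on the same stream of chunks and compares them on a held-out chunk. The file name, column names, alpha grid, and the binary 0/1 labels are all assumptions.

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

# One candidate model per hyperparameter setting; all are trained
# incrementally on the same stream of chunks. The alpha values are examples.
candidates = {alpha: SGDClassifier(alpha=alpha, random_state=0)
              for alpha in (1e-5, 1e-4, 1e-3)}

classes = [0, 1]  # partial_fit needs all class labels known up front
holdout_X, holdout_y = None, None

for i, chunk in enumerate(pd.read_csv("train.csv", chunksize=100_000)):
    X = chunk.drop(columns="target").to_numpy()
    y = chunk["target"].to_numpy()
    if i == 0:
        # Keep the first chunk aside as a fixed validation set.
        holdout_X, holdout_y = X, y
        continue
    for model in candidates.values():
        # partial_fit updates each model without seeing the full dataset.
        model.partial_fit(X, y, classes=classes)

# Pick the hyperparameter whose model scores best on the held-out chunk.
scores = {alpha: accuracy_score(holdout_y, model.predict(holdout_X))
          for alpha, model in candidates.items()}
best_alpha = max(scores, key=scores.get)
print(scores, "best alpha:", best_alpha)
```

Holding out one chunk stands in for a validation fold. Alternatively, you can run an ordinary cross-validated grid search on a sample that fits in memory to narrow the grid, then train the final model incrementally on the full file with the chosen hyperparameters.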
EDIT: If you want to solve the class imbalance problem, you can try this: the imbalanced-learn library in Python.
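A minimal sketch of combining the two ideas, assuming binary 0/1 labels and that each chunk still fits in memory after resampling; RandomOverSampler is one of several resamplers in imbalanced-learn, and the file and column names are placeholders:

```python
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(random_state=0)
ros = RandomOverSampler(random_state=0)

for chunk in pd.read_csv("train.csv", chunksize=100_000):
    X = chunk.drop(columns="target").to_numpy()
    y = chunk["target"].to_numpy()
    # Rebalance each chunk before the incremental update.
    X_res, y_res = ros.fit_resample(X, y)
    model.partial_fit(X_res, y_res, classes=[0, 1])
```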