
How to do cross-validation and hyper-parameter tuning for a huge dataset?

I have a CSV file of 10+ GB. I used the chunksize parameter of pandas.read_csv() to read and pre-process the data, and for training the model I want to use one of the online learning algorithms.
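
A minimal sketch of how I read the file in chunks (the file name, chunk size and pre-processing step below are just placeholders):

import pandas as pd

# Read the 10+ GB file in manageable pieces instead of all at once.
for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
    chunk = chunk.fillna(0)          # example per-chunk pre-processing
    # ... hand the cleaned chunk to an incremental learner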

Normally, cross-validation and hyper-parameter tuning are done on the entire training data set, and the model is then trained with the best hyper-parameters. But with huge data, if I do the same on a chunk of the training data, how should I choose the hyper-parameters?
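
For reference, this is roughly what I mean by the usual in-memory workflow (a sketch using scikit-learn's GridSearchCV; the estimator, the toy data and the grid are only examples):

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Toy data standing in for a training set that fits in memory.
X, y = make_classification(n_samples=1_000, random_state=0)

# 5-fold cross-validation over a small example grid.
search = GridSearchCV(SGDClassifier(random_state=0),
                      {"alpha": [1e-4, 1e-3, 1e-2]}, cv=5)
search.fit(X, y)
print(search.best_params_)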

I believe you are looking for online learning algorithms like the ones mentioned in this link: Scaling Strategies for large datasets. You should use algorithms that support the partial_fit method so that these large datasets can be consumed in chunks. You can also look at the following links to see which one helps you best, since you haven't specified the exact problem or the algorithm you are working on:
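
As a rough sketch of how partial_fit can be combined with chunked reading to compare hyper-parameter candidates (the file name, the "target" column and the alpha grid are assumptions, not something specified in the question):

import pandas as pd
from sklearn.linear_model import SGDClassifier

# One online model per candidate value of alpha.
candidates = {a: SGDClassifier(alpha=a, random_state=0) for a in (1e-4, 1e-3, 1e-2)}
classes = [0, 1]                        # partial_fit needs every class label up front

reader = pd.read_csv("big_file.csv", chunksize=100_000)
val = next(reader)                      # hold out the first chunk for validation
X_val, y_val = val.drop(columns="target"), val["target"]

for chunk in reader:
    X, y = chunk.drop(columns="target"), chunk["target"]
    for model in candidates.values():
        model.partial_fit(X, y, classes=classes)

scores = {a: m.score(X_val, y_val) for a, m in candidates.items()}
print(max(scores, key=scores.get))      # alpha with the best held-out accuracy

Each candidate model sees every training chunk exactly once, and the held-out chunk plays the role of the validation fold you would normally get from cross-validation.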

EDIT: If you want to solve a class imbalance problem, you can try the imbalanced-learn library in Python.
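
A minimal sketch of under-sampling with imbalanced-learn (the toy data below stands in for one of your chunks):

from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Toy imbalanced data; in practice X, y would come from a chunk of the CSV.
X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=0)

rus = RandomUnderSampler(random_state=0)
X_res, y_res = rus.fit_resample(X, y)   # the two classes are now balanced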
