Does the dataset size influence a machine learning algorithm?

So, imagine having access to sufficient data (millions of datapoints for training and testing) of sufficient quality. Please ignore concept drift for now and assume the data is static and does not change over time. Does it even make sense to use all of that data in terms of the quality of the model?

Brain and Webb ( http://www.csse.monash.edu.au/~webb/Files/BrainWebb99.pdf ) report some results from experiments with different dataset sizes. The algorithms they tested converge to a reasonably stable model after training on 16,000 or 32,000 datapoints. However, since we are living in the big-data world we have access to datasets of millions of points, so the paper is somewhat relevant but hugely outdated.

Is there any known, more recent research on the impact of dataset size on learning algorithms (Naive Bayes, decision trees, SVM, neural networks, etc.)?

  1. When does a learning algorithm converge to a certain stable model for which more data does not increase the quality anymore?
  2. Can it happen after 50,000 datapoints, or maybe after 200,000 or only after 1,000,000?
  3. Is there a rule of thumb?
  4. Or maybe there is no way for an algorithm to converge to a stable model, to a certain equilibrium?

Why am I asking this? Imagine a system with limited storage and a huge number of unique models (thousands of models, each with its own unique dataset) and no way of increasing the storage. So limiting the size of each dataset is important.

Any thoughts or research on this?

I did my master's thesis on this subject so I happen to know quite a bit about it.

In short: in the first part of my master's thesis, I took some really big datasets (~5,000,000 samples) and tested some machine learning algorithms on them by learning on different percentages of the dataset (learning curves).

[Figure: learning-curve results on the HIGGS dataset]

One simplifying choice I made (I was using scikit-learn mostly) was not to optimize the parameters and to use the algorithms' default parameters (I had to make this choice for practical reasons; even without optimization, some simulations already took more than 24 hours on a cluster).
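Here is a minimal sketch of that setup, assuming scikit-learn and a synthetic dataset as a stand-in for the large datasets used in the thesis; the estimator keeps its default parameters, as described above.

```python
# Minimal sketch of the learning-curve setup described above, on a
# synthetic stand-in dataset (the real thesis experiments used datasets
# such as HIGGS with millions of samples). Default estimator parameters.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(),                 # default parameters
    X, y,
    train_sizes=np.linspace(0.01, 1.0, 10),   # 1% .. 100% of the training split
    cv=3,
    n_jobs=-1,
)

for n, score in zip(train_sizes, test_scores.mean(axis=1)):
    print(f"{n:>6d} samples -> mean CV accuracy {score:.3f}")
```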

The first thing to note is that, effectively, every method reaches a plateau once a certain portion of the dataset has been used. You cannot, however, draw conclusions about the number of samples it effectively takes to reach that plateau, for the following reasons:

  • Every dataset is different; really simple datasets can give you nearly everything they have to offer with 10 samples, while others still have something to reveal after 12,000 samples (see the HIGGS dataset in my example above).
  • The number of samples in a dataset is arbitrary; in my thesis I tested a dataset containing bad samples that were only added to mess with the algorithms.

We can, however, distinguish two types of algorithms that behave differently: parametric models (linear models, ...) and non-parametric models (random forests, ...). If a plateau is reached with a non-parametric model, that means the rest of the dataset is "useless". By contrast, while the Lightning method reaches a plateau very early in my picture, that doesn't mean the dataset has nothing left to offer; it only means that this is the best that method can do. That's why non-parametric methods work best when the model to learn is complicated and can really benefit from a large number of training samples.
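To make that contrast concrete, here is an illustrative sketch (synthetic data, not the thesis datasets): a parametric linear model and a non-parametric random forest are trained on growing subsets and evaluated on the same fixed test set; the point of interest is where each one stops improving.

```python
# Illustrative only (synthetic data, not the thesis datasets): train a
# parametric and a non-parametric model on growing subsets and watch
# where each one stops improving on a fixed held-out test set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=40_000, n_features=40,
                           n_informative=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = [
    ("linear (parametric)", lambda: LogisticRegression(max_iter=1000)),
    ("random forest (non-parametric)", lambda: RandomForestClassifier(random_state=0)),
]
for name, make_model in models:
    print(name)
    for n in (500, 2_000, 8_000, 30_000):
        clf = make_model().fit(X_tr[:n], y_tr[:n])
        acc = accuracy_score(y_te, clf.predict(X_te))
        print(f"  {n:>6d} training samples -> test accuracy {acc:.3f}")
```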

So, as for your questions:

  1. See above.

  2. Yes, it all depends on what is inside the dataset.

  3. For me, the only rule of thumb is to go with cross-validation. If you think you will use 20,000 or 30,000 samples, you are often in a case where cross-validation is not a problem. In my thesis, I computed the accuracy of my methods on a test set, and when I did not notice a significant improvement I determined the number of samples it took to get there (see the sketch after this list). As I said, there are some trends you can observe: parametric methods tend to saturate more quickly than non-parametric ones.

  4. Sometimes the dataset is not large enough, and you can take every datapoint you have and still have room for improvement if you had a larger dataset. In my thesis, with no optimisation of the parameters, the CIFAR-10 dataset behaved that way: even after 50,000 samples none of my algorithms had converged.
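To make point 3 concrete, here is a sketch of that kind of stopping rule, with a hypothetical helper (`samples_to_plateau`) and made-up numbers, not the exact thesis code: it reports the smallest training size whose test score is within a small tolerance of the best score observed.

```python
# Sketch of the stopping rule from point 3 (a hypothetical helper, not
# the exact thesis code): given test scores measured at increasing
# training sizes, return the smallest size whose score is within `tol`
# of the best score observed.
def samples_to_plateau(sizes, scores, tol=0.005):
    best = max(scores)
    for n, s in zip(sizes, scores):
        if s >= best - tol:
            return n
    return sizes[-1]

# Made-up numbers: accuracy stops improving noticeably around 16,000 samples.
sizes  = [1_000, 2_000, 4_000, 8_000, 16_000, 32_000, 64_000]
scores = [0.71,  0.76,  0.80,  0.83,  0.845,  0.847,  0.848]
print(samples_to_plateau(sizes, scores))   # -> 16000
```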

I'd add that optimizing the parameters of the algorithms has a big influence on the speed of convergence to a plateau, but it requires another step of cross-validation.
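For completeness, that extra cross-validation step could look like the following sketch; the parameter grid is an arbitrary illustration, not a recommendation.

```python
# Sketch of the extra tuning step: a small grid search with
# cross-validation before drawing the learning curve. The grid itself
# is an arbitrary illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 200], "max_depth": [None, 10]},
    cv=3,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```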

Your last sentence is highly related to the subject of my thesis, but for me it was more about the memory and time available for doing the ML tasks (if you use less than the whole dataset, you'll have a smaller memory requirement and training will be faster). On that note, the concept of "coresets" could be really interesting for you.
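As a very simple baseline for the fixed-storage scenario in your question (this is not a real coreset construction, just uniform subsampling), reservoir sampling keeps a bounded-size random subset of a stream of datapoints:

```python
# Not a coreset, just a baseline for a fixed storage budget: reservoir
# sampling keeps a uniform random subset of at most `capacity` items
# from a stream of unknown length (Algorithm R).
import random

def reservoir_sample(stream, capacity, seed=0):
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < capacity:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)   # inclusive on both ends
            if j < capacity:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), capacity=5))
```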

I hope I could help. I had to stop somewhere, because I could go on and on about this, but if you need more clarification I'd be happy to help.
