
Deeplearning4j Splitting datasets for test and train

Deeplearning4j has functions to support splitting datasets into test and train, as well as mechanisms for shuffling datasets; however, as far as I can tell, either they don't work or I'm doing something wrong.

Example:

    DataSetIterator iter = new IrisDataSetIterator(150, 150);
    DataSet next = iter.next();
    // next.shuffle();
    SplitTestAndTrain testAndTrain = next.splitTestAndTrain(120, new Random(seed));
    DataSet train = testAndTrain.getTrain();
    DataSet test = testAndTrain.getTest();

    for (int i = 0; i < 30; i++) {
        String features = test.getFeatures().getRow(i).toString();
        String actual = test.getLabels().getRow(i).toString().trim();
        log.info("features " + features + " -> " + actual );
    }

This returns the last 30 rows of the input dataset; the Random(seed) parameter to splitTestAndTrain seems to have been ignored completely.

If, instead of passing the random seed to splitTestAndTrain, I uncomment the next.shuffle() line, then oddly the 3rd and 4th features get shuffled while the existing order of the 1st and 2nd features and the labels is preserved, which is even worse than not shuffling the input at all.

So... the question is, am I using it wrong, or is Deeplearning4j just inherently broken?

Bonus question: if Deeplearning4j is broken for something as simple as generating test and sample datasets, should it be trusted with anything at all? Or would I be better off using a different library?

Deeplearning4j assumes that datasets are minibatches, i.e. they are not all in memory. This contrasts with the Python world, which tends to optimize more for smaller datasets and ease of use.

That only works for toy problems and does not scale well to real ones. Instead, we optimize for the DataSetIterator interface for local scenarios (note that this will be different for distributed systems like Spark).

This means we rely on datasets either being split beforehand, using DataVec to parse the dataset (hint: do not write your own iterator — use ours and use DataVec for custom parsing), or on using a DataSetIteratorSplitter for the train/test split: https://deeplearning4j.org/doc/org/deeplearning4j/datasets/iterator/DataSetIteratorSplitter.html
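A minimal sketch of what that wiring looks like, assuming the (baseIterator, totalBatches, ratio) constructor described in the Javadoc linked above; the class name `SplitterExample`, the `myIterator` placeholder, and the 0.8 ratio are my own illustrative choices, not DL4J defaults:

```java
import org.deeplearning4j.datasets.iterator.DataSetIteratorSplitter;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

public class SplitterExample {

    // Sketch only: wraps a batched iterator without loading everything into
    // memory. totalBatches is how many minibatches the base iterator yields;
    // the last argument is the fraction routed to the training side.
    public static DataSetIteratorSplitter makeSplitter(DataSetIterator base, long totalBatches) {
        return new DataSetIteratorSplitter(base, totalBatches, 0.8);
    }

    // Usage (sketch):
    //   DataSetIteratorSplitter splitter = makeSplitter(myIterator, 100);
    //   model.fit(splitter.getTrainIterator());
    //   model.evaluate(splitter.getTestIterator());
}
```

The point of the splitter is that both halves remain iterators, so the minibatch-at-a-time assumption described above is preserved.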

The DataSet splitTestAndTrain approach only works if the dataset is already entirely in memory, which may not make sense for most semi-realistic problems (e.g. anything beyond XOR or MNIST).

I recommend running your ETL step once rather than every time. Pre-shuffle your dataset into pre-sliced batches. One way to do this is with a combination of: https://github.com/deeplearning4j/deeplearning4j/blob/master/nd4j/nd4j-backends/nd4j-tests/src/test/java/org/nd4j/linalg/dataset/BalanceMinibatchesTest.java#L40 and: https://nd4j.org/doc/org/nd4j/linalg/dataset/ExistingMiniBatchDataSetIterator.html

Another reason to do this is reproducibility. If you want to do something like shuffle your iterator each epoch, you could try writing some code based on a combination of the above. Either way, I would try to handle your ETL and pre-create the vectors before you do training. Otherwise, you'll spend a lot of time on data loading with larger datasets.
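One simple pattern for a reproducible per-epoch shuffle (a sketch of the idea only — the `EpochShuffle` helper and seed scheme below are my own, not DL4J API) is to shuffle the minibatch indices with a seed derived from a base seed plus the epoch number, so every run visits the pre-saved batch files in the same, but per-epoch different, order:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class EpochShuffle {

    // Derives each epoch's batch order from a base seed so that every run of
    // the program visits the pre-sliced minibatch files in an identical order.
    public static List<Integer> epochOrder(int numBatches, long baseSeed, int epoch) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < numBatches; i++) {
            order.add(i);
        }
        // Same (baseSeed, epoch) -> same permutation; different epochs differ.
        Collections.shuffle(order, new Random(baseSeed + epoch));
        return order;
    }

    public static void main(String[] args) {
        System.out.println(epochOrder(5, 42L, 0));
        System.out.println(epochOrder(5, 42L, 1));
    }
}
```

You would then load the saved batch files (e.g. via something like ExistingMiniBatchDataSetIterator) in this index order each epoch.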

As far as I can tell, Deeplearning4j is simply broken. Ultimately I created my own implementation of splitTestAndTrain.

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dataset.DataSet;
import java.util.Random;
import org.nd4j.linalg.factory.Nd4j;

public class TestTrain {  
    protected DataSet test;
    protected DataSet train;

    public TestTrain(DataSet input, int splitSize, Random rng) {
        int inTest = 0;
        int inTrain = 0;
        int testSize = input.numExamples() - splitSize;

        INDArray train_features = Nd4j.create(splitSize, input.getFeatures().columns());
        INDArray train_outcomes = Nd4j.create(splitSize, input.numOutcomes());
        INDArray test_features  = Nd4j.create(testSize, input.getFeatures().columns());
        INDArray test_outcomes  = Nd4j.create(testSize, input.numOutcomes());

        // Selection sampling: take example i for the training set with
        // probability (training slots remaining) / (examples remaining),
        // which yields exactly splitSize training rows in a single pass.
        for (int i = 0; i < input.numExamples(); i++) {
            DataSet D = input.get(i);
            if (rng.nextDouble() < (splitSize - inTrain) / (double) (input.numExamples() - i)) {
                train_features.putRow(inTrain, D.getFeatures());
                train_outcomes.putRow(inTrain, D.getLabels());
                inTrain += 1;
            } else {
                test_features.putRow(inTest, D.getFeatures());
                test_outcomes.putRow(inTest, D.getLabels());
                inTest += 1;
            }
        }

        train = new DataSet(train_features, train_outcomes);
        test  = new DataSet(test_features, test_outcomes);
    }

    public DataSet getTrain() {
        return train;
    }

    public DataSet getTest() {
        return test;
    }
}

This works, but it does not give me confidence in the library. Still happy if someone else can provide a better answer, but for now this will have to do.
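For what it's worth, the loop in the class above is selection sampling (Knuth's Algorithm S): at step i an example is taken for training with probability (slots remaining) / (examples remaining), which guarantees exactly splitSize training rows. A standalone sketch of just that part, with no ND4J dependency:

```java
import java.util.Random;

public class SelectionSampling {

    // Returns a mask where mask[i] == true means example i goes to training.
    // Selecting with probability (k - chosen) / (n - i) at each step always
    // picks exactly k of the n items: the probability hits 1.0 when the
    // remaining slots equal the remaining items, and 0.0 once k are chosen.
    public static boolean[] sample(int n, int k, Random rng) {
        boolean[] mask = new boolean[n];
        int chosen = 0;
        for (int i = 0; i < n; i++) {
            if (rng.nextDouble() < (k - chosen) / (double) (n - i)) {
                mask[i] = true;
                chosen++;
            }
        }
        return mask;
    }

    public static void main(String[] args) {
        boolean[] mask = sample(150, 120, new Random(42));
        int count = 0;
        for (boolean b : mask) {
            if (b) count++;
        }
        System.out.println("train examples selected: " + count); // always exactly 120
    }
}
```

Each subset of size k is equally likely, so this doubles as a shuffle-free random split.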

As this question is outdated, for people who might find it: you can see some examples on GitHub, and the split can be done in a simple way:

DataSetIterator iterator = new RecordReaderDataSetIterator(recordReader,batchSize,labelIndex,numClasses);
DataSet allData = iterator.next();
allData.shuffle();
SplitTestAndTrain testAndTrain = allData.splitTestAndTrain(0.65);  //Use 65% of data for training

DataSet trainingData = testAndTrain.getTrain();
DataSet testData = testAndTrain.getTest();

Here you first create the iterator, iterate over all the data, shuffle it, and then split between test and train.
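On the resulting sizes: assuming the fractional overload takes the first ⌊fraction × numExamples⌋ rows for training (worth confirming against your version's Javadoc), a 0.65 split of Iris's 150 examples gives 97 training and 53 test rows. A tiny illustrative helper of my own, not DL4J API, showing that arithmetic:

```java
public class SplitSizes {

    // Hypothetical helper: computes {train, test} counts for a fractional
    // split, assuming the library floors fraction * numExamples for training.
    public static int[] sizes(int numExamples, double trainFraction) {
        int train = (int) (numExamples * trainFraction);
        return new int[] { train, numExamples - train };
    }

    public static void main(String[] args) {
        int[] s = sizes(150, 0.65);
        System.out.println("train=" + s[0] + " test=" + s[1]); // train=97 test=53
    }
}
```

Remember to call shuffle() before the split, as the example above does, since the fraction overload takes a contiguous prefix rather than a random subset.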

This is taken from this example.
