How do I use a custom data model in Deeplearning4j?

Question

The basic problem is creating a DataSetIterator from a custom data model, for use in a Deeplearning4j network.

The data model I am working with is a Java class that holds a set of doubles built from quotes on a specific stock: timestamp, open, close, high, low, volume, technical indicator 1, technical indicator 2, and so on. I query an internet source (example, plus several other indicators from the same site) that returns JSON strings. I convert these into my data model for easier access and store them in an SQLite database.

Now I have a List of these data models that I would like to use to train an LSTM network, with each double being a feature. Per the Deeplearning4j documentation and several examples, the way to use training data is to follow the ETL processes they describe to create a DataSetIterator, which the network then consumes.

I don't see a clean way to convert my data model using any of the provided RecordReaders without first converting it to some other format, such as a CSV or other file. I would like to avoid this because it would use up a lot of resources. It seems like there should be a better way to handle this simple case. Is there a better approach that I am just missing?

Answer 1

Ethan!

First of all, Deeplearning4j uses ND4J as its backend, so your data will eventually have to be converted into INDArray objects to be used in your model. If your training data is two arrays of doubles, inputsArray and desiredOutputsArray, you can do the following:

INDArray inputs = Nd4j.create(inputsArray, new int[]{numSamples, inputDim});
INDArray desiredOutputs = Nd4j.create(desiredOutputsArray, new int[]{numSamples, outputDim});
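
Since the question starts from a List of model objects rather than flat arrays, here is a minimal sketch of how inputsArray could be built without going through a CSV file. The Quote class and its field names are hypothetical stand-ins for the asker's data model, and the layout assumes the default row-major ordering, where sample i occupies positions i * inputDim to (i + 1) * inputDim - 1.

```java
import java.util.List;

final class QuoteFlattener {

    /** Hypothetical stand-in for the asker's data model; field names are assumptions. */
    public static final class Quote {
        final double open, close, high, low, volume;

        public Quote(double open, double close, double high, double low, double volume) {
            this.open = open;
            this.close = close;
            this.high = high;
            this.low = low;
            this.volume = volume;
        }

        /** The features of one sample, always in the same order. */
        public double[] features() {
            return new double[] {open, close, high, low, volume};
        }
    }

    /**
     * Packs the quotes into one row-major array of length numSamples * inputDim.
     * Sample i starts at index i * inputDim.
     */
    public static double[] flatten(List<Quote> quotes, int inputDim) {
        double[] flat = new double[quotes.size() * inputDim];
        for (int i = 0; i < quotes.size(); i++) {
            double[] row = quotes.get(i).features();
            if (row.length != inputDim) {
                throw new IllegalArgumentException("expected " + inputDim + " features, got " + row.length);
            }
            System.arraycopy(row, 0, flat, i * inputDim, inputDim);
        }
        return flat;
    }
}
```

With this in place, inputsArray would be QuoteFlattener.flatten(quotes, 5) and numSamples would be quotes.size(). The array of desired outputs can be built the same way from whatever values you want the network to predict.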

And then you can train your model using those arrays directly:

for (int epoch = 0; epoch < nEpochs; epoch++)
    model.fit(inputs, desiredOutputs);

Alternatively, you can create a DataSet object and use it for training:

DataSet ds = new DataSet(inputs, desiredOutputs);
for (int epoch = 0; epoch < nEpochs; epoch++)
    model.fit(ds);

But creating a custom iterator is the safest approach, especially for larger data sets, since it gives you more control over your data and keeps things organized.

Pass your data to your DataSetIterator implementation, and in its next() method return a DataSet object containing the next batch of your training data. It would look like this:

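import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dataset.DataSet;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
import org.nd4j.linalg.factory.Nd4j;
import org.nd4j.linalg.indexing.NDArrayIndex;

public class MyCustomIterator implements DataSetIterator {
    private INDArray inputs, desiredOutputs;
    private int itPosition = 0; // the iterator position in the set.

    // The arrays are flat: numSamples * inputDim and numSamples * outputDim values.
    public MyCustomIterator(double[] inputsArray,
                            double[] desiredOutputsArray,
                            int numSamples,
                            int inputDim,
                            int outputDim) {
        inputs = Nd4j.create(inputsArray, new int[]{numSamples, inputDim});
        desiredOutputs = Nd4j.create(desiredOutputsArray, new int[]{numSamples, outputDim});
    }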

    public DataSet next(int num) {
        // get a view containing the next num samples and desired outs.
        INDArray dsInput = inputs.get(
            NDArrayIndex.interval(itPosition, itPosition + num),
            NDArrayIndex.all());
        INDArray dsDesired = desiredOutputs.get(
            NDArrayIndex.interval(itPosition, itPosition + num),
            NDArrayIndex.all());

        itPosition += num;

        return new DataSet(dsInput, dsDesired);
    }

    // implement the remaining virtual methods...

}
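
Two of the remaining methods are hasNext() and reset(), and next(num) as written reads past the end of the data whenever num does not divide the number of samples evenly. The position bookkeeping for all three can be sketched in plain Java as below. BatchCursor is a hypothetical helper name, not part of Deeplearning4j, and this is only one way to handle the final short batch.

```java
import java.util.NoSuchElementException;

/** Tracks the iterator position over numSamples rows; plain Java, no ND4J types. */
final class BatchCursor {
    private final int numSamples;
    private int position = 0;

    public BatchCursor(int numSamples) {
        if (numSamples < 0) {
            throw new IllegalArgumentException("numSamples must be >= 0");
        }
        this.numSamples = numSamples;
    }

    /** True while at least one sample has not been returned yet. */
    public boolean hasNext() {
        return position < numSamples;
    }

    /** Rewinds to the first sample so the next epoch starts from the top. */
    public void reset() {
        position = 0;
    }

    /**
     * Returns {from, to} for the next batch of at most num samples, with to
     * exclusive, and advances the position. The last batch is clamped so that
     * to never exceeds numSamples.
     */
    public int[] nextInterval(int num) {
        if (num <= 0) {
            throw new IllegalArgumentException("num must be > 0");
        }
        if (!hasNext()) {
            throw new NoSuchElementException("no samples left; call reset()");
        }
        int from = position;
        int to = Math.min(position + num, numSamples);
        position = to;
        return new int[] {from, to};
    }
}
```

Inside the iterator, the returned pair would be passed to NDArrayIndex.interval(from, to) in place of itPosition and itPosition + num, and the iterator's own hasNext() and reset() would delegate to the cursor.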

The NDArrayIndex methods you see above are used to access parts of an INDArray. Now you can use the iterator for training:

MyCustomIterator it = new MyCustomIterator(
    inputsArray,
    desiredOutputsArray,
    numSamples,
    inputDim,
    outputDim);

for (int epoch = 0; epoch < nEpochs; epoch++)
    model.fit(it);

This example will be particularly useful to you, since it implements an LSTM network and has a custom iterator implementation (which can serve as a guide for implementing the remaining methods). Also, for more information on NDArray, this is helpful. It gives detailed information on creating, modifying and accessing parts of an NDArray.
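
One more point for the LSTM case: the arrays above are rank 2 (samples by features), while Deeplearning4j's recurrent layers by default take rank-3 input ordered as [miniBatch, features, timeSteps]. A minimal sketch of cutting one quote series into overlapping windows with that ordering is shown below. SequenceWindows and the window length are assumptions for illustration, and the sliding-window scheme is just one common choice, not something the original answer prescribes.

```java
final class SequenceWindows {

    /**
     * series is [timeSteps][features], ordered oldest first.
     * Returns [numWindows][features][windowLength], where window w covers
     * time steps w through w + windowLength - 1 of the series.
     */
    public static double[][][] toWindows(double[][] series, int windowLength) {
        if (windowLength <= 0 || windowLength > series.length) {
            throw new IllegalArgumentException("windowLength must be in 1.." + series.length);
        }
        int numWindows = series.length - windowLength + 1;
        int numFeatures = series[0].length;
        double[][][] out = new double[numWindows][numFeatures][windowLength];
        for (int w = 0; w < numWindows; w++) {
            for (int t = 0; t < windowLength; t++) {
                for (int f = 0; f < numFeatures; f++) {
                    // Feature index comes before time index in the output.
                    out[w][f][t] = series[w + t][f];
                }
            }
        }
        return out;
    }
}
```

The resulting array can then be copied into a rank-3 INDArray and batched along its first dimension in the same way as the rank-2 case.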

Answer 2

Deeplearning4j creator here.

Except in very special settings, you should not create your own data set iterator. You should be using DataVec. We cover this in numerous places, ranging from our DataVec page to our examples: https://deeplearning4j.konduit.ai/datavec/overview https://github.com/eclipse/deeplearning4j-examples

DataVec is our dedicated library for doing data transformations. You create custom record readers for your use case. For legacy reasons, Deeplearning4j has a few "special" iterators for certain datasets. Many of those came before DataVec existed. We built DataVec as a way of preprocessing data.

Now you use the RecordReaderDataSetIterator, the SequenceRecordReaderDataSetIterator (see our javadoc for more information) and their multi-dataset equivalents.

If you do this, you don't have to worry about masking, thread safety, or anything else that involves fast loading of data.

As an aside, I would love to know where you got the idea to create your own iterator; we now state right in our README not to do that. If you were looking somewhere else where this is not obvious, we would love to fix that.

Edit: I've updated the links to the new pages. This post is very old now. Please see the new links here:

https://deeplearning4j.konduit.ai/datavec/overview
https://github.com/eclipse/deeplearning4j-examples

2 answers

Answer 1: 9 (accepted), 2018-03-04 17:09:28

Answer 2: 5, 2018-03-04 23:56:24