
Feeding lots of observations with many binary data features to TensorFlow

I have about 1.7 million observations. Each of them has about 4000 boolean features and 4 floating point labels/targets. The features are sparse and roughly uniformly distributed: about 150 of the 4000 boolean values are True per observation.

If I store the whole (1700000, 4000) matrix as a raw NumPy file (.npz format), it takes about 100 MB of disk space. Loading it via np.load() takes a few minutes and raises my RAM usage by about 7 GB, which is fine on its own.
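
For context, the ~7 GB in RAM matches what a dense boolean matrix of that shape occupies in NumPy (one byte per value), while the .npz stays small because the data is highly compressible. A quick back-of-the-envelope check (only the shape comes from the question, the rest is illustration):

import numpy as np

# Rough check of the sizes quoted above.
n_obs, n_features = 1_700_000, 4_000

# NumPy stores np.bool_ as one byte per element, so the dense matrix needs:
dense_bytes = n_obs * n_features * np.dtype(np.bool_).itemsize
print(f"dense in RAM: ~{dense_bytes / 1e9:.1f} GB")   # ~6.8 GB

# On disk, a compressed archive (e.g. np.savez_compressed) shrinks this to the
# ~100 MB range because only ~150 of the 4000 values per row are True.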

The problem is that I have to hand the boolean values over via a feed_dict to a tf.placeholder so that the tf.data.Dataset can use them. This step takes another 7 GB of RAM. My plan is to collect even more data in the future (it might grow to more than 10 million observations at some point).
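
For clarity, this is roughly the pattern I mean (TF 1.x API; the file names and batch size are made up). Feeding the full arrays into the TensorFlow runtime is what doubles the memory:

import numpy as np
import tensorflow as tf

features = np.load("features.npy")            # (1700000, 4000) boolean matrix, ~7 GB
targets = np.load("targets.npy")              # (1700000, 4) float targets

features_ph = tf.placeholder(tf.bool, shape=features.shape)
targets_ph = tf.placeholder(tf.float32, shape=targets.shape)

dataset = tf.data.Dataset.from_tensor_slices((features_ph, targets_ph)).batch(1024)
iterator = dataset.make_initializable_iterator()

with tf.Session() as sess:
    # Initializing the iterator copies both arrays into the TensorFlow runtime,
    # which is where the second ~7 GB of RAM goes.
    sess.run(iterator.initializer,
             feed_dict={features_ph: features, targets_ph: targets})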

Question: So how can I feed the data to my DNN (feed-forward, dense, not convolutional and not recurrent) without creating a bottleneck, in a way that is native to TensorFlow? I would have thought this is a pretty standard setup and many people should run into the same problem, so why don't they? What am I doing wrong or differently from the people who don't have this problem?


I have heard that the TFRecord format is well integrated with TensorFlow and supports lazy loading, but I think it is a bad idea to use that format for my feature structure, since it creates one Message per observation and stores the features as a string-keyed map per observation.
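
To illustrate the structure I mean (key names and toy values are made up), each observation would become one tf.train.Example protobuf with a string-keyed feature map, and the serialized message is what a TFRecord file stores per row:

import tensorflow as tf

# One protobuf message per observation, features in a string-keyed map.
example = tf.train.Example(features=tf.train.Features(feature={
    "features": tf.train.Feature(
        int64_list=tf.train.Int64List(value=[0, 1, 0, 1])),        # the boolean row
    "targets": tf.train.Feature(
        float_list=tf.train.FloatList(value=[0.1, 0.2, 0.3, 0.4])),  # the 4 float targets
}))
serialized = example.SerializeToString()   # one record in the TFRecord file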

I've found a solution: tf.data.Dataset.from_generator.

This basically does the trick:

def generate_train_data(self, batch_size: int) -> typing.Iterable[typing.Tuple[np.ndarray, np.ndarray]]:
    features = self.get_features()
    targets = self.get_targets()
    # The last `test_amount` rows are held out for testing, so only iterate
    # over the training portion of the matrix.
    train_rows = features.shape[0] - self.get_test_data_amount()
    row_id = 0
    while row_id < train_rows:
        limit = min(train_rows, row_id + batch_size)
        # Basic slicing returns views, so no copy of the full matrix is made here.
        feature_batch = features[row_id:limit, :]
        target_batch = targets[row_id:limit, :]
        yield (feature_batch, target_batch)
        row_id += batch_size

And the tf.data.Dataset is created with something like:

train_data = tf.data.Dataset.from_generator(
    data.generate_train_data,
    # The targets are floating point (see above), so their output type should
    # be a float type rather than tf.bool.
    output_types=(tf.bool, tf.float32),
    output_shapes=(
        (None, data.get_feature_amount()),
        (None, data.get_target_amount()),
    ),
    args=(batch_size,),
).repeat()
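
For completeness, a usage sketch with the TF 1.x iterator API, pulling one batch out of the repeated dataset (the training step itself is not shown here):

iterator = train_data.make_one_shot_iterator()
next_features, next_targets = iterator.get_next()

with tf.Session() as sess:
    feature_batch, target_batch = sess.run([next_features, next_targets])
    print(feature_batch.shape, target_batch.shape)   # (batch_size, 4000), (batch_size, 4)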

Of course this does not shuffle the data yet, but that would be easy to retrofit…
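
One possible way to retrofit that (an assumption on my part, not code from the original post) is a sibling generator that permutes the training rows on each invocation, so every pass over the data (each re-invocation caused by .repeat()) visits the observations in a fresh random order. It assumes the same numpy/typing imports as the generator above:

def generate_train_data_shuffled(self, batch_size: int) -> typing.Iterable[typing.Tuple[np.ndarray, np.ndarray]]:
    features = self.get_features()
    targets = self.get_targets()
    train_rows = features.shape[0] - self.get_test_data_amount()
    # Draw a new permutation of the training rows for this pass.
    order = np.random.permutation(train_rows)
    for start in range(0, train_rows, batch_size):
        rows = order[start:start + batch_size]
        # Fancy indexing copies only the selected rows, not the whole matrix.
        yield features[rows, :], targets[rows, :]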
