
Feeding lots of observations with many binary data features to TensorFlow

I have about 1.7 million observations. Each of them has about 4000 boolean features and 4 floating-point labels/targets. The features are sparse and approximately homogeneously distributed (about 150 of the 4000 boolean values are set to True per observation).

If I store the whole (1700000, 4000) matrix as a raw numpy file (npz format), it takes about 100 MB of disk space. If I load it via np.load(), it takes a few minutes and my RAM usage rises by about 7 GB, which is fine on its own.
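For reference, a minimal sketch of that save/load round trip (assuming np.savez_compressed, since the question only says "npz format"; the file name and array size are illustrative):

import numpy as np

# Illustrative size; the real matrix is (1700000, 4000) with ~150 True values per row.
features = np.random.rand(10000, 4000) < (150 / 4000)

np.savez_compressed("features.npz", features=features)   # compressed .npz stays small on disk
features = np.load("features.npz")["features"]           # decompresses to a dense bool array in RAM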

The problem is that I have to hand my boolean values over in a feed_dict to a tf.placeholder in order for the tf.data.Dataset to be able to use them. This process takes another 7 GB of RAM. My plan is to collect even more data in the future (it might become more than 10 million observations at some point).
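For context, this is roughly the placeholder/feed_dict pattern described above (a TF 1.x-style sketch; the variable names, shapes and batch size are illustrative, not from the original code):

import numpy as np
import tensorflow as tf

features = np.zeros((1000, 4000), dtype=np.bool_)   # stand-ins for the real arrays
targets = np.zeros((1000, 4), dtype=np.float32)

features_placeholder = tf.placeholder(tf.bool, shape=(None, 4000))
targets_placeholder = tf.placeholder(tf.float32, shape=(None, 4))

dataset = tf.data.Dataset.from_tensor_slices(
    (features_placeholder, targets_placeholder)).batch(128)
iterator = dataset.make_initializable_iterator()

with tf.Session() as sess:
    # Feeding the full arrays here materializes a second in-memory copy (~7 GB for the real data).
    sess.run(iterator.initializer,
             feed_dict={features_placeholder: features,
                        targets_placeholder: targets})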

Question: So how can I feed the data to my DNN (feed-forward, dense, not convolutional and not recurrent) without creating a bottleneck, in a way that is native to TensorFlow? I would have thought that this is a pretty standard setting and many people should have this problem, so why don't they? What do I do wrong or differently from the people who don't have the problem?


I heard the tfrecord format is well integrated with TensorFlow and is able to load lazily, but I think it is a bad idea to use that format for my feature structure, as it creates one Message per observation and saves the features as a map with string keys, repeated for every observation.
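For illustration, this is roughly the kind of per-observation record that concern refers to: one tf.train.Example Message per observation, whose feature map is keyed by strings (the key names here are hypothetical, and booleans have to be encoded as an int64 or bytes list):

import tensorflow as tf

example = tf.train.Example(features=tf.train.Features(feature={
    "features": tf.train.Feature(int64_list=tf.train.Int64List(value=[0, 1, 0, 1])),
    "targets": tf.train.Feature(float_list=tf.train.FloatList(value=[0.1, 0.2, 0.3, 0.4])),
}))
serialized = example.SerializeToString()   # the string keys are repeated in every serialized record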

I've found a solution, called tf.data.Dataset.from_generator.

This basically does the trick:

def generate_train_data(self, batch_size: int) -> typing.Iterable[typing.Tuple[np.ndarray, np.ndarray]]:
    row_id = 0
    features = self.get_features()
    targets = self.get_targets()
    test_amount = self.get_test_data_amount()
    # Iterate over the training rows only; the last `test_amount` rows are held out,
    # so stopping at features.shape[0] would yield empty batches at the end.
    while row_id < features.shape[0] - test_amount:
        limit = min(features.shape[0] - test_amount, row_id + batch_size)
        feature_batch = features[row_id:limit, :]
        target_batch = targets[row_id:limit, :]
        yield (feature_batch, target_batch)
        row_id += batch_size

And to create the tf.data.Dataset, something like:

train_data = tf.data.Dataset.from_generator(
    data.generate_train_data,
    # Features are booleans; the targets are floating-point values (see the question),
    # so their output type should be a float type rather than tf.bool.
    output_types=(tf.bool, tf.float32),
    output_shapes=(
        (None, data.get_feature_amount()),
        (None, data.get_target_amount()),
    ),
    args=(batch_size,),
).repeat()
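One way to consume this dataset in a training loop (a sketch using the TF 1.x iterator API; the session code is illustrative and the names follow the snippet above):

train_data = train_data.prefetch(1)                  # let the generator run ahead of the training step
iterator = train_data.make_one_shot_iterator()
feature_batch, target_batch = iterator.get_next()

with tf.Session() as sess:
    features_np, targets_np = sess.run([feature_batch, target_batch])   # fetches one batch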

Of course this does not shuffle the data yet, but that would be extremely easy to retrofit…
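One way to retrofit that, sketched as a hypothetical standalone generator rather than the author's method, is to permute the training rows once per pass and slice batches from the permuted order:

import numpy as np

def shuffled_train_batches(features, targets, batch_size, test_amount):
    train_rows = features.shape[0] - test_amount      # the last rows stay held out for testing
    order = np.random.permutation(train_rows)         # new random order on every call
    for start in range(0, train_rows, batch_size):
        rows = order[start:start + batch_size]
        yield features[rows], targets[rows]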
