
Feeding lots of observations with many binary data features to TensorFlow

I have about 1.7 million observations. Each of them has about 4000 boolean features and 4 floating point labels/targets. The features are sparse and roughly uniformly distributed: about 150 of the 4000 boolean values are True per observation.

If I store the whole (1700000, 4000) matrix as a raw NumPy file (.npz format), it takes about 100 MB of disk space. Loading it via np.load() takes a few minutes and raises my RAM usage by about 7 GB, which is fine on its own.
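
For context, the ~7 GB in RAM matches what a dense boolean matrix of that shape occupies in NumPy (one byte per value), while the .npz stays small because the data is highly compressible. A quick back-of-the-envelope check (only the shape comes from the question, the rest is illustration):

import numpy as np

# Rough check of the sizes quoted above.
n_obs, n_features = 1_700_000, 4_000

# NumPy stores np.bool_ as one byte per element, so the dense matrix needs:
dense_bytes = n_obs * n_features * np.dtype(np.bool_).itemsize
print(f"dense in RAM: ~{dense_bytes / 1e9:.1f} GB")   # ~6.8 GB

# On disk, a compressed archive (e.g. np.savez_compressed) shrinks this to the
# ~100 MB range because only ~150 of the 4000 values per row are True.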

The problem is that I have to hand the boolean values over via a feed_dict to a tf.placeholder so that the tf.data.Dataset can use them. This step takes another 7 GB of RAM. My plan is to collect even more data in the future (it might grow to more than 10 million observations at some point).
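
For clarity, this is roughly the pattern I mean (TF 1.x API; the file names and batch size are made up). Feeding the full arrays into the TensorFlow runtime is what doubles the memory:

import numpy as np
import tensorflow as tf

features = np.load("features.npy")            # (1700000, 4000) boolean matrix, ~7 GB
targets = np.load("targets.npy")              # (1700000, 4) float targets

features_ph = tf.placeholder(tf.bool, shape=features.shape)
targets_ph = tf.placeholder(tf.float32, shape=targets.shape)

dataset = tf.data.Dataset.from_tensor_slices((features_ph, targets_ph)).batch(1024)
iterator = dataset.make_initializable_iterator()

with tf.Session() as sess:
    # Initializing the iterator copies both arrays into the TensorFlow runtime,
    # which is where the second ~7 GB of RAM goes.
    sess.run(iterator.initializer,
             feed_dict={features_ph: features, targets_ph: targets})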

Question: So how can I feed the data to my DNN (feed-forward, dense, not convolutional and not recurrent) without creating a bottleneck, in a way that is native to TensorFlow? I would have thought this is a pretty standard setup and many people should run into the same problem, so why don't they? What am I doing wrong or differently from the people who don't have this problem?


I have heard that the TFRecord format is well integrated with TensorFlow and supports lazy loading, but I think it is a bad idea to use that format for my feature structure, since it creates one Message per observation and stores the features as a string-keyed map per observation.
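
To illustrate the structure I mean (key names and toy values are made up), each observation would become one tf.train.Example protobuf with a string-keyed feature map, and the serialized message is what a TFRecord file stores per row:

import tensorflow as tf

# One protobuf message per observation, features in a string-keyed map.
example = tf.train.Example(features=tf.train.Features(feature={
    "features": tf.train.Feature(
        int64_list=tf.train.Int64List(value=[0, 1, 0, 1])),        # the boolean row
    "targets": tf.train.Feature(
        float_list=tf.train.FloatList(value=[0.1, 0.2, 0.3, 0.4])),  # the 4 float targets
}))
serialized = example.SerializeToString()   # one record in the TFRecord file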

I've found a solution: tf.data.Dataset.from_generator.

This basically does the trick:

def generate_train_data(self, batch_size: int) -> typing.Iterable[typing.Tuple[np.ndarray, np.ndarray]]:
    features = self.get_features()
    targets = self.get_targets()
    # The last `test_amount` rows are held out for testing, so only iterate
    # over the training portion of the matrix.
    train_rows = features.shape[0] - self.get_test_data_amount()
    row_id = 0
    while row_id < train_rows:
        limit = min(train_rows, row_id + batch_size)
        # Basic slicing returns views, so no copy of the full matrix is made here.
        feature_batch = features[row_id:limit, :]
        target_batch = targets[row_id:limit, :]
        yield (feature_batch, target_batch)
        row_id += batch_size

And the tf.data.Dataset is created with something like:

train_data = tf.data.Dataset.from_generator(
    data.generate_train_data,
    # The targets are floating point (see above), so their output type should
    # be a float type rather than tf.bool.
    output_types=(tf.bool, tf.float32),
    output_shapes=(
        (None, data.get_feature_amount()),
        (None, data.get_target_amount()),
    ),
    args=(batch_size,),
).repeat()
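
For completeness, a usage sketch with the TF 1.x iterator API, pulling one batch out of the repeated dataset (the training step itself is not shown here):

iterator = train_data.make_one_shot_iterator()
next_features, next_targets = iterator.get_next()

with tf.Session() as sess:
    feature_batch, target_batch = sess.run([next_features, next_targets])
    print(feature_batch.shape, target_batch.shape)   # (batch_size, 4000), (batch_size, 4)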

Of course this does not shuffle the data yet, but that would be easy to retrofit…
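
One possible way to retrofit that (an assumption on my part, not code from the original post) is a sibling generator that permutes the training rows on each invocation, so every pass over the data (each re-invocation caused by .repeat()) visits the observations in a fresh random order. It assumes the same numpy/typing imports as the generator above:

def generate_train_data_shuffled(self, batch_size: int) -> typing.Iterable[typing.Tuple[np.ndarray, np.ndarray]]:
    features = self.get_features()
    targets = self.get_targets()
    train_rows = features.shape[0] - self.get_test_data_amount()
    # Draw a new permutation of the training rows for this pass.
    order = np.random.permutation(train_rows)
    for start in range(0, train_rows, batch_size):
        rows = order[start:start + batch_size]
        # Fancy indexing copies only the selected rows, not the whole matrix.
        yield features[rows, :], targets[rows, :]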
