
How do you save a Tensorflow dataset to a file?

There are at least two more questions like this on SO, but not a single one has been answered.

I have a dataset of the form:

<TensorSliceDataset shapes: ((512,), (512,), (512,), ()), types: (tf.int32, tf.int32, tf.int32, tf.int32)>

and another of the form:

<BatchDataset shapes: ((None, 512), (None, 512), (None, 512), (None,)), types: (tf.int32, tf.int32, tf.int32, tf.int32)>

I have looked and looked but I can't find the code to save these datasets to files that can be loaded later. The closest I got was this page in the TensorFlow docs, which suggests serializing the tensors using tf.io.serialize_tensor and then writing them to a file using tf.data.experimental.TFRecordWriter.

However, when I tried this using the code:

dataset.map(tf.io.serialize_tensor)
writer = tf.data.experimental.TFRecordWriter('mydata.tfrecord')
writer.write(dataset)

I get an error on the first line:

TypeError: serialize_tensor() takes from 1 to 2 positional arguments but 4 were given

How can I modify the above (or do something else) to accomplish my goal?

An issue was opened on GitHub, and it appears there's a new feature available in TF 2.3 to write datasets to disk:

https://www.tensorflow.org/api_docs/python/tf/data/experimental/save
https://www.tensorflow.org/api_docs/python/tf/data/experimental/load

I haven't tested this feature yet, but it seems to do what you want.
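For reference, a minimal, untested sketch of how it might be used on a dataset like yours (the "mydata" path is just a placeholder):

import tensorflow as tf

# Minimal sketch (untested, assuming TF >= 2.3): persist a dataset and load it back.
a = tf.zeros((100, 512), tf.int32)
ds = tf.data.Dataset.from_tensor_slices((a, a, a, a[:, 0]))

tf.data.experimental.save(ds, "mydata")

# Loading back needs the element_spec of the saved dataset.
restored = tf.data.experimental.load("mydata", element_spec=ds.element_spec)
print(restored.element_spec)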

TFRecordWriter seems to be the most convenient option, but unfortunately it can only write datasets with a single tensor per element. Here are a couple of workarounds you can use. First, since all your tensors have the same type and a similar shape, you can concatenate them all into one and split them back on load:

import tensorflow as tf

# Write
a = tf.zeros((100, 512), tf.int32)
ds = tf.data.Dataset.from_tensor_slices((a, a, a, a[:, 0]))
print(ds)
# <TensorSliceDataset shapes: ((512,), (512,), (512,), ()), types: (tf.int32, tf.int32, tf.int32, tf.int32)>
def write_map_fn(x1, x2, x3, x4):
    return tf.io.serialize_tensor(tf.concat([x1, x2, x3, tf.expand_dims(x4, -1)], -1))
ds = ds.map(write_map_fn)
writer = tf.data.experimental.TFRecordWriter('mydata.tfrecord')
writer.write(ds)

# Read
def read_map_fn(x):
    xp = tf.io.parse_tensor(x, tf.int32)
    # Optionally set shape
    xp.set_shape([1537])  # Do `xp.set_shape([None, 1537])` if using batches
    # Use `x[:, :512], ...` if using batches
    return xp[:512], xp[512:1024], xp[1024:1536], xp[-1]
ds = tf.data.TFRecordDataset('mydata.tfrecord').map(read_map_fn)
print(ds)
# <MapDataset shapes: ((512,), (512,), (512,), ()), types: (tf.int32, tf.int32, tf.int32, tf.int32)>

But, more generally, you can simply have a separate file per tensor and then read them all:

import tensorflow as tf

# Write
a = tf.zeros((100, 512), tf.int32)
ds = tf.data.Dataset.from_tensor_slices((a, a, a, a[:, 0]))
for i, _ in enumerate(ds.element_spec):
    ds_i = ds.map(lambda *args: args[i]).map(tf.io.serialize_tensor)
    writer = tf.data.experimental.TFRecordWriter(f'mydata.{i}.tfrecord')
    writer.write(ds_i)

# Read
NUM_PARTS = 4
parts = []
def read_map_fn(x):
    return tf.io.parse_tensor(x, tf.int32)
for i in range(NUM_PARTS):
    parts.append(tf.data.TFRecordDataset(f'mydata.{i}.tfrecord').map(read_map_fn))
ds = tf.data.Dataset.zip(tuple(parts))
print(ds)
# <ZipDataset shapes: (<unknown>, <unknown>, <unknown>, <unknown>), types: (tf.int32, tf.int32, tf.int32, tf.int32)>

It is possible to have the whole dataset in a single file with multiple separate tensors per element, namely as a file of TFRecords containing tf.train.Example messages, but I don't know if there is a way to create those within TensorFlow, that is, without having to get the data out of the dataset into Python and then write it to the records file.
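For what it's worth, here is a rough, untested sketch of that "through Python" route, writing one tf.train.Example per element (the feature names and the file name are placeholders):

import tensorflow as tf

# Rough sketch: iterate the dataset eagerly and store each tensor as serialized bytes.
a = tf.zeros((100, 512), tf.int32)
ds = tf.data.Dataset.from_tensor_slices((a, a, a, a[:, 0]))

def _bytes_feature(t):
    return tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[tf.io.serialize_tensor(t).numpy()])
    )

with tf.io.TFRecordWriter('mydata.examples.tfrecord') as writer:
    for x1, x2, x3, x4 in ds:  # eager iteration, outside the tf.data pipeline
        example = tf.train.Example(features=tf.train.Features(feature={
            'x1': _bytes_feature(x1),
            'x2': _bytes_feature(x2),
            'x3': _bytes_feature(x3),
            'x4': _bytes_feature(x4),
        }))
        writer.write(example.SerializeToString())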

To add to Yoan's answer:

The tf.data.experimental.save() and load() API works well. You also need to MANUALLY save the ds.element_spec to disk to be able to load() it later / within a different context.

Pickling works well for me:

1- Saving:

import pickle
import tensorflow as tf

tf.data.experimental.save(
    ds, tf_data_path, compression='GZIP'
)
with open(tf_data_path + '/element_spec', 'wb') as out_:  # also save the element_spec to disk for future loading
    pickle.dump(ds.element_spec, out_)

2- For loading, you need both the folder path with the tf shards and the element_spec that we manually pickled:

with open(tf_data_path + '/element_spec', 'rb') as in_:
    es = pickle.load(in_)

loaded = tf.data.experimental.load(
    tf_data_path, es, compression='GZIP'
)

I have been working on this issue as well, and so far I have written the following util (which can also be found in my repo):

import pathlib
from functools import partial
from pathlib import Path
from typing import Callable, Union

import tensorflow as tf

# `transform` is a helper (defined elsewhere in the repo) that applies the given
# per-key functions to each dict element of the dataset.

def cache_with_tf_record(filename: Union[str, pathlib.Path]) -> Callable[[tf.data.Dataset], tf.data.TFRecordDataset]:
    """
    Similar to tf.data.Dataset.cache but writes a tf record file instead. Compared to the base .cache method, it also ensures that the whole
    dataset is cached.
    """

    def _cache(dataset):
        if not isinstance(dataset.element_spec, dict):
            raise ValueError(f"dataset.element_spec should be a dict but is {type(dataset.element_spec)} instead")
        Path(filename).parent.mkdir(parents=True, exist_ok=True)
        with tf.io.TFRecordWriter(str(filename)) as writer:
            for sample in dataset.map(transform(**{name: tf.io.serialize_tensor for name in dataset.element_spec.keys()})):
                writer.write(
                    tf.train.Example(
                        features=tf.train.Features(
                            feature={
                                key: tf.train.Feature(bytes_list=tf.train.BytesList(value=[value.numpy()]))
                                for key, value in sample.items()
                            }
                        )
                    ).SerializeToString()
                )
        return (
            tf.data.TFRecordDataset(str(filename), num_parallel_reads=tf.data.experimental.AUTOTUNE)
            .map(
                partial(
                    tf.io.parse_single_example,
                    features={name: tf.io.FixedLenFeature((), tf.string) for name in dataset.element_spec.keys()},
                ),
                num_parallel_calls=tf.data.experimental.AUTOTUNE,
            )
            .map(
                transform(
                    **{name: partial(tf.io.parse_tensor, out_type=spec.dtype) for name, spec in dataset.element_spec.items()}
                )
            )
            .map(
                transform(**{name: partial(tf.ensure_shape, shape=spec.shape) for name, spec in dataset.element_spec.items()})
            )
        )

    return _cache

With this util, I can do:

dataset.apply(cache_with_tf_record("filename")).map(...)

and also load the dataset directly for later use, with only the second (reading) part of the util.
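For illustration, a rough sketch of what that read-only path can look like on its own, with a hard-coded spec standing in for dataset.element_spec and the transform helper (names, shapes, and the file name are placeholders):

import tensorflow as tf

# Rough sketch: read back the tf record file without the caching half.
spec = {"x": tf.TensorSpec((512,), tf.int32), "y": tf.TensorSpec((), tf.int32)}

def parse(record):
    parsed = tf.io.parse_single_example(
        record, features={name: tf.io.FixedLenFeature((), tf.string) for name in spec}
    )
    return {
        name: tf.ensure_shape(tf.io.parse_tensor(parsed[name], out_type=s.dtype), s.shape)
        for name, s in spec.items()
    }

ds = tf.data.TFRecordDataset("filename").map(parse, num_parallel_calls=tf.data.experimental.AUTOTUNE)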

I am still working on it, so it may change later on, especially to serialize with the correct types instead of all bytes to save space (I guess).

You can use tf.data.experimental.save and tf.data.experimental.load like this:

Code to save it:

import pickle
import tensorflow as tf

tf_dataset = get_dataset()    # returns a tf.data.Dataset
tf.data.experimental.save(dataset=tf_dataset, path="path/to/desired/save/file_name")
with open("path/to/desired/save/file_name" + ".pickle", 'wb') as file:
    pickle.dump(tf_dataset.element_spec, file)   # I need this for opening it later

Code to load it:

with open("path/to/desired/save/file_name" + ".pickle", 'rb') as file:
    element_spec = pickle.load(file)

tensor_data = tf.data.experimental.load("path/to/desired/save/file_name", element_spec=element_spec)
