Julia ML: Is there a recommended data format for loading data into Flux, Knet, and other deep learning libraries?

Original question posted 2018-12-28 20:08:44. Tags: machine-learning, julia, flux-machine-learning

I use TensorFlow for deep learning work, but I am interested in some of the features Julia offers for ML. In TensorFlow there is a clear standard: protocol buffers, in the form of the TFRecords format, are the best way to load sizable datasets onto GPUs for model training. I have been reading the Flux and Knet documentation, as well as forum posts, to see whether there is any particular recommendation on the most efficient data format, but I have not found one.

My question is: is there a recommended data format for the Julia ML libraries to facilitate training? Put another way, are there any dataset formats that I should clearly avoid because of bad performance?

I know that there is a Protobuf.jl library, so users can still use protocol buffers. I was planning to use protocol buffers for now, since I could then use the same data format for both TensorFlow and Julia. However, I also found this interesting Reddit post in which a user describes not using protocol buffers at all and just using plain Julia Vectors.

https://www.reddit.com/r/MachineLearning/comments/994dl7/d_hows_julia_language_mit_for_ml/

I get that the Julia ML libraries are likely agnostic about the data storage format: whatever format the data is stored in, it gets decoded into some sort of vector or matrix anyway, so I can use whatever format I like. I just wanted to make sure I did not miss anything in the documentation about problems or low performance caused by using the wrong data storage format.

1 Answer

For in-memory use, just use arrays and vectors. They're just big contiguous lumps of memory with some metadata. You can't really get any better than that.
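
As a rough illustration of the in-memory case, here is a minimal sketch using only Base Julia. The sizes, the features-by-samples layout, the `Float32` element type, and all variable names are assumptions made for the example, not something Flux or Knet prescribes.

```julia
# Hypothetical dataset: 10 features, 1_000 samples, stored as features × samples.
# Julia arrays are column-major, so each sample (one column) is contiguous in memory.
X = rand(Float32, 10, 1_000)
y = rand(Float32, 1, 1_000)

batchsize = 128  # assumed mini-batch size

# Split the sample indices into consecutive ranges of at most `batchsize`.
batch_indices = Iterators.partition(1:size(X, 2), batchsize)

# `view` returns a window into the original array instead of copying it.
batches = [(view(X, :, idx), view(y, :, idx)) for idx in batch_indices]
```

Each element of `batches` is an `(inputs, targets)` pair of views, so no data is duplicated until a batch is actually moved elsewhere (for instance onto a GPU).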

For serializing to another Julia process, Julia will handle that for you, using the stdlib Serialization module.
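
A small sketch of what "Julia will handle that for you" can look like, assuming the stdlib Distributed module with a single local worker. The array and the choice of `sum` as the remote function are placeholders for illustration.

```julia
using Distributed

# Start one local worker process; `addprocs` returns the ids of the new workers.
pid = first(addprocs(1))

data = rand(Float32, 4, 1_000)  # hypothetical feature matrix

# `data` is serialized and sent to the worker automatically,
# and the result is serialized back to the calling process.
remote_total = remotecall_fetch(sum, pid, data)

# Shut the worker down again.
rmprocs(pid)
```

Nothing in the calling code mentions a format: the array is passed as an ordinary argument and the transfer happens behind the scenes.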

For serializing to disk, you should either just use Serialization.serialize (possibly compressed) or, if you think you might need to read the data from another program, or if you think you'll change Julia version before you're done with the data, use BSON.jl or Feather.jl.
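
For the plain `Serialization.serialize` route, a minimal round trip might look like the following. The file name, the temporary directory, and the array are hypothetical, and compression is left out to keep the sketch stdlib-only.

```julia
using Serialization

X = rand(Float32, 10, 1_000)                # hypothetical training data
path = joinpath(mktempdir(), "train.jls")   # hypothetical file name

# Write the array to disk in Julia's native serialization format.
open(path, "w") do io
    serialize(io, X)
end

# Read it back; `open(f, path)` passes the opened stream to `f`.
X_loaded = open(deserialize, path)
```

The loaded value is an ordinary `Matrix{Float32}` again, ready to be used as in-memory training data.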

In the near future, JLSO.jl will be a good option for replacing Serialization.