Tensorflow partitioned csv input_fn

The problem, to sum it up, is that I have the data I want to use for training split into a lot of smaller csvs (feat-01.csv, feat-02.csv, etc.). I am trying to feed these to an Estimator, more exactly to do this via some sort of input_fn.

My ideal solution would've been to have some sort of input function that takes a dask.Dataframe (which is pretty much how I generated my data until now) and batches it to the estimator. I tried something along these lines:

import tensorflow as tf
import dask.dataframe as dd

ddf = dd.read_csv('feat-*.csv')
tf.contrib.learn.extract_dask_data(ddf)

However, this fails with:

TypeError: Expected `meta` to specify type DataFrame, got type Index

I kinda gave up on this idea due to the lack of documentation about using dask directly, although some docstrings seem to point out that it should be possible. I was thinking about making an input_fn to feed it directly from the csv files, but I found no specific examples about this use case either.

Being a bit of a TF noob, I was wondering what the cleanest method to accomplish this is.

UPDATE: After trying fruitlessly to implement it via dask, I gave up on the idea, both out of frustration and because the overhead might be a little too much.

I implemented an input function using tf's queues with pretty good results. Here is the code. Although slightly more complicated than simply passing dataframes to the estimator as I had in mind, doing all the work inside tensorflow seems the most elegant approach.
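A minimal sketch of such a queue-based input_fn (TF 1.x queue API), assuming each shard has a header row and a schema of three float feature columns plus a float label; the names f1/f2/f3 and the record defaults below are placeholders for whatever the feat-*.csv files actually contain:

import tensorflow as tf

def queue_csv_input_fn(file_pattern='feat-*.csv', batch_size=128):
    # Resolve the partitioned CSV shards and feed them through a filename queue.
    filenames = tf.gfile.Glob(file_pattern)
    filename_queue = tf.train.string_input_producer(filenames, shuffle=True)

    # Read one line at a time, skipping each file's header row.
    reader = tf.TextLineReader(skip_header_lines=1)
    _, line = reader.read(filename_queue)

    # One default per CSV column; the last column is assumed to be the label.
    record_defaults = [[0.0], [0.0], [0.0], [0.0]]
    f1, f2, f3, label = tf.decode_csv(line, record_defaults=record_defaults)

    # shuffle_batch starts background threads that fill a queue and emit batches.
    f1_b, f2_b, f3_b, label_b = tf.train.shuffle_batch(
        [f1, f2, f3, label],
        batch_size=batch_size,
        capacity=10 * batch_size,
        min_after_dequeue=2 * batch_size)

    return {'f1': f1_b, 'f2': f2_b, 'f3': f3_b}, label_b

The function can then be handed to the estimator's training call as input_fn=lambda: queue_csv_input_fn('feat-*.csv').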

FINAL UPDATE: Shortly after I posted this question, tensorflow 1.4 was released, and with it the Dataset API was officially supported and better documented. If anyone is still interested in this question, I advise you to check out this paragraph from the TF documentation.
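For the same partitioned-csv setup, a minimal tf.data sketch (TF 1.4+) might look like the following; as above, the column names, record defaults, and the per-file header row are assumptions about the schema, not part of the original post:

import tensorflow as tf

def dataset_csv_input_fn(file_pattern='feat-*.csv', batch_size=128, num_epochs=None):
    record_defaults = [[0.0], [0.0], [0.0], [0.0]]
    feature_names = ['f1', 'f2', 'f3']

    def parse_line(line):
        # Split a CSV line into column tensors; the last column is the label.
        columns = tf.decode_csv(line, record_defaults=record_defaults)
        return dict(zip(feature_names, columns[:-1])), columns[-1]

    # Build one TextLineDataset per shard so each file's header row can be skipped.
    filenames = tf.data.Dataset.from_tensor_slices(tf.gfile.Glob(file_pattern))
    dataset = (filenames
               .flat_map(lambda f: tf.data.TextLineDataset(f).skip(1))
               .map(parse_line)
               .shuffle(buffer_size=10000)
               .repeat(num_epochs)
               .batch(batch_size))

    return dataset.make_one_shot_iterator().get_next()

Returning the (features, labels) tensors from the one-shot iterator keeps the signature compatible with the Estimator's input_fn contract.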
